Notes on Artificial Intelligence
Last Updated 06.02.2016
Francis Tseng (@frnsys)

Contents

Introduction

I Foundations

1 Functions
    1.0.1 Identity functions
    1.0.2 The inverse of a function
    1.0.3 Surjective functions
    1.0.4 Injective functions
    1.0.5 Surjective & injective functions
    1.0.6 Convex and non-convex functions
    1.0.7 Transcendental functions
    1.0.8 Logarithms

2 Other useful concepts
    2.1 Solving analytically vs numerically
    2.2 Linear vs nonlinear models
    2.3 Metrics
    2.4 References

3 Linear Algebra
    3.1 Vectors
        3.1.1 Real coordinate spaces
        3.1.2 Column and row vectors
        3.1.3 Transposing a vector
        3.1.4 Vector operations
        3.1.5 Norms
        3.1.6 Unit vectors
        3.1.7 Angles between vectors
        3.1.8 Perpendicular vectors
        3.1.9 Normal vectors
        3.1.10 Orthonormal vectors
        3.1.11 Additional vector operations
    3.2 Linear Combinations
        3.2.1 Parametric representations of lines
        3.2.2 Linear combinations
        3.2.3 Spans
        3.2.4 Linear independence
    3.3 Matrices
        3.3.1 Matrix operations
        3.3.2 Hadamard product
        3.3.3 The identity matrix
        3.3.4 Diagonal matrices
        3.3.5 Triangular matrices
        3.3.6 Some properties of matrices
        3.3.7 Matrix inverses
        3.3.8 Matrix determinants
        3.3.9 Transpose of a matrix
        3.3.10 Symmetric matrices
        3.3.11 The Trace
        3.3.12 Orthogonal matrix
        3.3.13 Adjoints
    3.4 Subspaces
        3.4.1 Spans and subspaces
        3.4.2 Basis of a subspace
        3.4.3 Dimension of a subspace
        3.4.4 Nullspace of a matrix
        3.4.5 Columnspace
        3.4.6 Rank
        3.4.7 The standard basis
        3.4.8 Orthogonal complements
        3.4.9 Coordinates with respect to a basis
        3.4.10 Orthonormal bases
    3.5 Transformations
        3.5.1 Linear transformations
        3.5.2 Kernels
    3.6 Images
        3.6.1 Image of a subset of a domain
        3.6.2 Image of a subspace
        3.6.3 Image of a transformation
        3.6.4 Preimage of a set
    3.7 Projections
        3.7.1 Projections onto subspaces
    3.8 Identifying transformation properties
        3.8.1 Determining if a transformation is surjective
        3.8.2 Determining if a transformation is injective
        3.8.3 Determining if a transformation is invertible
        3.8.4 Inverse transformations of linear transformations
    3.9 Eigenvalues and Eigenvectors
        3.9.1 Properties of eigenvalues and eigenvectors
        3.9.2 Diagonalizable matrices
        3.9.3 Eigenvalues & eigenvectors of symmetric matrices
        3.9.4 Eigenspace
        3.9.5 Eigenbasis
    3.10 Tensors
    3.11 References

4 Calculus
    4.1 Differentiation
        4.1.1 Computing derivatives
        4.1.2 Notation
        4.1.3 Differentiation rules
        4.1.4 Higher order derivatives
        4.1.5 Explicit differentiation
        4.1.6 Implicit differentiation
        4.1.7 Derivatives of trigonometric functions
        4.1.8 Derivatives of exponential and logarithmic functions
        4.1.9 Extreme Value Theorem
        4.1.10 Rolle's Theorem
        4.1.11 Mean Value Theorem
        4.1.12 L'Hopital's Rule
        4.1.13 Taylor Series
    4.2 Integration
        4.2.1 Definite integral
        4.2.2 Basic properties of the integral
        4.2.3 Mean Value Theorem for Integration
        4.2.4 Antiderivatives
        4.2.5 The fundamental theorem of calculus
        4.2.6 Improper integrals
    4.3 Multivariable Calculus
        4.3.1 Integration
        4.3.2 Partial derivatives
        4.3.3 Directional derivatives
        4.3.4 Gradients
        4.3.5 The Jacobian
        4.3.6 The Hessian
        4.3.7 Scalar and vector fields
        4.3.8 Divergence
        4.3.9 Curl
        4.3.10 Optimization with eigenvalues
    4.4 Differential Equations
        4.4.1 Solving simple differential equations
        4.4.2 Basic first order differential equations
    4.5 References

5 Probability
    5.1 Probability space
    5.2 Random Variables
    5.3 Joint and disjoint probabilities
    5.4 Conditional Probability
    5.5 Independence
        5.5.1 Conditional Independence
    5.6 The Chain Rule of Probability
    5.7 Combinations and Permutations
        5.7.1 Permutations
        5.7.2 Combinations
        5.7.3 Combinations, permutations, and probability
    5.8 Probability Distributions
        5.8.1 Probability Mass Functions (PMF)
        5.8.2 Probability Density Functions (PDF)
        5.8.3 Distribution Patterns
    5.9 Cumulative Distribution Functions (CDF)
        5.9.1 Discrete random variables
        5.9.2 Continuous random variables
        5.9.3 Using CDFs
        5.9.4 Survival function
    5.10 Expected Values
        5.10.1 Discrete random variables
        5.10.2 Continuous random variables
        5.10.3 The expectation rule
        5.10.4 Jensen's Inequality
        5.10.5 Properties of expectations
    5.11 Variance
        5.11.1 Covariance
    5.12 Common Probability Distributions
        5.12.1 Probability mass functions
        5.12.2 Probability density functions
    5.13 Pareto distributions
    5.14 Multiple random variables
        5.14.1 Conditional distributions
        5.14.2 Multivariate Gaussian
    5.15 Bayes' Theorem
        5.15.1 Intuition
        5.15.2 A Visual Explanation
        5.15.3 An Example Bayes' Problem
        5.15.4 Solving the problem with Bayes' Theorem
        5.15.5 Another Example
        5.15.6 Naive Bayes
    5.16 The log trick
    5.17 Information Theory
        5.17.1 Entropy
        5.17.2 Specific Conditional Entropy
        5.17.3 Conditional Entropy
        5.17.4 Information Gain
        5.17.5 Kullback-Leibler (KL) divergence
    5.18 References

6 Statistics
        6.0.1 Notation
    6.1 Descriptive Statistics
        6.1.1 Scales of Measurement
        6.1.2 Averages
        6.1.3 Population vs Sample
        6.1.4 Independent and Identically Distributed
        6.1.5 The Law of Large Numbers (LLN)
        6.1.6 Regression to the mean
        6.1.7 Central Limit Theorem (CLT)
        6.1.8 Dispersion (Variance and Standard Deviation)
        6.1.9 Moments
        6.1.10 Covariance
        6.1.11 Correlation
        6.1.12 Degrees of Freedom
        6.1.13 Time Series Analysis
        6.1.14 Survival Analysis
    6.2 Inferential Statistics
        6.2.1 Error
        6.2.2 Estimates and estimators
        6.2.3 Consistency
        6.2.4 Point Estimation
        6.2.5 Nuisance Parameters
        6.2.6 Confidence Intervals
        6.2.7 Kernel Density Estimates
    6.3 Experimental Statistics
        6.3.1 Statistical Power
        6.3.2 Sample Selection
        6.3.3 The Null Hypothesis
        6.3.4 Type 1 Errors
        6.3.5 P Values
        6.3.6 The Base Rate Fallacy
        6.3.7 False Discovery Rate
        6.3.8 Alpha Level
        6.3.9 The Benjamini-Hochberg Procedure
        6.3.10 Sum of Squares
        6.3.11 Statistical Tests
        6.3.12 Effect Size
        6.3.13 Reliability
        6.3.14 Agreement
    6.4 Handling Data
        6.4.1 Transforming data
        6.4.2 Dealing with missing data
        6.4.3 Resampling
    6.5 References

7 Bayesian Statistics
        7.0.1 Frequentist vs Bayesian approaches
    7.1 Bayes' Theorem
        7.1.1 Likelihood
    7.2 Choosing a prior distribution
        7.2.1 Conjugate priors
        7.2.2 Sensitivity Analysis
        7.2.3 Empirical Bayes
    7.3 Markov Chain Monte Carlo (MCMC)
        7.3.1 Monte Carlo Integration
        7.3.2 Markov Chains
        7.3.3 Markov Chain Monte Carlo
    7.4 Variational Inference
    7.5 Bayesian point estimates
    7.6 Credible Intervals (Credible Regions)
    7.7 Bayesian Regression
    7.8 A Bayesian example
        7.8.1 References

8 Graphs
    8.1 References

9 Probabilistic Graphical Models
    9.1 Factors
    9.2 Belief (Bayesian) Networks
        9.2.1 Conditional independence assumptions
        9.2.2 Properties of belief networks
        9.2.3 Example
        9.2.4 Independence
        9.2.5 Conditional independence
        9.2.6 Template models
        9.2.7 Temporal models
        9.2.8 Markov Models
        9.2.9 Dynamic Bayes Networks (DBNs)
        9.2.10 Plate models
        9.2.11 Structured CPDs
        9.2.12 Querying Bayes' nets
        9.2.13 Inference in Bayes' nets
    9.3 Markov Networks
        9.3.1 Gibbs distribution
        9.3.2 Conditional Random Fields
        9.3.3 Log-linear models
    9.4 References

10 Optimization
        10.0.1 Convex optimization
        10.0.2 Constrained optimization
    10.1 Gradient vs non-gradient methods
    10.2 Gradient Descent
        10.2.1 Stochastic gradient descent (SGD)
        10.2.2 Epochs vs iterations
        10.2.3 Learning rates
        10.2.4 Conditioning
    10.3 Simulated Annealing
    10.4 Nelder-Mead (aka Simplex or Amoeba optimization)
    10.5 Particle Swarm Optimization
    10.6 Evolutionary Algorithms
        10.6.1 Genetic Algorithms
        10.6.2 Evolution Strategies
        10.6.3 Evolutionary Programming
    10.7 Derivative-Free Optimization
    10.8 Hessian optimization
    10.9 Advanced optimization algorithms
    10.10 References

11 Algorithms
    11.1 Algorithm design paradigms
    11.2 Algorithmic Analysis
        11.2.1 Asymptotic Analysis
        11.2.2 Loop examples
        11.2.3 Big-Oh formal definition
        11.2.4 Big-Omega notation
        11.2.5 Big-Theta notation
        11.2.6 Little-Oh notation
    11.3 The divide-and-conquer paradigm
        11.3.1 The Master Method/Theorem
    11.4 Data Structures
        11.4.1 Heaps
        11.4.2 Balanced Binary Search Tree
        11.4.3 Hash Tables
        11.4.4 Bloom Filters
    11.5 P vs NP
        11.5.1 NP-hard
        11.5.2 NP-completeness
    11.6 References

II Machine Learning

12 Overview
    12.1 Representation vs Learning
    12.2 Types of learning
    12.3 References

13 Supervised Learning
    13.1 Basic concepts
        13.1.1 Regression
        13.1.2 Classification
    13.2 Optimization
        13.2.1 Cost functions
        13.2.2 Gradient Descent
        13.2.3 Normal Equation
        13.2.4 Deciding between Gradient Descent and the Normal Equation
        13.2.5 Advanced optimization algorithms
    13.3 Preprocessing
        13.3.1 Feature selection
        13.3.2 Feature engineering
        13.3.3 Scaling (normalization)
        13.3.4 Mean subtraction
        13.3.5 Dimensionality Reduction
        13.3.6 Bagging ("Bootstrap aggregating")
    13.4 Linear Regression
        13.4.1 Univariate (simple) Linear Regression
        13.4.2 How are the parameters determined?
        13.4.3 Multivariate linear regression
        13.4.4 Example implementation of linear regression with gradient descent
        13.4.5 Outliers
        13.4.6 Polynomial Regression
    13.5 Logistic Regression
        13.5.1 One-vs-All
    13.6 Softmax regression
        13.6.1 Hierarchical Softmax
    13.7 Generalized linear models (GLMs)
        13.7.1 Linear Mixed Models (Mixed Models/Hierarchical Linear Models)
    13.8 Support Vector Machines
        13.8.1 Kernels
        13.8.2 More on support vector machines
    13.9 Decision Trees
        13.9.1 Measures of impurity
        13.9.2 Random forests
        13.9.3 Classification loss functions
    13.10 Ensemble models
        13.10.1 Boosting
        13.10.2 Stacking
    13.11 Overfitting
    13.12 Regularization
        13.12.1 Ridge regression
        13.12.2 LASSO
        13.12.3 Regularized Linear Regression
        13.12.4 Regularized Logistic Regression
    13.13 Probabilistic modeling
        13.13.1 Discriminative vs Generative learning algorithms
        13.13.2 Maximum Likelihood Estimation (MLE)
        13.13.3 Expectation Maximization
    13.14 References

14 Neural Nets
    14.1 Biological basis
    14.2 Perceptrons
    14.3 Sigmoid (logistic) neurons
    14.4 Activation functions
        14.4.1 Common activation functions
        14.4.2 Softmax function
        14.4.3 Radial basis functions
    14.5 Feed-forward neural networks
    14.6 Training neural networks
        14.6.1 Backpropagation
        14.6.2 Statistical (stochastic) training
        14.6.3 Learning rates
        14.6.4 Training algorithms
        14.6.5 Batch Normalization
        14.6.6 Cost (loss/objective/error) functions
        14.6.7 Weight initialization
        14.6.8 Shuffling & curriculum learning
        14.6.9 Gradient noise
        14.6.10 Adversarial examples
        14.6.11 Gradient Checking
        14.6.12 Training tips
        14.6.13 Transfer Learning
    14.7 Network architectures
    14.8 Overfitting
        14.8.1 Regularization
        14.8.2 Artificially expanding the training set
    14.9 Hyperparameters
        14.9.1 Choosing hyperparameters
        14.9.2 Tweaking hyperparameters
    14.10 Deep neural networks
        14.10.1 Unstable gradients
        14.10.2 Rmsprop
    14.11 Convolutional Neural Networks (CNNs)
        14.11.1 Local receptive fields
        14.11.2 Shared weights
        14.11.3 Pooling layers
        14.11.4 Network architecture
        14.11.5 Training CNNs
        14.11.6 Convolution kernels
    14.12 Recurrent Neural Networks (RNNs)
        14.12.1 Network architecture
        14.12.2 RNN inputs
        14.12.3 Training RNNs
        14.12.4 LSTMs
        14.12.5 BI-RNNs
        14.12.6 Attention mechanisms
    14.13 Unsupervised neural networks
        14.13.1 Autoencoders
        14.13.2 Sparse Coding
        14.13.3 Restricted Boltzmann machines
        14.13.4 Deep Belief Nets
    14.14 Other neural networks
        14.14.1 Modular Neural Networks
        14.14.2 Recursive Neural Networks
        14.14.3 Nonlinear neural nets
        14.14.4 Neural Turing Machines
    14.15 Neuroevolution
    14.16 Generative Adversarial Networks
        14.16.1 Training generative adversarial networks
    14.17 References

15 Model Selection
    15.1 Model evaluation
        15.1.1 Validation vs Testing
    15.2 Evaluating regression models
        15.2.1 Residuals
        15.2.2 Coefficient of determination
    15.3 Evaluating classification models
        15.3.1 Area under the curve (AUC)
        15.3.2 Confusion Matrices
        15.3.3 Log-loss
        15.3.4 F1 score
    15.4 Metric selection
    15.5 Hyperparameter selection
        15.5.1 Grid search
        15.5.2 Random search
        15.5.3 Bayesian Hyperparameter Optimization
        15.5.4 Choosing the Learning Rate α
    15.6 CASH
    15.7 References
    15.8 Bayes nets

16 Bayesian Learning
    16.1 Bayesian models
        16.1.1 Hidden Markov Models
        16.1.2 Model-based clustering
        16.1.3 Naive Bayes
    16.2 Inference in Bayesian models
        16.2.1 Maximum a posteriori (MAP) estimation
    16.3 Maximum A Posteriori (MAP)
        16.3.1 Markov Chain Monte Carlo (MCMC)
    16.4 Nonparametric models
        16.4.1 What is a nonparametric model?
        16.4.2 An example
        16.4.3 Parametric models vs nonparametric models
        16.4.4 Why use a Bayesian nonparametric approach?
    16.5 The Dirichlet Process
        16.5.1 Dirichlet Distribution
        16.5.2 Finite Mixture Models
        16.5.3 Chinese Restaurant Process
    16.6 Infinite Mixture Models and the Dirichlet Process
        16.6.1 Chinese Restaurant Process
        16.6.2 Polya Urn Model
        16.6.3 Stick-Breaking Process
        16.6.4 Dirichlet Process
    16.7 Model selection
        16.7.1 Model fitting vs Model selection
        16.7.2 Model fitting
        16.7.3 Model Selection
        16.7.4 Model averaging
    16.8 References

17 NLP
    17.1 Problems
    17.2 Challenges
    17.3 Terminology
    17.4 Data preparation
        17.4.1 Sentence segmentation
        17.4.2 Tokenization
        17.4.3 Normalization
        17.4.4 Term Frequency-Inverse Document Frequency (tf-idf) Weighting
        17.4.5 The Vector Space Model (VSM)
        17.4.6 Normalizing vectors
    17.5 Measuring similarity between text
        17.5.1 Minimum edit distance
        17.5.2 Jaccard coefficient
        17.5.3 Euclidean Distance
        17.5.4 Cosine similarity
    17.6 (Probabilistic) Language Models
        17.6.1 A naive method
        17.6.2 A less naive method
        17.6.3 n-gram Models
        17.6.4 Log-Linear Models
        17.6.5 History-based models
        17.6.6 Global Linear Models
        17.6.7 Evaluating language models: perplexity
    17.7 Parsing
        17.7.1 Context-free grammars (CFGs)
        17.7.2 Probabilistic Context-Free Grammars (PCFGs)
    17.8 Text Classification
        17.8.1 Naive Bayes
        17.8.2 Evaluating text classification
    17.9 Tagging
        17.9.1 Generative models
        17.9.2 Hidden Markov Models (HMM)
        17.9.3 The Viterbi algorithm
    17.10 Named Entity Recognition (NER)
    17.11 Relation Extraction
        17.11.1 Ontological Relations
        17.11.2 Methods
    17.12 Sentiment Analysis
        17.12.1 Sentiment Lexicons
        17.12.2 Challenges
    17.13 Summarization
        17.13.1 The general approach
    17.14 Machine Translation
        17.14.1 Challenges in machine translation
        17.14.2 Classical machine translation methods
        17.14.3 Statistical machine translation methods
        17.14.4 Phrase-Based Translation
    17.15 Word Clustering
    17.16 Neural Networks and NLP
        17.16.1 Word Embeddings
        17.16.2 CNNs for NLP
        17.16.3 References

18 Unsupervised Learning
    18.1 k-Nearest Neighbors (kNN)
    18.2 Clustering
        18.2.1 K-Means Clustering Algorithm
        18.2.2 Hierarchical Agglomerative Clustering
        18.2.3 Affinity Propagation
        18.2.4 Spectral Clustering
        18.2.5 Mean Shift Clustering
        18.2.6 Non-Negative Matrix Factorization (NMF)
        18.2.7 DBSCAN
        18.2.8 HDBSCAN
        18.2.9 CURE (Clustering Using Representatives)
    18.3 References

19 In Practice
    19.1 Machine Learning System Design
    19.2 Machine learning diagnostics
        19.2.1 Learning curves
        19.2.2 Important training figures
    19.3 Large Scale Machine Learning
        19.3.1 Map Reduce
    19.4 Online (live/streaming) machine learning
        19.4.1 Distribution Drift
    19.5 References

III Artificial Intelligence

    19.6 State-space and situation-space representations
        19.6.1 Search problems (planning)
        19.6.2 Problem formulation
        19.6.3 Trees
    19.7 Search algorithms
    19.8 Uninformed search
        19.8.1 Exhaustive ("British Museum") search
        19.8.2 Depth-First Search (DFS)
        19.8.3 Breadth-First Search (BFS)
        19.8.4 Uniform Cost Search
        19.8.5 Branch & Bound
        19.8.6 Iterative deepening DFS
    19.9 Search enhancements
        19.9.1 Extended list filtering
    19.10 Informed (heuristic) search
        19.10.1 Greedy best-first search
        19.10.2 Beam Search
        19.10.3 A* Search
        19.10.4 Iterative Deepening A* (IDA*)
    19.11 Local search
        19.11.1 Hill-Climbing
        19.11.2 Other local search algorithms
    19.12 Graph search
        19.12.1 Consistent heuristics
    19.13 Adversarial search (games)
        19.13.1 Minimax
        19.13.2 Alpha-Beta
    19.14 Non-deterministic search
        19.14.1 Expectimax search
        19.14.2 Monte Carlo Tree Search
        19.14.3 Markov Decision Processes (MDPs)
        19.14.4 Decision Networks
    19.15 Policies
        19.15.1 Policy evaluation
        19.15.2 Policy extraction
        19.15.3 Policy iteration
    19.16 Constraint satisfaction problems (CSPs)
        19.16.1 Varieties of CSPs
        19.16.2 Search formulation
        19.16.3 Backtracking search
        19.16.4 Iterative improvement algorithms for CSPs
    19.17 Online Evolution
. . . . . . . . . . . . . . . . . 559 19.18 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 559 20 Planning 561 20.1 An example planning problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 562 20.2 State-space planning vs plan-space (partial-order) planning . . . . . . . . . . . . . . . . . . . . . 563 20.3 State-space planning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 563 20.3.1 Representing plans and systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 564 20.3.2 STRIPS (Stanford Research Institute Problem Solver) . . . . . . . . . . . . . . . . . . . . 565 20.3.3 Other representations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 567 20.3.4 Applicability and state transitions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 567 20.3.5 Searching for plans . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 570 20.3.6 The FF Planner . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 571 20.4 Plan-space (partial-order) planning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 572 20.4.1 Plan refinement operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 573 20.4.2 The Plan-Space Search Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 574 20.4.3 Threats and flaws . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 574 20.4.4 Partial order solutions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 575 20.4.5 The Plan-Space Planning (PSP) algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 575 20.4.6 The UC Partial-Order Planning (UCPoP) Planner . . . . . . . . . . . . . . . . . . . . . . . 578 20.5 Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 578 20.5.1 Simple Task Networks (STN) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 579 CONTENTS 21 CONTENTS 22 20.5.2 Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 580 20.5.3 Planning Domains and Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 581 20.5.4 Planning with task networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 582 20.5.5 Hierarchical Task Network (HTN) planning . . . . . . . . . . . . . . . . . . . . . . . . . . 583 20.6 Graphplan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 584 20.6.1 Action independence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 587 20.6.2 Independent action execution order . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 588 20.6.3 Layered plans . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 588 20.6.4 Mutual exclusivity (mutex) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 589 20.6.5 Forward planning graph expansion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 590 20.6.6 Backward graph search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 591 20.7 Other considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 593 20.7.1 Planning under uncertainty . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 593 20.7.2 Planning with time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
. . . . . . . . . 594 20.7.3 Multi-agent planning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 595 20.7.4 Scheduling: Dealing with resources . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 595 20.8 Learning plans . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 596 20.8.1 Apprenticeship . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 596 20.8.2 Case-Based Goal Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 596 20.9 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 597 21 Reinforcement learning 599 21.1 Model-based learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 600 21.1.1 Temporal Difference Learning (TDL or TD Learning) . . . . . . . . . . . . . . . . . . . . . 601 21.1.2 Exploration agent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 602 21.2 Model-free learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 602 21.2.1 Q-Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 604 21.2.2 Exploration vs exploitation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 606 21.2.3 Approximate Q-Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 606 21.2.4 Policy Search . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 608 21.2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 608 21.3 Deep Q-Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 609 21.4 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 610 22 CONTENTS 23 CONTENTS 22 Filtering 611 22.1 Particle filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 611 22.1.1 DBN particle filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 613 22.2 Kalman Filters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 613 22.3 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 623 23 In Practice 625 23.1 Starcraft . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 625 23.2 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 626 IV Simulation 627 24 Agent-Based Models 629 24.1 Agents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 629 24.1.1 Brownian agents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 629 24.2 Multi-task and multi-scale problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 630 24.3 Utilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 630 24.4 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 631 25 Nonlinear Dynamics 633 25.1 Maps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 634 25.1.1 Bifurcations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 635 25.1.2 Return maps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
635 25.1.3 Bifurcation diagrams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 635 25.1.4 Feigenbaum number . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 637 25.1.5 Sensitive to initial conditions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 638 25.2 Flows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 638 25.2.1 Ordinary differential equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 639 25.2.2 More on ODEs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 641 25.2.3 Reminder on distinction b/w difference and differential equations . . . . . . . . . . . . . 643 25.2.4 ODE Solvers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 643 25.2.5 More on stable and unstable manifolds . . . . . . . . . . . . . . . . . . . . . . . . . . . 645 25.2.6 Lyapunov exponents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 645 25.2.7 Sections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 646 25.2.8 Unstable periodic orbits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 646 CONTENTS 23 CONTENTS 24 25.3 Nonlinear Time Series Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 646 25.3.1 Delay-Coordinate Embedding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 647 25.3.2 Fractal dimensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 650 25.4 Estimating Lyaponuv exponents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 652 25.4.1 Wolf’s algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 652 25.4.2 The Kantz algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 652 25.5 Noise filtering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 653 25.6 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 653 25.7 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 654 V In Practice 26 Process 655 657 26.1 Data analysis approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 657 26.2 Data Collection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 658 26.3 Sampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 658 26.4 Studies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 658 26.5 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 659 26.6 Learning: Tips . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 659 26.6.1 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 659 26.7 Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 660 27 Data Visualization 661 27.1 Bivariate charts . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 661 27.2 Histograms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 661 27.3 Scatterplots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
Introduction

These are my personal notes, broadly intended to cover the basics necessary for data science, machine learning, and artificial intelligence. They have been collected from a variety of different sources, which I include as references when I remember to - so take this as a disclaimer that most of this material is adapted, and sometimes directly copied, from elsewhere. Maybe it is better to call this a "remix" or "katamari" sampled from resources elsewhere. I have tried to give credit where it is due, but sometimes I forget to include all my references, so I will generally just say that I take no credit for any of the material here. Many of the graphics and illustrations are my own creation or have been re-created from others, but plenty have also been sourced from elsewhere - again, I have tried to give credit where it is due, but some things slip through.

Data science, machine learning, and artificial intelligence are huge fields that share some foundational overlap but go in quite different directions. These notes are not comprehensive but aim to cover a significant portion of that common ground (and a bit beyond). They are intended to provide intuitive understandings rather than rigorous proofs; if you are interested in proofs, there are many other resources that cover them. Since mathematical concepts typically have many different applications and interpretations, and are often arrived at through different disciplines and perspectives, I try to explain these concepts in as many ways as possible.
Some caveats:

• These are my personal notes; while I hope that they are helpful, they may not be helpful for you in particular!
• This is still very much a work in progress and it will be changing a lot - a lot may be out of order, missing, or littered with TODOs.
• These notes are compiled from many sources, so there may be sudden shifts in notation or convention - one day I hope to do a deep pass and fix that, but who knows when that will be.
• These notes are generated from markdown files, so they unfortunately lack any snazzy interactivity. I tried many ways to write markdown post-processors to add some, but it's a big time sink.

The raw notes and graphics are open source - if you encounter errors or have a better way of explaining something, please don't hesitate to submit a pull request.

~ Francis Tseng ([@frnsys](https://twitter.com/frnsys))

Part I: Foundations

1 Functions

Fundamentally, a function is a relationship (mapping) between the values of some set $X$ and some set $Y$:

$$f : X \to Y$$

A function can map a set to itself. For example, $f(x) = x^2$, also notated $f : x \mapsto x^2$, is a mapping of the real numbers to the real numbers, i.e. $f : \mathbb{R} \to \mathbb{R}$.

The set you are mapping from is called the domain. The set being mapped to is called the codomain. The range is the subset of the codomain which the function actually maps to (a function doesn't necessarily map to every value in the codomain; where it does, the range equals the codomain). In this sense, a function is a mapping between sets.

Functions which map to $\mathbb{R}$ are known as scalar-valued or real-valued functions. Functions which map to $\mathbb{R}^n$ where $n > 1$ are known as vector-valued functions.

1.0.1 Identity functions

An identity function maps something to itself:

$$I_X : X \to X$$

That is, for every $a$ in $X$, $I_X(a) = a$:

$$I_X(a) = a, \quad \forall\, a \in X$$

1.0.2 The inverse of a function

Say we have a function $f : X \to Y$, where $f(a) = b$ for any $a \in X$. We say $f$ is invertible if and only if there exists a function $f^{-1} : Y \to X$ such that $f^{-1} \circ f = I_X$ and $f \circ f^{-1} = I_Y$. Note that $\circ$ denotes function composition, i.e. $(f \circ g)(x) = f(g(x))$.

If an inverse exists, it is unique. An invertible function is both surjective and injective (described below); that is, there is exactly one $x$ for each $y$.

1.0.3 Surjective functions

A surjective function, also called "onto", is a function $f : X \to Y$ where, for every $y \in Y$, there exists at least one $x \in X$ such that $f(x) = y$. That is, every $y$ has at least one corresponding $x$ value. This is equivalent to:

$$\text{range}(f) = Y$$

1.0.4 Injective functions

An injective function, also called "one-to-one", is a function $f : X \to Y$ where, for every $y \in Y$, there exists at most one $x \in X$ such that $f(x) = y$. That is, not every $y$ necessarily has a corresponding $x$, but those that do have only one.

1.0.5 Surjective & injective functions

A function can be both surjective and injective (such a function is called bijective), which means that for every $y \in Y$ there exists exactly one $x \in X$ such that $f(x) = y$; that is, every $y$ has exactly one corresponding $x$. As mentioned before, an invertible function is both surjective and injective!

1.0.6 Convex and non-convex functions

A convex function is a continuous function whose value at the midpoint of every interval in its domain does not exceed the arithmetic mean of its values at the ends of the interval. (Convex Function. Weisstein, Eric W.
Wolfram MathWorld) A convex region is one in which any two points in the region can be joined by a straight line that does not leave the region. Which is to say that a convex function has a minimum, and only one (and this is also the only position where the derivative is 0). More formally, a function is convex if the second derivative is positive everywhere. A function can be convex on a range [a, b] if its second derivative is positive everywhere in that range. In higher dimensions, these derivatives aren’t scalar values, so we instead define convexity if the Hessian H (the matrix of second derivatives) is positive semidefinite (notated H ⪰ 0). It is strictly convex if H is positive definite (notated H ≻ 0). Refer to the Calculus section for more details on this. Convex and non-convex functions 1.0.7 Transcendental functions Transcendental functions are those that are not polynomial, e.g. sin, exp, log, etc. CHAPTER 1. FUNCTIONS 33 34 1.0.8 Logarithms Logarithms are frequently encountered. They have many useful properties, such as turning multiplication into addition: log(xy ) = log(x) + log(y ) Multiplying many small numbers is problematic with computers, leading to underflow errors. Logarithms are commonly used to turn this kind of multiplication into addition and avoid underflow errors. Note that log(x), without any base, typically implies the natural log, i.e. loge (x), sometimes notated ln(x), which has the inverse exp(x), more commonly seen as e x . 34 CHAPTER 1. FUNCTIONS 35 2 Other useful concepts 2.1 Solving analytically vs numerically Often you may see a distinction made between solving a problem analytically (sometimes algebraeically is used) and solving a problem numerically. Solving a problem analytically means you can exploit properties of the objects and equations, e.g. through methods from calculus, avoiding substituting numerical values for the variables you are manipulating (that is, you only need to manipulate symbols). If a problem may be solved analytically, the resulting solution is called a closed form solution (or the analytic solution) and is an exact solution. Not all problems can be solved analytically; generally more complex mathematical models have no closed form solution. These problems are also often the ones of most interest. Such problems need to be approximated numerically, which involves evaluating the equations many times by substituting different numerical values for variables. The result is an approximate (numerical) solution. 2.2 Linear vs nonlinear models You’ll often see a caveat with algorithms that they only work for linear models. On the other hand, some models are touted for their capacity for nonlinear models. A linear model is a model which takes the general form: y = β0 + β1 x1 + · · · + βn xn Note that this function does not need to produce a literal line. The “linear” constraint does not apply to the predictor variables x1 , . . . , xn . For instance, the function y = x 2 is linear. CHAPTER 2. OTHER USEFUL CONCEPTS 35 2.3. METRICS 36 “Linear” refers to the parameters; i.e. the function must be “linear in the parameters”, meaning that the parameters β0 , . . . , βn themselves must form a line (or its equivalent in whatever dimensional space you’re working in). A nonlinear model includes parameters such as β 2 or β0 β1 (that is, multiple parameters in the same term, which is not linear) or transcendental functions. 2.3 Metrics Many artificial intelligence and machine learning algorithms are based on or benefit from some kind of metric. 
In this context the term has a concrete definition. The typical case for metrics is around similarity. Say you have a bunch of random variables Xi which take on values in a label space V . If Xi and Xj are connected by an edge, we want them to take on “similar” values. How do we define “similar”? We’ll use a distance function µ : V × V → R+ , which needs to satisfy: • reflexivity: µ(v , v ) = 0 for all v • symmetry: µ(v1 , v2 ) = µ(v2 , v1 ) for all v1 , v2 • triangle inequality: µ(v1 , v2 ) ≤ µ(v1 , v3 ) + µ(v3 , v2 ) for all v1 , v2 , v3 If all these are satisfied, we say that µ is a metric. If only reflexivity and symmetry are satisfied, we have a semi-metric instead. So we can create a feature fij (Xi , Xj ) = µ(Xi , Xj ) and then this works out such that: exp(−wij fij (Xi , Xj )), wij > 0 that the lower the distance (metric), the higher the probability. 2.4 References • Convex Function. Weisstein, Eric W. Wolfram MathWorld. 36 CHAPTER 2. OTHER USEFUL CONCEPTS 37 3 Linear Algebra When working with data, we typically deal with many data points consisting of many dimensions. That is, each data point may have a several components; e.g. if people are your data points, they may be represented by their height, weight, and age, which constitutes three dimensions all together. These data points of many components are called vectors. These are contrasted with individual values, which are called scalars. We deal with these data points - vectors - simultaneously in aggregates known as matrices. Linear algebra provides the tools needed to manipulate vectors and matrices. 3.1 Vectors Vectors have magnitude and direction, e.g. 5 miles/hour going east. The magnitude can be thought of, in some sense, as the “length” of the vector (this isn’t quite right however, as there are many concepts of “length” - see norms). Formally, this example would be represented: [ ] ⃗ v= 5 0 since we are “moving” 5 on x-axis and 0 on the y-axis. Note that often the arrow is dropped, i.e. the vector is notated as just v . CHAPTER 3. LINEAR ALGEBRA 37 3.1. VECTORS 3.1.1 38 Real coordinate spaces Vectors are plotted and manipulated in space. A twodimensional vector, such as the previous example, may be represented in a two-dimensional space. A vector with three components would be represented in a three-dimensional space, and so on for any arbitrary n dimensions. A real coordinate space (that is, a space consisting of real numbers) of n dimensions is notated Rn . Such a space encapsulates all possible vectors of that dimensionality, i.e. all possible vectors of the form [v1 , v2 , . . . , vn ]. A vector To denote a vector of n dimensions, we write x ∈ R . n For example: the notation for the two-dimensional real coordinate space is notated R2 , which is all possible realvalued 2-tuples (i.e. all 2D vectors whose components are real numbers, e.g. [1, 2], [−0.4, 21.4], . . . ). If we wanted to describe an arbitrary two-dimensional vector, we could do so with ⃗ v ∈ R2 . 3.1.2 Column and row vectors A vector x ∈ Rn typically denotes a column vector, i.e. with n rows and 1 column. A row vector x T ∈ Rn has 1 row and n columns. The notation x T is described below. 3.1.3 Transposing a vector Transposing a vector means turning its rows into columns: x 1 x [ ] 2 T ,a a⃗ = ⃗ = x x x x 1 2 3 4 x3 x4 So a column vector x can be represented as a row vector with x T . 38 CHAPTER 3. LINEAR ALGEBRA 39 3.1.4 3.1. 
VECTORS Vectors operations Vector addition Vectors are added by adding the individual corresponding components: [ ] [ ] [ ] [ ] 6 −4 6 + −4 2 + = = 2 4 2+4 6 Multiplying a vector by a scalar To multiply a vector with a scalar, you just multiply the individual components of the vector by the scalar: [ ] [ ] [ ] 2 3×2 6 = = 3 1 3×1 3 This changes the magnitude of the vector, but not the direction. Vector dot products The dot product (also called inner product) of two vec- Example: The red vector is before multiplying tors a⃗, ⃗b ∈ Rn (note that this implies they must be of the a scalar, blue is after. same dimension) is notated: a⃗ · ⃗b It is calculated: a1 b1 n a2 b 2 ∑ a⃗ · ⃗b = · = a b + a b + · · · + a b = ai bi 1 1 2 2 n n ... ... i=1 an bn Which results in a scalar value. Note that sometimes the dot operator is dropped, so a dot product may be notated as just a⃗⃗b. Also note that the dot product of x · y is equivalent to the matrix multiplication x T y . Properties of vector dot products: CHAPTER 3. LINEAR ALGEBRA 39 3.1. VECTORS 40 • Commuative property: The order of the dot product doesn’t matter: a⃗ · ⃗b = ⃗b · a⃗ • Distributive property: You can distribute terms in dot products: (⃗ v +w ⃗)·⃗ x = (⃗ v ·⃗ x +w ⃗ ·⃗ x) • Associative property: (c⃗ v) · w ⃗ = c(⃗ v ·w ⃗) 3.1.5 Norms The norm of a vector x ∈ Rn , denoted ||x||, is the “length” of the vector. That is, norms are a generalization of “distance” or “length”. There are many different norms, the most common of which is the Euclidean norm (also known as the ℓ2 norm), denoted ||x||2 , computed: v u n √ u∑ ||x||2 = t xi2 = x T x i=1 This is the “as-the-crow-flies” distance that we are all familiar with. Generally, a norm is just any function f : Rn → R which satisfies the following properties: 1. 2. 3. 4. non-negativity: For all x ∈ Rn , f (x) ≥ 0 definiteness: f (x) = 0 if and only if x = 0 homogeneity: For all x ∈ Rn , t ∈ R, f (tx) = |t|f (x) triangle inequality: For all x, y ∈ Rn , f (x + y ) ≤ f (x) + f (y ) Another norm is the ℓ1 norm: ||x||1 = n ∑ |xi | i=1 and the ℓ∞ norm: ||x||∞ = max |xi | i These three norms are part of the family of ℓp norms, which are parameterized by a real number p ≥ 1 and defined as: ||x||p = ( n ∑ 1 |xi |p ) p i=1 There are also norms for [matrices], the most common of which is the Frobenius norm, analogous to the Euclidean (ℓ2 ) norm for vectors: ||A||Fro v u N M u∑ ∑ =t A2n,m n=1 m=1 40 CHAPTER 3. LINEAR ALGEBRA 41 3.1. VECTORS Lengths and dot products You may notice that the dot product of a vector with itself is the square of that vector’s length: a⃗ · a⃗ = a12 + a22 + · · · + an2 = ||⃗ a||2 So the length (Euclidean norm) of a vector can be written: ||⃗ a|| = 3.1.6 √ a⃗ · a⃗ Unit vectors Each dimension in a space has a unit vector, generally denoted with a hat, e.g. û, which is a vector constrained to that dimension (that is, it has 0 magnitude in all other dimensions), with length 1, e.g. ||û|| = 1. Unit vectors exists for all Rn . The unit vector is also called a normalized vector (which is not to be confused with a normal vector, which is something else entirely.) v is found by computing: The unit vector in the same direction as some vector ⃗ û = ⃗ v ||⃗ v || For instance, in R2 space, we would have two unit vectors: [ ] [ ] 1 0 î = , ĵ = 0 1 In R3 space, we would have three unit vectors: 1 0 0 î = 0 , ĵ = 1 , k̂ = 0 0 0 1 But you can have unit vectors in any direction. Say you have a vector: [ 5 a⃗ = −6 CHAPTER 3. LINEAR ALGEBRA ] 41 3.1. 
VECTORS 42 You can find a unit vector û in the direction of this vector like so: û = a⃗ ||⃗ a|| so, with our example: [ ] √5 1 1 5 61 û = a⃗ = √ = −6 √ ||⃗ a|| 61 −6 61 3.1.7 Angles between vectors Say you have two non-zero vectors, a⃗, ⃗b ∈ Rn . We often notate the angle between two vectors as θ. The law of cosine tells us, that for a triangle: C 2 = A2 + B 2 − 2AB cos θ Using this law, we can get the angle between our two vectors: ||⃗ a − ⃗b||2 = ||⃗b||2 + ||⃗ a||2 − 2||⃗ a||||⃗b|| cos θ which simplifies to: Angle between two vectors. a⃗ · ⃗b = ||⃗ a||||⃗b|| cos θ There are two special cases if the vectors are collinear, that is if a⃗ = c⃗b: • If c > 0, then θ = 0. • If c < 0, then θ = 180◦ 42 CHAPTER 3. LINEAR ALGEBRA 43 3.1. VECTORS 3.1.8 Perpendicular vectors With the above angle calculation, you can see that if a⃗ and ⃗b are non-zero, and their dot product is 0, that is, a⃗ · ⃗b = ⃗0, then they are perpendicular to each other. Whenever a pair of vectors satisfies this condition a⃗ · ⃗b = ⃗0, it is said that the two vectors are orthogonal (i.e. perpendicular). Note that because any vector times the zero vector equals the zero vector: ⃗0 · ⃗ x = ⃗0. Thus the zero vector is orthogonal to everything. Technical detail: So if the vectors are both non-zero and orthogonal, then the vectors are both perpendicular and orthogonal. But of course, since the zero vector is not non-zero, it cannot be perpendicular to anything, but it is orthogonal to everything. 3.1.9 Normal vectors A normal vector is one which is perpendicular to all the points/vectors on a plane. That is for any vector a⃗ on the plane, and a normal vector, n ⃗, to that plane, we have: n · a⃗ = ⃗0 ⃗ For example: given an equation of a plane, Ax + By + Cz = D, the normal vector is simply: ⃗ n = Aî + B ĵ + C k̂ 3.1.10 Orthonormal vectors Given V = {v⃗1 , v⃗2 , . . . , v⃗k } where: • ||⃗ vi || = 1 for i = 1, 2, . . . , k. That is, the length of each vector in V is 1 (that is, they have all been normalized). • v⃗i · v⃗j = 0 for i ̸= j. That is, these vectors are all orthogonal to each other. This can be summed up as: 0 i= ̸ j v⃗i · v⃗j = 1 i =j This is an orthonormal set. The term comes from the fact that these vectors are all orthogonal to each other, and they have all been normalized. CHAPTER 3. LINEAR ALGEBRA 43 3.1. VECTORS 3.1.11 44 Additional vector operations These vector operations are less common, but included for reference. Vector outer products For the outer product, the two vectors do not need to be of the same dimension (i.e. x ∈ Rn , y ∈ Rm ), and the result is a matrix instead of a scalar: x ⊗ y ∈ Rn×m x1 y1 · · · x1 ym . .. .. . = . . . xn y1 · · · xn ym Note that the outer product x ⊗ y is equivalent to the matrix multiplication xy T . Vector cross products Cross products are much more limited than dot products. Dot products can be calculated for any Rn . Cross products are only defined in R3 . Unlike the dot product, which results in a scalar, the cross product results in a vector which is orthogonal to the original vectors (i.e. it is orthogonal to the plane defined by the two original vectors). a1 b1 ⃗ a⃗ = a2 , b = b2 a3 b3 a2 b 3 − a3 b 2 a⃗ × ⃗b = a3 b1 − a1 b3 a1 b 2 − a2 b 1 For example: 1 5 −7 × 4 − 1 × 2 −30 −7 × 2 = 1 × 5 − 1 × 4 = 1 1 4 1 × 2 − −7 × 5 37 44 CHAPTER 3. LINEAR ALGEBRA 45 3.2. LINEAR COMBINATIONS 3.2 Linear Combinations 3.2.1 Parametric representations of lines Any line in an n-dimensional space can be represented using vectors. 
Say you have a vector ⃗ v and a set S consisting of all scalar multiplications of that vector (where the scalar c is any real number): S = {c⃗ v | c ∈ R} This set S represents a line, since multiplying a vector with scalars does not changes its direction, only its magnitudes, so that set of vectors covers the entirety of the line. A few of the infinite scalar multiplications which define the line. But that line is around the origin. If you wanted to shift it, you need only to add a vector, which x . So we could define a line as: we’ll call ⃗ L = {⃗ x + c⃗ v | c ∈ R} CHAPTER 3. LINEAR ALGEBRA 45 3.2. LINEAR COMBINATIONS 46 For example: say you are given two vectors: [ ] [ ] 2 ⃗ 0 a⃗ = , b= 1 3 Say you want to find the line that goes through them. First you need to find the vector along that intersecting line, which is just ⃗b − a⃗. Although in standard form, that vector originates at the origin. Thus you still need to shift it by finding the appropriate vector ⃗ x to add to it. But as you can probably see, we can use our a⃗ to shift it, giving us: Calculating the intersecting line for two vectors. L = {⃗ a + c(⃗b − a⃗) | c ∈ R} And this works for any arbitrary n dimensions! (Although in other spaces, this wouldn’t really be a “line”. In R3 space, for instance, this would define a plane.) You can convert this form to a parametric equation, where the equation for a dimension of the vector ai looks like: ai + (bi − ai )c Say you are in R2 so, you might have: {[ ] L= [ ] 0 −2 +c |c ∈ R 3 2 } you can write it as the following parametric equation: x = 0 + −2c = −2c y = 3 + 2c = 2c + 3 3.2.2 Linear combinations Say you have the following vectors in Rm : v⃗1 , v⃗2 , . . . , v⃗n 46 CHAPTER 3. LINEAR ALGEBRA 47 3.2. LINEAR COMBINATIONS A linear combination is just some sum of the combination of these vectors, scaled by arbitrary constants (c1 → cn ∈ R): c1 v⃗1 + c2 v⃗2 , + . . . + cn v⃗n For example: [ ] [ ] 2 ⃗ 0 a⃗ = , b= 1 3 A linear combination would be: [ ] 0 0⃗ a + 0⃗b = 0 Any vector in the space R2 can represented by some linear combination of these two vectors. 3.2.3 Spans The set of all linear combinations for some vectors is called the span. The span of some vectors can define an entire space. For instance, using our previously-defined vectors: span(⃗ a, ⃗b) = R2 But this is not always true for the span of any arbitrary set of vectors. For instance, this does not represent all the vectors in R2 : [ ] [ ] −2 2 span( , ) −2 2 These two vectors are collinear (that is, they lie along the same line), so combinations of them will only yield other vectors along that line. As another example, the span of the zero vector, span(⃗0), cannot represent all vectors in a space. Formally, the span is defined as: span(v⃗1 , v⃗2 , . . . , v⃗n ) = {c1 v⃗1 + c2 v⃗2 , + . . . + cn v⃗n | ci ∈ R ∀ 1 ≤ i ≤ n} CHAPTER 3. LINEAR ALGEBRA 47 3.2. LINEAR COMBINATIONS 3.2.4 48 Linear independence The set of vectors in the previous collinear example: {[ ] [ ]} −2 2 , −2 2 are called a linearly dependent set, which means that some vector in the set can be represented as the linear combination of some of the other vectors in the set. [ ] In this example, we could represent −2, −2 using the linear combination of the other vector, i.e. [ ] −1 2, 2 . You can think of a linearly dependent set as one that contains a redundant vector - one that doesn’t add any more information to the set. As another example: {[ ] [ ] [ ]} 2 7 9 , , 3 2 5 is linearly dependent because v⃗1 + v⃗2 = v⃗3 . 
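Here is a quick numerical sanity check (a numpy sketch added for illustration; it is not from the original notes): you can detect linear dependence by checking the rank of the matrix whose columns are the vectors in question. If the rank is smaller than the number of vectors, the set is linearly dependent.

```python
import numpy as np

# Stack the vectors from the example above as the columns of a matrix.
V = np.column_stack(([2, 3], [7, 2], [9, 5]))

# If the rank is less than the number of vectors, the set is linearly dependent.
rank = np.linalg.matrix_rank(V)
print(rank)               # 2
print(rank < V.shape[1])  # True -> the set is linearly dependent

# Indeed, v1 + v2 = v3:
print(V[:, 0] + V[:, 1])  # [9 5]
```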
Naturally, a set that is not linearly dependent is called a linearly independent set.

For a more formal definition of linear dependence: a set of vectors

$$S = \{\vec{v}_1, \vec{v}_2, \dots, \vec{v}_n\}$$

is linearly dependent iff (if and only if)

$$c_1\vec{v}_1 + c_2\vec{v}_2 + \dots + c_n\vec{v}_n = \vec{0} = \begin{bmatrix} 0 \\ \vdots \\ 0 \end{bmatrix}$$

for some $c_i$'s where at least one is non-zero.

To put the previous examples in context: if you can show that at least one of the vectors can be described as a linear combination of the other vectors in the set, that is,

$$\vec{v}_1 = a_2\vec{v}_2 + a_3\vec{v}_3 + \dots + a_n\vec{v}_n$$

then you have a linearly dependent set, because that can be rearranged to show:

$$\vec{0} = -1\vec{v}_1 + a_2\vec{v}_2 + a_3\vec{v}_3 + \dots + a_n\vec{v}_n$$

Thus you can produce the zero vector as a linear combination of the vectors where at least one coefficient is non-zero, which satisfies the definition of linear dependence. Conversely, a set is linearly independent if the only way to produce the zero vector as a linear combination of the vectors is for all of the coefficients to be zero.

Going back to spans: a linearly independent set of $n$ vectors in $\mathbb{R}^n$ spans the entire space $\mathbb{R}^n$.

An example problem: say you have the set

$$S = \left\{ \begin{bmatrix} 1 \\ -1 \\ 2 \end{bmatrix}, \begin{bmatrix} 2 \\ 1 \\ 3 \end{bmatrix}, \begin{bmatrix} -1 \\ 0 \\ 2 \end{bmatrix} \right\}$$

and you want to know:

• does $\text{span}(S) = \mathbb{R}^3$?
• is $S$ linearly independent?

For the first question, you want to see whether some linear combination of the set yields any arbitrary vector in $\mathbb{R}^3$:

$$c_1 \begin{bmatrix} 1 \\ -1 \\ 2 \end{bmatrix} + c_2 \begin{bmatrix} 2 \\ 1 \\ 3 \end{bmatrix} + c_3 \begin{bmatrix} -1 \\ 0 \\ 2 \end{bmatrix} = \begin{bmatrix} a \\ b \\ c \end{bmatrix}$$

You can break that out into a system of equations:

$$\begin{aligned} c_1 + 2c_2 - c_3 &= a \\ -c_1 + c_2 &= b \\ 2c_1 + 3c_2 + 2c_3 &= c \end{aligned}$$

And solve it, which gives you:

$$\begin{aligned} c_3 &= \frac{1}{11}(3c - 5a + b) \\ c_2 &= \frac{1}{3}(a + b + c_3) \\ c_1 &= a - 2c_2 + c_3 \end{aligned}$$

So you can get these coefficients for any $a, b, c$, and we can say $\text{span}(S) = \mathbb{R}^3$.

For the second question, we want to see whether the only coefficients that produce the zero vector are all zero:

$$c_1 \begin{bmatrix} 1 \\ -1 \\ 2 \end{bmatrix} + c_2 \begin{bmatrix} 2 \\ 1 \\ 3 \end{bmatrix} + c_3 \begin{bmatrix} -1 \\ 0 \\ 2 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}$$

We can just reuse the equations we derived for the coefficients, substituting $a = 0, b = 0, c = 0$, which gives us:

$$c_1 = c_2 = c_3 = 0$$

So we know this set is linearly independent.

3.3 Matrices

A matrix can, in some sense, be thought of as a vector of vectors. The notation $m \times n$ for a matrix means it has $m$ rows and $n$ columns. Such a matrix looks like:

$$A = \begin{bmatrix} a_{11} & a_{12} & \dots & a_{1n} \\ a_{21} & a_{22} & \dots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \dots & a_{mn} \end{bmatrix}$$

A matrix of these dimensions may also be notated $A \in \mathbb{R}^{m \times n}$ to indicate its membership in that set. We refer to the entry in the $i$th row and $j$th column with the notation $A_{ij}$.

3.3.1 Matrix operations

Matrix addition

Matrices must have the same dimensions in order to be added (or subtracted). The sum is taken element-wise:

$$A + B = \begin{bmatrix} a_{11} + b_{11} & a_{12} + b_{12} & \dots & a_{1n} + b_{1n} \\ a_{21} + b_{21} & a_{22} + b_{22} & \dots & a_{2n} + b_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} + b_{m1} & a_{m2} + b_{m2} & \dots & a_{mn} + b_{mn} \end{bmatrix}$$

Matrix addition is commutative: $A + B = B + A$.

Matrix-scalar multiplication

Just distribute the scalar to every entry:

$$cA = \begin{bmatrix} ca_{11} & ca_{12} & \dots & ca_{1n} \\ ca_{21} & ca_{22} & \dots & ca_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ ca_{m1} & ca_{m2} & \dots & ca_{mn} \end{bmatrix}$$

Matrix-vector products

To multiply an $m \times n$ matrix with a vector, the vector must have $n$ components (that is, the same number of components as there are columns in the matrix), i.e. $\vec{x} \in \mathbb{R}^n$:

$$\vec{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix}$$
The product would be:

$$A\vec{x} = \begin{bmatrix} a_{11}x_1 + a_{12}x_2 + \dots + a_{1n}x_n \\ a_{21}x_1 + a_{22}x_2 + \dots + a_{2n}x_n \\ \vdots \\ a_{m1}x_1 + a_{m2}x_2 + \dots + a_{mn}x_n \end{bmatrix}$$

This results in an $m \times 1$ matrix, i.e. a vector in $\mathbb{R}^m$.

Matrix-vector products as linear combinations

If you interpret each column of a matrix $A$ as its own vector $\vec{v}_i$, such that

$$A = \begin{bmatrix} \vec{v}_1 & \vec{v}_2 & \dots & \vec{v}_n \end{bmatrix}$$

then the product of a matrix and a vector can be rewritten simply as a linear combination of those column vectors:

$$A\vec{x} = x_1\vec{v}_1 + x_2\vec{v}_2 + \dots + x_n\vec{v}_n$$

Matrix-vector products as linear transformations

A matrix-vector product can also be seen as a linear transformation. You can describe it as a transformation:

$$T : \mathbb{R}^n \to \mathbb{R}^m, \quad T(\vec{x}) = A\vec{x}$$

It satisfies the conditions for a linear transformation (not shown here), so a matrix-vector product is always a linear transformation. Just to be clear: the transformation of a vector can always be expressed as that vector's product with some matrix; that matrix is referred to as the transformation matrix. So in the equations above, $A$ is the transformation matrix.

To reiterate:

• any matrix-vector product is a linear transformation
• any linear transformation can be expressed in terms of a matrix-vector product

Matrix-matrix products

To multiply two matrices, one must have the same number of columns as the other has rows. That is, you can only multiply an $m \times n$ matrix with an $n \times p$ matrix. The resulting matrix will have dimensions $m \times p$. That is, if $A \in \mathbb{R}^{m \times n}$ and $B \in \mathbb{R}^{n \times p}$, then $C = AB \in \mathbb{R}^{m \times p}$. The resulting matrix is defined as:

$$C_{ij} = \sum_{k=1}^n A_{ik}B_{kj}$$

You can break the computation out into individual matrix-vector products and then combine the resulting vectors to get the final matrix. More formally, the $i$th column of the resulting product matrix is obtained by multiplying $A$ with the $i$th column of $B$, for $i = 1, 2, \dots, p$.

For example:

$$\begin{bmatrix} 1 & 3 & 2 \\ 4 & 0 & 1 \end{bmatrix} \times \begin{bmatrix} 1 & 3 \\ 0 & 1 \\ 5 & 2 \end{bmatrix}$$

The columns of the product would be:

$$\begin{bmatrix} 1 & 3 & 2 \\ 4 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 \\ 0 \\ 5 \end{bmatrix} = \begin{bmatrix} 11 \\ 9 \end{bmatrix}, \quad \begin{bmatrix} 1 & 3 & 2 \\ 4 & 0 & 1 \end{bmatrix} \begin{bmatrix} 3 \\ 1 \\ 2 \end{bmatrix} = \begin{bmatrix} 10 \\ 14 \end{bmatrix}$$

so:

$$\begin{bmatrix} 1 & 3 & 2 \\ 4 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 3 \\ 0 & 1 \\ 5 & 2 \end{bmatrix} = \begin{bmatrix} 11 & 10 \\ 9 & 14 \end{bmatrix}$$

Properties of matrix multiplication

Matrix multiplication is not commutative. That is, for matrices $A$ and $B$, in general $A \times B \neq B \times A$ - the two products may not even have the same dimensions.

Matrix multiplication is associative. For matrices $A, B, C$:

$$A \times B \times C = A \times (B \times C) = (A \times B) \times C$$

There is also an identity matrix $I$. For any matrix $A$:

$$A \times I = I \times A = A$$

3.3.2 Hadamard product

The Hadamard product, sometimes called the element-wise product, is another way of multiplying matrices, but only for matrices of the same size. It is usually denoted with $\odot$. It is simply:

$$(A \odot B)_{n,m} = A_{n,m}B_{n,m}$$

It returns a matrix of the same size as the input matrices. Hadamard multiplication has the following properties:

• commutativity: $A \odot B = B \odot A$
• associativity: $A \odot (B \odot C) = (A \odot B) \odot C$
• distributivity: $A \odot (B + C) = A \odot B + A \odot C$
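The distinction between the matrix product and the Hadamard product is easy to see numerically. Here is a minimal numpy sketch (added for illustration, not from the original notes), reusing the example matrices above:

```python
import numpy as np

A = np.array([[1, 3, 2],
              [4, 0, 1]])
B = np.array([[1, 3],
              [0, 1],
              [5, 2]])

# Matrix-matrix product: dimensions must be (m x n) and (n x p).
print(A @ B)   # [[11 10]
               #  [ 9 14]]

# Hadamard (element-wise) product: both operands must have the same shape.
C = np.array([[1, 2, 3],
              [4, 5, 6]])
print(A * C)   # [[ 1  6  6]
               #  [16  0  6]]
```

Note that numpy's `*` operator is element-wise, while `@` (or `np.matmul`) is the matrix product - mixing the two up is a common source of bugs.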
3.3.3 The identity matrix

The identity matrix is an $n \times n$ matrix where every component is 0, except for those along the diagonal, which are 1:

$$I_n = \begin{bmatrix} 1 & 0 & \dots & 0 \\ 0 & 1 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & 1 \end{bmatrix}$$

When you multiply the identity matrix by any vector $\vec{x} \in \mathbb{R}^n$:

$$I_n\vec{x} = \vec{x}$$

That is, a vector multiplied by the identity matrix equals itself.

3.3.4 Diagonal matrices

A diagonal matrix is a matrix where all non-diagonal elements are 0, typically denoted $D = \text{diag}(d_1, d_2, \dots, d_n)$, where:

$$D_{ij} = \begin{cases} d_i & i = j \\ 0 & i \neq j \end{cases}$$

So the identity matrix is $I = \text{diag}(1, 1, \dots, 1)$.

3.3.5 Triangular matrices

We say that a matrix is upper-triangular if all of its elements below the diagonal are zero. Similarly, a matrix is lower-triangular if all of its elements above the diagonal are zero. A matrix is diagonal if it is both upper-triangular and lower-triangular.

3.3.6 Some properties of matrices

Associative property

$$(AB)C = A(BC)$$

i.e. it doesn't matter where the parentheses are. This applies to compositions as well: $(h \circ f) \circ g = h \circ (f \circ g)$.

Distributive property

$$A(B + C) = AB + AC$$
$$(B + C)A = BA + CA$$

3.3.7 Matrix inverses

If $A$ is an $m \times m$ matrix and it has an inverse, then:

$$AA^{-1} = A^{-1}A = I$$

Only square matrices can have inverses. An inverse does not exist for every square matrix; those that have one are called invertible or non-singular, otherwise they are non-invertible or singular. The inverse exists if and only if $A$ is full rank.

Invertible matrices $A, B \in \mathbb{R}^{n \times n}$ have the following properties:

• $(A^{-1})^{-1} = A$
• If $Ax = b$, we can multiply by $A^{-1}$ on both sides to obtain $x = A^{-1}b$
• $(AB)^{-1} = B^{-1}A^{-1}$
• $(A^{-1})^T = (A^T)^{-1}$; this matrix is often denoted $A^{-T}$

Pseudo-inverses

$A^\dagger$ is a pseudo-inverse (sometimes called a Moore-Penrose inverse) of $A$, which may be non-square, if the following are satisfied:

• $AA^\dagger A = A$
• $A^\dagger A A^\dagger = A^\dagger$
• $(AA^\dagger)^T = AA^\dagger$
• $(A^\dagger A)^T = A^\dagger A$

A pseudo-inverse exists and is unique for any matrix $A$. If $A$ is invertible, then $A^{-1} = A^\dagger$.

3.3.8 Matrix determinants

The determinant of a square matrix $A \in \mathbb{R}^{n \times n}$ is a function $\det : \mathbb{R}^{n \times n} \to \mathbb{R}$, denoted $|A|$, $\det(A)$, or sometimes with the parentheses dropped, $\det A$.

A more intuitive interpretation: say we are in a 2D space and we have some shape with some area. Then we apply a transformation to that space. The determinant describes how that shape's area has been scaled as a result of the transformation. This can be extended to 3D, replacing "area" with "volume". With this interpretation it's clear that a determinant of 0 means scaling area or volume to 0, which indicates that space has been "compressed" to a line or a point (or a plane in the case of 3D).

Inverse and determinant for a 2 × 2 matrix

Say you have the matrix:

$$A = \begin{bmatrix} a & b \\ c & d \end{bmatrix}$$

You can calculate the inverse of this matrix as:

$$A^{-1} = \frac{1}{ad - bc} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix}$$

Note that $A^{-1}$ is undefined if $ad - bc = 0$, which means that $A$ is not invertible. Intuitively:

• the inverse of a matrix essentially "undoes" the transformation that the matrix represents
• a determinant of 0 implies a transformation that squishes everything together in some way (e.g. onto a line), which means that some vectors that used to be distinct now occupy the same position
• by definition, a function takes one input and maps it to one output, so if we have (what used to be) different vectors mapped to the same position, we can't take that one position and re-map it back to different vectors - that would require a function that gives different outputs for the same input

The denominator $ad - bc$ is called the determinant. It is notated:

$$\det(A) = |A| = ad - bc$$

Inverse and determinant for an n × n matrix

Say we have an $n \times n$ matrix $A$. A submatrix $A_{\neg i, \neg j}$ is the $(n - 1) \times (n - 1)$ matrix constructed from $A$ by removing the $i$th row and the $j$th column of $A$.
You can calculate the determinant of an n × n matrix A by using some i th row of A, where 1 ≤ i ≤ n: det(A) = n ∑ (−1)i+j aij det(A¬i,¬j ) j=1 All the det(Aij ) eventually reduce to the determinant of a 2 × 2 matrix. Scalar multiplication of determinants For an n × n matrix A, det(kA) = k n det(A) Determinant of diagonal or triangular matrix The determinant of a diagonal or triangular matrix is simply the product of the elements along its diagonal. Properties of determinants • For A ∈ Rn×n , t ∈ R, multiplying a single row by the scalar t yields a new matrix B, for which |B| = t|A|. • For A ∈ Rn×n , |A| = |AT | CHAPTER 3. LINEAR ALGEBRA 57 3.3. MATRICES • • • • • 58 For A, B ∈ Rn×n , |AB| = |A||B| For A, B ∈ Rn×n , |A| = 0 if A is singular (i.e. non-invertible). 1 For A, B ∈ Rn×n , |A|−1 = |A| if A is non-singular (i.e. invertible). For A, B ∈ Rn×n , if two rows of A are swapped to produce B, then det(A) = − det(B) The determinant of a matrix A ∈ Rn×n is non-zero if and only if it has full rank; this also means you can check if a A is invertible by checking that its determinant is non-zero. 3.3.9 Transpose of a matrix The transpose of a matrix A is that matrix with its columns and rows swapped, denoted AT . More formally, let A be an m × n matrix, and let B = AT . Then B is an n × m matrix, and Bij = Aji . • Transpose of determinants: The determinant of a transpose is the same as the determinant of the original matrix: det(AT ) = det(A) • Transposes of sums: With matrices A, B, C where C = A + B, then C T = (A + B)T = AT + B T • Transposes of inverses: The transpose of the inverse is equal to the inverse of the transpose: (A−1 )T = (AT )−1 • Transposes of multiplication: (AB)T = B T AT • Transpose of a vector: for two column vectors a⃗, ⃗b, we know that a⃗ · ⃗b = ⃗b · a⃗ = a⃗T ⃗b, from x ) · y⃗ = ⃗ x · (AT y⃗ ) (proof omitted). which we can derive: (A⃗ 3.3.10 Symmetric matrices A square matrix A ∈ Rn×n is symmetric if A = AT . It is anti-symmetric if A = −AT . For any square matrix A ∈ Rn×n , the matrix A + AT is symmetric and the matrix A − AT is antisymmetric. Thus any such A can be represented as a sum of a symmetric and an anti-symmetric matrix: A= 1 1 (A + AT ) + (A − AT ) 2 2 Symmetric matrices have many nice properties. The set of all symmetric matrices of dimension n is often denoted as Sn , so you can denote a symmetric n × n matrix A as A ∈ Sn . The quadratic form Given a square matrix A ∈ Rn×n and a vector x ∈ R, the scalar value x T Ax is called a quadratic form: 58 CHAPTER 3. LINEAR ALGEBRA 59 3.3. MATRICES x T Ax = n ∑ n ∑ Aij xi xj i=1 j=1 Here A is typically assumed to be symmetric. Types of symmetric matrices Given a symmetric matrix A ∈ Sn … • A is positive definite (PD) if for all non-zero vectors x ∈ Rn , x T Ax > 0. – This is often denoted A ≻ 0 or A > 0. – The set of all positive definite matrices is denoted Sn++ . • A is positive semidefinite (PSD) if for all vectors x ∈ Rn , x T Ax ≥ 0. – This is often denoted A ⪰ 0 or A ≥ 0. – The set of all positive semidefinite matrices is denoted Sn+ . • A is negative definite (ND) if for all non-zero vectors x ∈ Rn , x T Ax < 0. – This is often denoted A ≺ 0 or A < 0. • A is negative semidefinite (NSD) if for all vectors x ∈ Rn , x T Ax ≤ 0. – This is often denoted A ⪯ 0 or A ≤ 0. • A is indefinite if it is neither positive semidefinite nor negative semidefinite, that is, if there exists x1 , x2 ∈ Rn such that x1T Ax1 > 0 and x2T Ax2 < 0. 
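For a symmetric matrix, these conditions can be checked numerically via its eigenvalues: it is positive definite iff all eigenvalues are positive, positive semidefinite iff all are non-negative, and so on. Here is a rough numpy sketch (added for illustration, not from the original notes):

```python
import numpy as np

def classify_definiteness(A, tol=1e-10):
    """Classify a symmetric matrix by the signs of its eigenvalues."""
    eigvals = np.linalg.eigvalsh(A)  # eigenvalues of a symmetric matrix
    if np.all(eigvals > tol):
        return "positive definite"
    if np.all(eigvals >= -tol):
        return "positive semidefinite"
    if np.all(eigvals < -tol):
        return "negative definite"
    if np.all(eigvals <= tol):
        return "negative semidefinite"
    return "indefinite"

# A Gram matrix A^T A is always positive semidefinite (here, positive definite).
A = np.array([[2.0, 0.0], [1.0, 1.0]])
print(classify_definiteness(A.T @ A))                               # positive definite
print(classify_definiteness(np.array([[1.0, 0.0], [0.0, -1.0]])))   # indefinite
```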
Some other properties of note:

• If $A$ is positive definite, then $-A$ is negative definite, and vice versa.
• If $A$ is positive semidefinite, then $-A$ is negative semidefinite, and vice versa.
• If $A$ is indefinite, then $-A$ is also indefinite.
• Positive definite and negative definite matrices are always invertible.
• For any matrix $A \in \mathbb{R}^{m \times n}$, which does not need to be symmetric or square, the matrix $G = A^TA$, called a Gram matrix, is always positive semidefinite. If $m \geq n$ and $A$ is full rank, then $G$ is positive definite.

Essentially, "positive semidefinite" is to matrices as "non-negative" is to scalar values (and "positive definite" as "positive" is to scalar values).

3.3.11 The Trace

The trace of a square matrix $A \in \mathbb{R}^{n \times n}$ is denoted $\text{tr}(A)$ and is the sum of the diagonal elements of the matrix:

$$\text{tr}(A) = \sum_{i=1}^n A_{ii}$$

The trace has the following properties:

• $\text{tr}(A) = \text{tr}(A^T)$
• For $A, B \in \mathbb{R}^{n \times n}$, $\text{tr}(A + B) = \text{tr}(A) + \text{tr}(B)$
• For $t \in \mathbb{R}$, $\text{tr}(tA) = t\,\text{tr}(A)$
• If $AB$ is square, then $\text{tr}(AB) = \text{tr}(BA)$
• If $ABC$ is square, then $\text{tr}(ABC) = \text{tr}(BCA) = \text{tr}(CAB)$, and so on for products of more matrices

3.3.12 Orthogonal matrix

Say we have an $n \times k$ matrix $C$ whose columns form an orthonormal set. If $k = n$, then $C$ is a square matrix ($n \times n$), and since $C$'s columns are linearly independent, $C$ is invertible. For such a matrix:

$$C^TC = I_n, \quad C^{-1}C = I_n \quad \therefore \quad C^T = C^{-1}$$

When $C$ is an $n \times n$ (i.e. square) matrix whose columns form an orthonormal set, we say that $C$ is an orthogonal matrix. Orthogonal matrices have the property $C^TC = I = CC^T$.

Orthogonal matrices also have the property that operating on a vector with an orthogonal matrix will not change its Euclidean norm, i.e. $||Cx||_2 = ||x||_2$ for any $x \in \mathbb{R}^n$.

Orthogonal matrices preserve angles and lengths

For an orthogonal matrix $C$, multiplying a vector by $C$ preserves the vector's length, and preserves the angles between vectors:

$$||\vec{x}|| = ||C\vec{x}||$$

and the angle between $C\vec{x}$ and $C\vec{y}$ is the same as the angle between $\vec{x}$ and $\vec{y}$.

3.3.13 Adjoints

The classical adjoint, often just called the adjoint, of a matrix $A \in \mathbb{R}^{n \times n}$ is denoted $\text{adj}(A)$ and defined as:

$$\text{adj}(A)_{ij} = (-1)^{i+j}|A_{\neg j, \neg i}|$$

Note that the indices are switched in $A_{\neg j, \neg i}$.

3.4 Subspaces

Say we have a set of vectors $V$ which is a subset of $\mathbb{R}^n$, that is, every vector in the set has $n$ components. $V$ is a linear subspace of $\mathbb{R}^n$ if:

• $V$ contains the zero vector $\vec{0}$
• for any vector $\vec{x}$ in $V$, $c\vec{x}$ (where $c \in \mathbb{R}$) is also in $V$, i.e. closure under scalar multiplication
• for any vectors $\vec{a}$ and $\vec{b}$ in $V$, $\vec{a} + \vec{b}$ is also in $V$, i.e. closure under addition

Example: say we have the set of vectors

$$S = \left\{ \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \in \mathbb{R}^2 : x_1 \geq 0 \right\}$$

i.e. the half-plane where $x_1 \geq 0$. Is $S$ a subspace of $\mathbb{R}^2$?

• It does contain the zero vector.
• It is closed under addition:

$$\begin{bmatrix} a \\ b \end{bmatrix} + \begin{bmatrix} c \\ d \end{bmatrix} = \begin{bmatrix} a + c \\ b + d \end{bmatrix}$$

Since $a$ and $c$ are both $\geq 0$ (that was the criterion for membership in the set), $a + c$ will also be $\geq 0$, so the sum is also in the set (there is no constraint on the second component, so it doesn't matter what that is).

• It is NOT closed under scalar multiplication:

$$-1\begin{bmatrix} a \\ b \end{bmatrix} = \begin{bmatrix} -a \\ -b \end{bmatrix}$$

For any vector in the set with $a > 0$, $-a < 0$, which violates the constraint $x_1 \geq 0$, so the result is not in the set.

So no, this set is not a subspace of $\mathbb{R}^2$.

3.4.1 Spans and subspaces

Let's say we have the set:

$$U = \text{span}(\vec{v}_1, \vec{v}_2, \vec{v}_3)$$

where each vector has $n$ components. Is this a valid subspace of $\mathbb{R}^n$?
Since the span represents all the linear combinations of 2 those vectors, we can define an arbitrary vector in the set Is the shaded set of vectors S a subspace of R ? as: ⃗ x = c1 v⃗1 + c2 v⃗2 + c3 v⃗3 • the set does contain the zero vector: 0v⃗1 + 0v⃗2 + 0v⃗3 = ⃗0 • it is closed under multiplication, since the following is just another linear combination: a⃗ x = ac1 v⃗1 + ac2 v⃗2 + ac3 v⃗3 • it is closed under addition, since if we take another arbitrary vector in the set: y⃗ = d1 v⃗1 + d2 v⃗2 + d3 v⃗3 and add them: 62 CHAPTER 3. LINEAR ALGEBRA 63 3.4. SUBSPACES ⃗ x + y⃗ = (c1 + d1 )v⃗1 + (c2 + d2 )v⃗2 + (c3 + d3 )v⃗3 that’s also just another linear combination in the set. 3.4.2 Basis of a subspace If we have a subspace V = span(S) where the set of vectors S = v⃗1 , v⃗2 , . . . , v⃗n is linearly independent, then we can say that S is a basis for V . A set S is the basis for a subspace V if S is linearly independent and its span defines V . In other words, the basis is the minimum set of vectors that spans the subspace that it is a basis of. All bases for a subspace will have the same number of elements. Intuitively, the basis of a subspace is a set of vectors that can be linearly combined to describe any vector in that subspace. For example, the vectors [0,1] and [1,0] form a basis for R2 . 3.4.3 Dimension of a subspace The dimension of a subspace is the number of elements in a basis for that subspace. 3.4.4 Nullspace of a matrix Say we have: A⃗ x = ⃗0 If you have a set N of all x ∈ Rn that satisfies this equation, do you have a valid subspace? x = ⃗0 this equation is satisfied. So we know the zero vector is part of this set (which is Of course if ⃗ a requirement for a valid subspace). The other two properties (closure under multiplication and addition) necessary for a subspace also hold: A(v⃗1 + v⃗2 ) = Av⃗1 + Av⃗2 = ⃗0 A(c v⃗1 ) = ⃗0 and of course ⃗0 is in the set N. So yes, the set N is a valid subspace, and it is a special subspace: the nullspace of A, notated: N(A) CHAPTER 3. LINEAR ALGEBRA 63 3.4. SUBSPACES 64 That is, the nullspace for a matrix A is the subspace described by the set of vectors which yields the zero vector which multiplied by A, that is, the set of vectors which are the solutions for ⃗ x in: A⃗ x = ⃗0 Or, more formally, if A is an m × n matrix: N(A) = {⃗ x ∈ Rn | A⃗ x = ⃗0} The nullspace for a matrix A may be notated N (A). To put nullspace (or “kernel”) another way, it is space of all vectors that map to the zero vector after applying the transformation the matrix represents. Nullspace and linear independence If you take each column in a matrix A as a vector v⃗i , that set of vectors is linearly independent if the nullspace of A consists of only the zero vector. That is, if: N(A) = {⃗0} The intuition behind this is because, if the linear combination of a set of vectors can only equal the zero vector if all of its coefficients are zero (that is, its coefficients are components of the zero vector), then it is linearly independent: 0 0 x1 v⃗1 + x2 v⃗2 + · · · + xn v⃗n = ⃗0 iff ⃗ x = ... 0 Nullity The nullity of a nullspace is its dimension, that is, it is the number of elements in a basis for that nullspace. dim(N(A)) = nullity(N(A)) 64 CHAPTER 3. LINEAR ALGEBRA 65 3.4. SUBSPACES Left nullspace The left nullspace of a matrix A is the nullspace of its transpose, that is N(AT ) : N(AT ) = {⃗ x |⃗ x T A = ⃗0T } 3.4.5 Columnspace Again, a matrix can be represented as a set of column vectors. 
The columnspace of a matrix (also called the range of the matrix) is all the linear combinations (i.e. the span) of these column vectors: [ A = v⃗1 v⃗2 . . . ] v⃗n C(A) = span(v⃗1 , v⃗2 , . . . , v⃗n ) Because any span is a valid subspace, the columnspace of a matrix is a valid subspace. So if you expand out the matrix-vector product, you’ll see that every matrix-vector product is within that matrix’s columnspace: {A⃗ x |⃗ x ∈ Rn } A⃗ x = x1 v⃗1 + x2 v⃗2 + · · · + xn v⃗n A⃗ x = C(A) That is, for any vector in the space Rn , multiplying the matrix by it just yields another linear combination of that matrix’s column vectors. Therefore it is also in the columnspace. The columnspace (range) for a matrix A may be notated R(A). Rank of a columnspace The column rank of a columnspace is its dimension, that is, it is the number of elements in a basis for that columnspace (i.e. the largest number of columns of the matrix which constitute a linearly independent set): dim(C(A)) = rank(C(A)) Rowspace The rowspace of a matrix A is the columnspace of AT , i.e. C(AT ). The row rank of a matrix is similarly the number of elements in a basis for that rowspace. CHAPTER 3. LINEAR ALGEBRA 65 3.4. SUBSPACES 3.4.6 66 Rank Note that for any matrix A, the column rank and the row rank are equal, so they are typically just referred to as rank(A). The rank has some properties: • • • • For A ∈ Rm×n , rank(A) ≤ min(m, n). If rank(A) = min(m, n), then A is said to be full rank. For A ∈ Rm×n , rank(A) = rank(AT ) For A ∈ Rm×n , B ∈ Rn×p , rank(AB) ≤ min(rank(A), rank(B)) For A, B ∈ Rm×n , rank(A + B) ≤ rank(A) + rank(B) The rank of a transformation refers to the number of dimensions in the output. A matrix is full-rank if it has rank equal to the number of dimensions in its originating space. I.e. they represent a transformation that preserves the dimensionality (it does not collapse it to a lower dimension space). 3.4.7 The standard basis The set of column vectors in an identity matrix In is known as the standard basis for Rn . Each of those column vectors is notated e⃗i . E.g., in an identity matrix, the column vector: 1 0 0 = e⃗1 . .. 0 For a transformation T (⃗ x ), its transformation matrix A can be expressed as: [ A = T (e⃗1 ) T (e⃗2 ) . . . 3.4.8 ] T (e⃗n ) Orthogonal compliments Given that V is some subspace of Rn , the orthogonal compliment of V , notated V ⊥ : V ⊥ = {⃗ x ∈ Rn | ⃗ x ·⃗ v = 0∀⃗ v ∈V} That is, the orthogonal compliment of a subspace V is the set of all vectors where the dot product of each vector with each vector from V is 0, that is where all vectors in the set are orthogonal to all vectors in V . V ⊥ is a subspace (proof omitted). 66 CHAPTER 3. LINEAR ALGEBRA 67 3.4. SUBSPACES Columnspaces, nullspaces, and transposes C(A) is the orthogonal compliment to N(AT ), and vice versa: N(AT ) = C(A)⊥ N(AT )⊥ = C(A) C(AT ) is the orthogonal compliment to N(A), and vice versa. N(A) = C(AT )⊥ N(A)⊥ = C(AT ) As a reminder, columnspaces and nullspaces are spans, i.e. sets of linear combinations, i.e. lines, so these lines are orthogonal to each other. Dimensionality and orthogonal compliments For V , a subspace of Rn : dim(V ) + dim(V ⊥ ) = n (proof omitted here) The intersection of orthogonal compliments Since the vectors between a subspace and its orthogonal compliment are all orthogonal: V ∩ V ⊥ = {⃗0} That is, the only vector which exists both in a subspace and its orthogonal compliment is the zero vector. 
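These relationships are easy to sanity-check numerically. Below is a minimal sketch (assuming NumPy and SciPy are available, and using an arbitrary rank-deficient example matrix) that confirms rank(A) = rank(A^T), that the nullity of A is n − rank(A) (as implied by N(A) = C(A^T)⊥ together with the dimension formula above), and that every nullspace vector is orthogonal to the rows of A:

import numpy as np
from scipy.linalg import null_space

# An arbitrary rank-deficient 2x3 example (second row = 2 * first row).
A = np.array([[1., 2., 3.],
              [2., 4., 6.]])
n = A.shape[1]

r = np.linalg.matrix_rank(A)
print(r, np.linalg.matrix_rank(A.T))   # column rank == row rank (both 1 here)

N = null_space(A)                      # columns form an orthonormal basis for N(A)
print(N.shape[1], n - r)               # nullity == n - rank(A)  (2 == 2 here)

print(np.allclose(A @ N, 0))           # A x = 0 for every x in the nullspace
print(np.allclose(N.T @ A.T, 0))       # nullspace vectors are orthogonal to C(A^T)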
3.4.9 Coordinates with respect to a basis With a subspace V of Rn , we have V ’s basis, B, as B = {v⃗1 , v⃗2 , . . . , v⃗k } We can describe any vector a⃗ ∈ V as a linear combination of the vectors in its basis B : CHAPTER 3. LINEAR ALGEBRA 67 3.4. SUBSPACES 68 a⃗ = c1 v⃗1 + c2 v⃗2 + · · · + ck v⃗k We can take these coefficients c1 , c2 , . . . , ck as the coordinates of a⃗ with respect to B, notated as: c1 c2 [⃗ a ]B = ... ck Basically what has happened here is a new coordinate system based off of the basis B is being used. Example [ ] [ ] 2 1 Say we have v⃗1 = , v⃗2 = , where B = {v⃗1 , v⃗2 } is the basis for R2 . 1 2 The point (8, 7) in R2 is equal to 3v⃗1 + 2v⃗2 . If we set: a⃗ = 3v⃗1 + 2v⃗2 Then we can describe a⃗ with respect to B : [ ] 3 2 [⃗ a ]B = which looks like: Change of basis matrix Given the basis: B = {v⃗1 , v⃗2 , . . . , v⃗k } and: c1 c2 [⃗ a ]B = ... ck 68 CHAPTER 3. LINEAR ALGEBRA 69 3.4. SUBSPACES Coordinates wrt a basis say there is some n × k matrix where the column vectors are the basis vectors: C = [v⃗1 , v⃗2 , . . . , v⃗k ] We can do: C[⃗ a]B = a⃗ The matrix C is known as the change of basis matrix and allows us to get a⃗ in standard coordinates. Invertible change of basis matrix Given the basis of some subspace: B = {v⃗1 , v⃗2 , . . . , v⃗k } where v⃗1 , v⃗2 , . . . , v⃗k ∈ Rn , and we have a change of basis matrix: C = [v⃗1 , v⃗2 , . . . , v⃗k ] Assume: CHAPTER 3. LINEAR ALGEBRA 69 3.4. SUBSPACES 70 • C is invertible • C is square (that is, k = n, which implies that we have n basis vectors, that is, B is a basis for Rn ) • C’s columns are linearly independent (which they are because it is formed out of basis vectors, which by definition are linearly independent) Under these assumptions: • If C is invertible, the span of B is equal to Rn . • If the span of B is equal to Rn , C is invertible. Thus: [⃗ a]B = C −1 a⃗ Transformation matrix with respect to a basis Say we have a linear transformation T : Rn → Rn , which we can express as T (⃗ x ) = A⃗ x . This is with respect to the standard basis; we can say A is the transformation for T with respect to the standard basis. Say we have another basis B = {v⃗1 , v⃗2 , . . . , v⃗n } for Rn , that is, it is a basis for for Rn . We could write: [T (⃗ x )]B = D[⃗ x ]B and we call D the transformation matrix for T with respect to the basis B. Then we have (proof omitted): D = C−1 AC where: • D is the transformation matrix for T with respect to the basis B • A is the transformation matrix for T with respect to the standard basis • C is the change of basis matrix for B 3.4.10 Orthonormal bases If B is an orthonormal set, it is linearly independent, and thus it could be a basis. If B is a basis, then it is an orthonormal basis. 70 CHAPTER 3. LINEAR ALGEBRA 71 3.5. TRANSFORMATIONS Coordinates with respect to orthonormal bases Orthonormal bases make good coordinate systems - it is much easier to find [⃗ x ]B if B is an orthonormal basis. It is just: c1 v⃗1 · ⃗ x c2 v⃗2 · ⃗ x = . [⃗ x ]B = ... .. ck v⃗k · ⃗ x Note that the standard basis for Rn is an orthonormal basis. 3.5 Transformations A transformation is just a function which operates on vectors, which, instead of using f , is usually denoted T . 
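Before moving on to transformations, the change-of-basis machinery above is easy to verify numerically. A minimal sketch (assuming NumPy is available), reusing the earlier example with basis B = {[2, 1], [1, 2]} and the point (8, 7), and also checking that for an orthonormal basis the coordinates are just the dot products described above:

import numpy as np

# Change of basis matrix C: its columns are the basis vectors of B.
C = np.array([[2., 1.],
              [1., 2.]])
a = np.array([8., 7.])

a_B = np.linalg.solve(C, a)     # [a]_B = C^{-1} a
print(a_B)                      # [3. 2.], matching the worked example
print(C @ a_B)                  # back to standard coordinates: [8. 7.]

# For an orthonormal basis (here an arbitrary one built via QR), [x]_B is just v_i . x.
Q, _ = np.linalg.qr(np.random.default_rng(0).standard_normal((3, 3)))
x = np.array([1., 2., 3.])
print(np.allclose(Q.T @ x, np.linalg.solve(Q, x)))   # dot products equal C^{-1} x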
3.5.1 Linear transformations A linear transformation is a transformation: T : Rn → Rm where we can take two vectors a⃗, ⃗b ∈ Rn and the following conditions are satisfied: T (⃗ a + ⃗b) = T (⃗ a) + T (⃗b) T (c a⃗) = cT (⃗ a) Put another way, a linear transformation is a transformation in which lines are preserved (they don’t become curves) and the origin remains at the origin. This can be thought of as transforming space such that grid lines remain parallel and evenly-spaced. A linear transformation of a space can be described in terms of transformations of the space’s basis vectors, e.g. î, ĵ. For example, if the basis vectors î, ĵ end up at [a, c], [b, d] respectively, an arbitrary vector [x, y ] would be transformed to: [ ] [ ] [ a b ax + by x +y = c d cx + dy ] which is equivalent to: CHAPTER 3. LINEAR ALGEBRA 71 3.5. TRANSFORMATIONS 72 [ a b c d ][ ] x y In this way, we can think of matrices as representing a transformation of space and the transformation itself as a product with that matrix. Extending this further, you can think of matrix multiplication as a composition of transformations - each matrix represents one transformation; the resulting matrix product is a composition of those transformations. The matrix does not have to be square; i.e. it does not have to share dimensionality (in terms of the matrix’s rows) with the space it’s being applied to. The resulting transformation will have different dimensions. For example, a 3x2 matrix will transform a 2D space to a 3D space. Linear transformation examples These examples are all in R2 since it’s easier to visualize. But you can scale them up to any Rn . Reflection To get from the triangle on the left and reflect it over the y -axis to get the triangle on the right, all you’re doing is changing the sign of all the x values. So a transformation would look like: [ ] [ x −x T( )= y y ] An example of reflection. Scaling Say you want to double the size of the triangle instead of flipping it. You’d just scale up all of its values: [ ] [ x 2x T( )= y 2y 72 ] CHAPTER 3. LINEAR ALGEBRA 73 3.6. IMAGES Compositions of linear transformation The composition of linear transformations S(⃗ x ) = A⃗ x and T (⃗ x ) = B⃗ x is denoted: T ◦ S(⃗ x ) = T (S(⃗ x )) This is read: “the composition of T with S”. If T : Y → Z and S : X → Y , then T ◦ S : X → Z. A composition of linear transformations is also a linear transformation (proof omitted here). Because of this, this composition can also be expressed: ⃗ = C⃗ T ◦ S(X) x Where C = BA (proof omitted), so: ⃗ = BA⃗ T ◦ S(X) x 3.5.2 Kernels The kernel of T, denoted ker(T ), is all of the vectors in the domain such that the transformation of those vectors is equal to the zero vector: ker(T ) = {⃗ x ∈ Rn | T (⃗ x ) = {⃗0}} You may notice that, because T (⃗ x ) = A⃗ x, ker(T ) = N(A) That is, the kernel of the transformation is the same as the nullspace of the transformation matrix. 3.6 Images 3.6.1 Image of a subset of a domain When you pass a set of vectors (i.e. a subset of a domain Rn ) through a transformation, the result is called the image of the set under the transformation. E.g. T (S) is the image of S under T . CHAPTER 3. LINEAR ALGEBRA 73 3.6. IMAGES 74 For example, say we have some vectors which define the triangle on the left. When a transformation is applied to that set, the image is the result on the right. 
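The same picture is easy to reproduce numerically. A minimal sketch (assuming NumPy is available; the triangle's vertex coordinates are arbitrary example values): the vertices are collected as columns of a matrix, the image under the reflection and scaling transformations above is just a matrix product, and composing the two amounts to multiplying their matrices:

import numpy as np

# Vertices of a triangle, one vertex per column (arbitrary example values).
triangle = np.array([[0., 2., 1.],
                     [0., 0., 2.]])

reflect = np.array([[-1., 0.],   # T([x, y]) = [-x, y]
                    [ 0., 1.]])
scale = np.array([[2., 0.],      # T([x, y]) = [2x, 2y]
                  [0., 2.]])

print(reflect @ triangle)        # image of the triangle under the reflection
print(scale @ triangle)          # image of the triangle under the scaling

# Scaling after reflecting is the same as multiplying by the single matrix (scale @ reflect).
print(np.allclose(scale @ (reflect @ triangle), (scale @ reflect) @ triangle))   # True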
Another example: if we have a transformation T : X → Y and A which is a subset of T , then T (A) is the image of A under T , which is equivalent to the set of transformations for each vector in A : T (A) = {T (⃗ x) ∈ Y | ⃗ x ∈ A} Example: Image of a triangle. T : X → Y , where A ⊆ X. Images describe surjective functions, that is, a surjective function f : X → Y can also be written: im(f ) = Y since the image of the transformation encompasses the entire codomain Y . 3.6.2 Image of a subspace The image of a subspace under a transformation is also a subspace. That is, if V is a subspace, T (V ) is also a subspace. 3.6.3 Image of a transformation If, instead of a subset or subspace, you take the transformation of an entire space, i.e. T (Rn ), the terminology is different: that is called the image of T , notated im(T ). Because we know matrix-vector products are linear transformations: 74 CHAPTER 3. LINEAR ALGEBRA 75 3.7. PROJECTIONS T (⃗ x ) = A⃗ x The image of a linear transformation matrix A is equivalent to its column space, that is: im(T ) = C(A) 3.6.4 Preimage of a set The preimage is the inverse image. For instance, consider a transformation mapping from the domain X to the codomain Y : T :X→Y And say you have a set S which is a subset of Y . You want to find the set of values in X which map to S, that is, the subset of X for which S is the image. For a set S, this is notated: T −1 (S) Note that not every point in S needs to map back to X. That is, S may contain some points for which there are no corresponding points in X. Because of this, the image of the preimage of S is not necessarily equivalent to S, but we can be sure that it is at least a subset: T (T −1 (S)) ⊆ S 3.7 Projections A projection can kind of be thought of as a “shadow” of a vector: Alternatively, it can be thought of answering “how far does one vector go in the direction of another vector?”. In the accompanying figure, the projection of b onto a tells us how far b goes in the direction of a. The projection of ⃗ x onto line L is notated: ProjL (⃗ x) CHAPTER 3. LINEAR ALGEBRA 75 3.7. PROJECTIONS 76 Here, we have the projection of ⃗ x - the red vector - onto the green line L. The projection is the dark red vector. This 2 example is in R but this works in any Rn . More formally, a projection of a vector ⃗ x onto a line L is some vector in L where ⃗ x − ProjL (⃗ x ) is orthogonal to L. A line can be expressed as the set of all scalar multiples of a vector, i.e: L = {c⃗ v | c ∈ R} So we know that “some vector in L” can be represented as c⃗ v : ProjL (⃗ x ) = c⃗ v By our definition of a projection, we also know that ⃗ x − ProjL (⃗ x ) is orthogonal to L, which can now be rewritten as: (⃗ x − c⃗ v) · ⃗ v = ⃗0 (This is the definition of orthogonal vectors.) 76 CHAPTER 3. LINEAR ALGEBRA 77 3.7. PROJECTIONS Written in terms of c, this simplifies down to: c= ⃗ x ·⃗ v ⃗ v ·⃗ v So then we can rewrite: ProjL (⃗ x) = ⃗ x ·⃗ v ⃗ v ⃗ v ·⃗ v ProjL (⃗ x) = ⃗ x ·⃗ v ⃗ v ||⃗ v ||2 or, better: And you can pick whatever vector for ⃗ v so long as it is part of line L. v is a unit vector, then the projection is simplified even further: However, if ⃗ ProjL (⃗ x ) = (⃗ x · û)û Projections are linear transformations (they satisfy the requirements, proof omitted), so you can represent them as matrix-vector products: ProjL (⃗ x ) = A⃗ x where the transformation matrix A is: [ u12 u2 u1 A= u1 u2 u22 ] where ui are components of the unit vector. Also note that the length of a projection (i.e. 
the scalar component of the projection) is given by the dot product of the two vectors. For example, in the accompanying figure, the length of Proja⃗(⃗b) is a · b. 3.7.1 Projections onto subspaces Given that V is a subspace of Rn , we know that V ⊥ is also a subspace of Rn , and we have a vector ⃗ x such that ⃗ x ∈ Rn , we know that ⃗ x =⃗ v +w ⃗ where ⃗ v ∈ V and w ⃗ ∈ V ⊥ , then: CHAPTER 3. LINEAR ALGEBRA 77 3.7. PROJECTIONS 78 ProjV ⃗ x =⃗ v ProjV ⊥ ⃗ x =w ⃗ Reminder: a projection onto a subspace is the same as a projection onto a line (a line is a subspace): ProjV ⃗ x= ⃗ x ·⃗ v ⃗ v ⃗ v ·⃗ v where: V = span(⃗ v) V = {c⃗ v | c ∈ R} So ProjV ⃗ x is the unique vector ⃗ v ∈ V such that ⃗ x =⃗ v +w ⃗ where w ⃗ is a unique member of V ⊥ . Projection onto a subspace as a linear transform ProjV (⃗ x ) = A(AT A)−1 AT ⃗ x where V is a subspace of Rn . Note that A(AT A)−1 AT is just some matrix, which we can call B, so this is in the form of a linear transform, B⃗ x. Also ⃗ v = ProjV ⃗ x , so: ⃗ x = ProjV ⃗ x +w ⃗ where w ⃗ is a unique member of V ⊥ . Projections onto subspaces with orthonormal bases Given that V is a subspace of Rn and B = {v⃗1 , v⃗2 , . . . , v⃗k } is an orthonormal basis for V . We have a vector ⃗ x ∈ Rn , so ⃗ x =⃗ v +w ⃗ where ⃗ v ∈ V and w ⃗ ∈ V ⊥. We know (see previously) that by definition: ProjV (⃗ x ) = A(AT A)−1 AT ⃗ x which is quite complicated. It is much simpler for orthonormal bases: ProjV (⃗ x ) = AAT ⃗ x 78 CHAPTER 3. LINEAR ALGEBRA 79 3.8. IDENTIFYING TRANSFORMATION PROPERTIES 3.8 Identifying transformation properties 3.8.1 Determining if a transformation is surjective A transformation T (⃗ x ) = A⃗ x is surjective (“onto”) if the column space of A equals the codomain: span(a1 , a2 , . . . , an ) = C(A) = Rm which can also be stated as: rank(A) = m 3.8.2 Determining if a transformation is injective A transformation T (⃗ x ) = A⃗ x is injective (“one-to-one”) if the the nullspace of A contains only the zero vector: N(A) = {⃗0} which is true if the set of A’s column vectors is linearly independent. This can also be stated as: rank(A) = n 3.8.3 Determining if a transformation is invertible A transformation is invertible if it is both injective and surjective. For a transformation to be surjective: rank(A) = m And for a transformation to be surjective: rank(A) = n CHAPTER 3. LINEAR ALGEBRA 79 3.9. EIGENVALUES AND EIGENVECTORS 80 Therefore for a transformation to be invertible: rank(A) = m = n So the transformation matrix A must be a square matrix. 3.8.4 Inverse transformations of linear transformations Inverse transformations are linear transformations if the original transformation is both linear and invertible. That is, if T is invertible and linear, T −1 is linear: T −1 (⃗ x ) = A−1⃗ x (T −1 ◦ T )(⃗ x ) = A−1 A⃗ x = In ⃗ x = AA−1⃗ x = (T ◦ T −1 )(⃗ x) 3.9 Eigenvalues and Eigenvectors Say we have a linear transformation T : Rn → Rn : T (⃗ v ) = A⃗ v = λ⃗ v That is, ⃗ v is scaled by a transformation matrix λ. We say that: • ⃗ v is the eigenvector for T • λ is the eigenvalue associated with that eigenvector Eigenvectors are vectors for which matrix multiplication is equivalent to only a scalar multiplication, nothing more. λ, the eigenvalue, is the scalar that the transformation matrix A is equivalent to. Another way to put this: given a square matrix A ∈ Rn×n , we say λ ∈ C is an eigenvalue of A and x ∈ Cn is the corresponding eigenvector if: Ax = λx, x ̸= 0 Note that C refers to the set of complex numbers. 
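As a quick numerical illustration of this definition (a minimal sketch assuming NumPy is available, with an arbitrary symmetric example matrix so that the eigenvalues come out real), np.linalg.eig returns the eigenvalues and eigenvectors of a square matrix, and each pair can be checked directly against Ax = λx:

import numpy as np

A = np.array([[2., 1.],
              [1., 2.]])              # arbitrary symmetric example

eigvals, eigvecs = np.linalg.eig(A)   # eigvecs[:, i] is a unit-length eigenvector for eigvals[i]
print(eigvals)                        # 3.0 and 1.0 (in some order)

for lam, v in zip(eigvals, eigvecs.T):
    print(np.allclose(A @ v, lam * v))   # A v == lambda v for every pair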
So this means that multiplying A by x just results in a new vector which points in the same direction has x but scaled by a factor λ. 80 CHAPTER 3. LINEAR ALGEBRA 81 3.9. EIGENVALUES AND EIGENVECTORS For any eigenvector x ∈ Cn and a scalar t ∈ C, A(cx) = cAx = cλx = λ(cx), that is, cx is also an eigenvector - but when talking about “the” eigenvector associated with λ, it is assumed that the eigenvector is normalized to length 1 (though you still have the ambiguity that both x and −x are eigenvectors in this sense). Eigenvalues and eigenvectors come up when maximizing some function of a matrix. So what are our eigenvectors? What ⃗ v satisfies: A⃗ v = λ⃗ v, ⃗ v ̸= 0 We can do: A⃗ v = λ⃗ v ⃗0 = λ⃗ v − A⃗ v We know that ⃗ v = In ⃗ v , so we can do: ⃗0 = λIn ⃗ v − A⃗ v = (λIn − A)⃗ v The first term, λIn − A, is just some matrix which we can call B, so we have: ⃗0 = B⃗ v which, by our definition of nullspace, indicates that ⃗ v is in the nullspace of B. That is: ⃗ v ∈ N(λIn − A) 3.9.1 Properties of eigenvalues and eigenvectors ∑ The trace of A is equal to the sum of its eigenvalues: tr(A) = ni=1 λi . ∏ The determinant of A is equal to the product of its eigenvalues: |A| = ni=1 λi . The rank of A is equal to the number of non-zero eigenvalues of A. If A is non-singular, then λ1i is an eigenvalue of A−1 with associated eigenvector xi , i.e. A−1 xi = ( λ1i )xi . • The eigenvalues of a diagonal matrix D = diag(d1 , . . . , dn ) are just the diagonal entries d1 , . . . , d n . • • • • CHAPTER 3. LINEAR ALGEBRA 81 3.9. EIGENVALUES AND EIGENVECTORS 3.9.2 82 Diagonalizable matrices All eigenvector equations can be written simultaneously as: AX = XΛ where the columns of X ∈ Rn×n are the eigenvectors of A and Λ is a diagonal matrix whose entries are the eigenvalues of A, i.e. Λ = diag(λ1 , . . . , λn ). If the eigenvectors of A are linearly independent, then the matrix X will be invertible, so that A = XΛX −1 . A matrix that can be written in this form is called diagonalizable. 3.9.3 Eigenvalues & eigenvectors of symmetric matrices For a symmetric matrix A ∈ Sn , the eigenvalues of A are all real and the eigenvectors of A are all orthonormal. If for all of A’s eigenvalues λi … • • • • • λi > 0, then A is positive definite. λi ≥ 0, then A is positive semidefinite. λi < 0, then A is negative definite. λi ≤ 0, then A is negative semidefinite. have both positive and negative values, then A is indefinite. Example Say we have a linear transformation T (⃗ x ) = A⃗ x . Here are some example values of ⃗ x being input and the output vectors they yield (it’s not important here what A actually looks like, its just to help distinguish what is and isn’t an eigenvector.) [ ] [ ] 1 1 • A = 0 1 – ⃗ x is not an eigenvector, it was not merely scaled by A. [ ] [ ] 4 8 • A = 7 14 – ⃗ x is an eigenvector, it was only scaled by A. This is a simple example where the vector was scaled up by 2, so the eigenvalue here is 2. 82 CHAPTER 3. LINEAR ALGEBRA 83 3.10. TENSORS 3.9.4 Eigenspace The eigenvectors that correspond to an eigenvalue λ form the eigenspace for that λ, notated Eλ : Eλ = N(λIn − A) 3.9.5 Eigenbasis Say we have an n × n matrix A. An eigenbasis is a basis for Rn consisting entirely of eigenvectors for A. 3.10 Tensors Tensors are generalizations of scalars, vectors, and matrices. A tensor is distinguished by its rank, which is the number of indices it has. A scalar is a 0th -rank tensor (it has no indices), a vector is a 1th -rank tensor, i.e. its components are accessed by one index, e.g. 
xi , and a matrix is a 2th -rank tensor, i.e. its components are accessed by two indices, e.g. Xi,j , and so on. Just as we have scalar and vector fields, we also have tensor fields. 3.11 References • Math for Machine Learning. Hal Daumé III. August 28, 2009. • Tensor. Rowland, Todd and Weisstein, Eric W. Wolfram MathWorld. • Essence of Linear Algebra. 3Blue1Brown. CHAPTER 3. LINEAR ALGEBRA 83 3.11. REFERENCES 84 84 CHAPTER 3. LINEAR ALGEBRA 85 4 Calculus 4.1 Differentiation The slope of a two-dimensional function (in higher dimensions, the term gradient is used instead of “slope”; in particular, the gradient is the vector of partial derivatives) can be thought of as the rate of change for that function. For a linear function f (x) = ax + b, a is the slope; it is constant for all x throughout the function. But for non-linear functions, e.g. f (x) = 3x 2 , the slope varies along with x. Differentation is a way to find another function, called the derivative of the original function, that gives us the rate of change (slope) of one variable with respect to another variable. It tells us how to change the input in order to get a change in the output: f (x + ϵ) ≈ f (x) + ϵf ′ (x) This will become useful later on - many machine learning training methods use derivatives (in particular, multidimensional partial derivatives, i.e. gradients) to determine how to update weights (inputs) in order to reduce error (the output). 4.1.1 Computing derivatives Say that we want to compute the rate of change (slope) at a single point. How? It takes two points to define a line, which we can easily compute the slope for. Instead of a single point, we can consider two points that are very, very close together: (x, f (x))and(x + h, f (x + h)) CHAPTER 4. CALCULUS 85 4.1. DIFFERENTIATION 86 Note that sometimes δ or δx is used instead of h. Their slope is then given by: f (x + h) − f (x) f (x + h) − f (x) = x +h−x h We want the two points as close as possible, so we can look at the limit of h → 0: f (x + h) − f (x) h→0 h lim That is the derivative of f : f (x + h) − f (x) h→0 h f ′ (x) = lim If this limit exists, we say that f is differentiable at x and that its derivative at x is f ′ (x). Example We have a car and have a variable x which describes its position at a given point in time t. That is, f (t) = x. With differentiation we can get dx dt which is the rate of change of the car’s position wrt to time, i.e. the speed (velocity) of the car. ∆x Note that this is not the same as ∆y , which gives us the change in x over a time interval ∆t. This is the average velocity over that time interval. If instead we want instantaneous velocity - the velocity at a given point in time - we need to have the time interval ∆t approach 0 (we can’t set ∆t to 0 because then we have division by 0). This is equivalent to the derivative described previously: lim ∆t→0 dx ∆x = ∆t dt This can be read as • “the rate of change in x with respect to t”, or • “an infinitesimal value of y divided by an infinitesimal value of x” For a given function f (x), this can also be written as: d f (x + ∆) − f (x) f (x) = lim ∆→0 dx ∆ 86 CHAPTER 4. CALCULUS 87 4.1. DIFFERENTIATION 4.1.2 Notation A derivative of a function y = f (x) may be notated: • f ′ (x) • Dx [f (x)] • Df (x) • dy dx d • dx [y ] As a special case, if we are looking at a variable with respect to time t, we can use Newton’s dot notation: • f˙ = 4.1.3 df dt Differentiation rules • Derivative of a constant function: For any fixed real number c, d dx [c] = 0. 
– This is because a constant function is just a horizontal line (it has a slope of 0). • • • • • • d Derivative of a linear function: For any fixed real numbers m and c, dx [mx + c] = m d d Constant multiple rule: For any fixed real number c, dx [cf (x)] = c dx [f (x)] d d d Addition rule: dx [f (x) ± g(x)] = dx [f (x)] ± dx [g(x)] d The power rule: dx [x n ] = nx n−1 d Product rule: dx [f (x) · g(x)] = f (x) · g ′ (x) + f ′ (x) · g(x) ′ (x)g ′ (x) d f (x) Quotient rule: dx [ g(x) ] = g(x)f (x)−f g(x)2 Example d [6x 5 + 3x 2 + 3x + 1] dx • • • • • d d d d Apply the addition rule: dx [6x 5 ] + dx [3x 2 ] + dx [3x] + dx [1] d d Apply the linear and constant rules: dx [6x 5 ] + dx [3x 2 ] + 3 + 0 d d [x 5 ] + 3 dx [x 2 ] + 3 Apply the constant multiplier rule: 6 dx Then the power rule: 6(5x 4 ) + 3(2x) + 3 And finally: 30x 4 + 6x + 3 CHAPTER 4. CALCULUS 87 4.1. DIFFERENTIATION 88 Chain rule If a function f is composed of two differentiable functions y (x) and u(x), so that f (x) = y (u(x)), then f (x) is differentiable and: df dy du = · dx du dx This rule can be applied sequentially to nestings (compositions) of many functions: f (g(h(x))) df df dg dh = · · dx dg dh dx The chain rule is very useful when you recompose functions in terms of nested functions. Example Given f (x) = (x 2 + 1)3 , we can define another function u(x) = x 2 + 1, thus we can rewrite f (x) in terms of u(x), that is: f (x) = u(x)3 . • • • • df df We can apply the chain rule: dx = du · du dx . df d d 3 2 Then substitute: dx = du [u ] · dx (x + 1). df Then we can just apply the rest of our rules: dx = 3u 2 · 2x. df Then substitute again: dx = 3(x 2 + 1)2 · 2x and simplify. 4.1.4 Higher order derivatives The derivative of a function as described above is the first derivative. The second derivative, or second order derivative, is the derivative of the first derivative, denoted f ′′ (x). There’s also the third derivative, f ′′′ (x), and a fourth, and so on. Any derivative beyond the first is a higher order derivative. Notation The above notation gets unwieldy, so there are alternate notations. For the nth derivative: • f (n) (x) (this is to distinguish from f n (x) which is the quantity f (x) raised to the nth power) dnf • dx n (Leibniz notation) dn • dx n [f (x)] (another form of Leibniz notation) • Dn f (Euler’s notation) 88 CHAPTER 4. CALCULUS 89 4.1.5 4.1. DIFFERENTIATION Explicit differentiation When dealing with multiple variables, there is sometimes the option of explicit differentiation. This simply involves expressing one variable in terms of the other. For example: x 2 + y 2 = 1. This can be rewritten in terms of x like so: y = ±(1 − x 2 )1/2 . Here it is easy to apply the chain rule: u(x) = 1 − x 2 y = u(x)1/2 dy d 1/2 d = [u ] · [1 − x 2 ] dx du dx d 1/2 d d 2 = [u ] · ( [1] − [x ]) du dx dx d 1/2 d = [u ] · (− [x 2 ]) du dx d 1/2 = [u ] · (−2x) du 1 = u −1/2 · (−2x) 2 1 = (1 − x 2 )−1/2 · (−2x) 2 = −x(1 − x 2 )−1/2 x =− (1 − x 2 )1/2 x =− y 4.1.6 Implicit differentiation Implicit differentiation is useful for differentiating equations which cannot be explicitly differentiated because it is impossible to isolate variables. With implicit differentiation, you do not need to define one of the variables in terms of the other. For example, using the same equation from before: x 2 + y 2 = 1. First, differentiate with respect to x on both sides of the equation: d 2 d [x + y 2 ] = [1] dx dx d 2 [x + y 2 ] = 0 dx d 2 d 2 [x ] + [y ] = 0 dx dx To differentiate d 2 dx [y ], CHAPTER 4. 
CALCULUS we can define a new function f (y (x)) = y 2 and then apply the chain rule: 89 4.1. DIFFERENTIATION 90 df df dy d 2 dy = · = [y ] · = 2y · y ′ dx dy dx dy dx So returning to our other in-progress derivative: d 2 d 2 [x ] + [y ] = 0 dx dx We can substitute and bring it to completion: d 2 d 2 [x ] + [y ] = 0 dx dx d 2 [x ] + 2y y ′ = 0 dx 2x + 2y y ′ = 0 2y y ′ = −2x 2x y′ = − 2y x y′ = − y 4.1.7 Derivatives of trigonometric functions d sin(x) = cos(x) dx d cos(x) = − sin(x) dx d tan(x) = sec2 (x) dx d sec(x) = sec(x) tan(x) dx d csc(x) = − csc(x) cot(x) dx d cot(x) = − csc2 (x) dx 1 d arcsin(x) = √ dx 1 − x2 d −1 arccos(x) = √ dx 1 − x2 d 1 arctan(x) = dx 1 + x2 90 CHAPTER 4. CALCULUS 91 4.1.8 4.1. DIFFERENTIATION Derivatives of exponential and logarithmic functions d x e = ex dx d x a = ln(a)ax dx d 1 ln(x) = dx x d 1 logb (x) = dx x ln(b) 4.1.9 Extreme Value Theorem A global maximum (or absolute maximum) of a function f on a closed interval I is a value f (c) such that f (c) ≥ f (x) for all x in I. A global minimum (or absolulte minimum) of a function f on a closed interval I is a value f (c) such that f (c) ≤ f (x) for all x in I. The extreme value theorem states that if f is a function that is continuous on the closed interval [a, b], then f has both a global minimum and a global maximum on [a, b]. It is assumed that a and b are both finite. Extrema and inflection/inflexion points Note that at any extremum (i.e. a minimum or a maximum), global or local, the slope is 0 because the graph stops rising/falling and “turns around”. For this reason, extrema are also called stationary points or turning points. Thus, the first derivative of a function is equal to 0 at extrema. But the converse does not hold true: the first derivative of a function is not always an extrema when it equals 0. This is because a slope of 0 may also be found at a point of inflection: To discern extrema from inflection points, you can use the extremum test, aka the second derivative test. If the second derivative at the stationary point is positive (increasing) or negative (decreasing), then we know we have a minimum or a maximum, respectively. The intuition here is that the rate of change is also changing at extrema (e.g. it is going from a positive slope to a negative slope, which indicates a maximum, or the reverse, which indicates a minimum). However, if the second derivative is also 0, then we still have not distinguished the point. It may be a saddle point or on a flat region. What you can do is continue differentiating until you get a non-zero result. If we take n to be the order of the derivative yielding the non-zero result, then if n − 1 is odd, we have a true extremum. Again, if it the non-zero result is positive, then it is a minimum, if it is negative, it is maximum. CHAPTER 4. CALCULUS 91 4.1. DIFFERENTIATION 92 An example of inflection. A maxima, where the second derivative is negative 92 CHAPTER 4. CALCULUS 93 4.1. DIFFERENTIATION However, if n − 1 is even, then we have a point of inflection. Critical points A critical point are points where the function’s derivative are 0 or not defined. So stationary points are critical points. 4.1.10 Rolle’s Theorem If a function f (x) is continuous on the closed interval [a, b], is differentiable on the open interval (a, b), and f (a) = f (b), then there exists at least one number c in the interval (a, b) such that f ′ (c) = 0. 
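Both the extremum test and Rolle's theorem are easy to check symbolically. A minimal sketch (assuming SymPy is available) on the arbitrary example f(x) = x³ − 3x, which takes the same value (zero) at a = −√3 and b = √3:

import sympy as sp

x = sp.symbols('x')
f = x**3 - 3*x                       # f(-sqrt(3)) = f(sqrt(3)) = 0

crit = sp.solve(sp.diff(f, x), x)    # stationary points, where f'(x) = 0
print(crit)                          # [-1, 1]: both lie inside (-sqrt(3), sqrt(3)), as Rolle's theorem promises

f2 = sp.diff(f, x, 2)                # second derivative, for the extremum test
for c in crit:
    print(c, f2.subs(x, c))          # f''(-1) = -6 < 0 -> maximum; f''(1) = 6 > 0 -> minimum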
This is basically saying that if you have an interval which ends with the same value it starts with, at some point in that curve the slope will be 0: 4.1.11 Mean Value Theorem If f (x) is continuous on the closed interval [a, b] and differentiable on the open interval (a, b), there exists a number c in the open interval (a, b) such that f ′ (c) = f (b) − f (a) b−a This is basically saying that there is some point on the interval where its instantaneous slope is equal to the average slope of the interval. Rolle’s Theorem is a special case of the Mean Value Theorem where f (a) = f (b). CHAPTER 4. CALCULUS 93 4.2. INTEGRATION 4.1.12 94 L’Hopital’s Rule An indeterminate limit is one which results in If limx→c f (x) g(x) is indeterminate of type 0 0 or ±∞ ±∞ , 0 0 or ±∞ ±∞ . then limx→c f (x) g(x) = limx→c f ′ (x) g ′ (x) . If the resulting limit here is also indeterminate, you can re-apply L’Hopital’s rule until it is not. Note that c can be a finite value, ∞, or −∞. 4.1.13 Taylor Series Certain functions can be expressed as an expansion of itself around a point a. This expansion is known as a Taylor series and is an infinite sum of that function and its derivatives around a: f (x) = ∞ ∑ f (n) (a) n=0 (x − a)n n! When a = 0, the series is known as a Maclaurin series. 4.2 Integration 4.2.1 Definite integral How can we find the area under a graph? We can try to approximate the area using a finite number (n) of rectangles. The area of rectangles are easy to calculate so we can just add up their area. Integration The more rectangles (i.e. increasing n) we fit, the better the approximation. So we can have n → ∞ to get the best approximation of the area under the curve. Say we have a function f (x) that is positive over some interval [a, b]. The width of each rectangle over that interval, divided into n rectangles (subintervals), is ∆x = b−a n . The endpoint of an subinterval th can be denoted xi , for i = 0, 1, . . . , n. For each i subinterval, we pick some sample point xi∗ in the interval [xi−1 , xi ]. This sample point is the height of the i th rectangle. Thus, for the i th rectangle, we have as its area: 94 CHAPTER 4. CALCULUS 95 4.2. INTEGRATION ai = xi∗ b−a n or ai = xi∗ ∆x So the total area for the interval is: n ∑ An = f (xi∗ )∆x i=1 This kind of area approximation is called a Riemann sum. The best approximation then is: lim n→∞ n ∑ f (xi∗ )∆x i=1 So we define the definite integral: Suppose f is a continuous function on [a, b] and ∆x = a and b is: ∫ b−a n . b a f (x)dx = lim An = lim n→∞ n→∞ Then the definite integral of f between n ∑ f (xi∗ )∆x i=1 where xi∗ are any sample points in the interval [xi−1 , xi ] and xk = a + k · ∆x for k = 0, . . . , n. In the expression integration. ∫b a f (x)dx, f is the integrand, a is the lower limit and b is the upper limit of Left-handed vs right-handed Riemann sums A right-handed Riemann sum is just one where xi∗ = xi , and a left-handed Riemann sum is just one where xi∗ = xi−1 . 4.2.2 Basic properties of the integral • The constant rule: ∫b a cf (x)dx = c ∫b a f (x)dx – A special case rule for integrating constants is: • Addition and subtraction rule: CHAPTER 4. CALCULUS ∫b a ∫b a cdx = c(b − a) (f (x) ± g(x))dx = ∫b a f (x)dx ± ∫b a g(x)dx 95 4.2. INTEGRATION 96 • The comparison rule ∫ – Suppose f (x) ≥ 0 for all x in [a, b]. Then ab f (x)dx ≥ 0. ∫ ∫ – Suppose f (x) ≥ g(x) for all x in [a, b]. Then ab f (x)dx ≥ ab g(x)dx. ∫ – Suppose M ≥ f (x) ≥ m for all x in [a, b]. Then M(b − a) ≥ ab f (x)dx ≥ m(b − a). 
• Additivity with respect to endpoints: Suppose a < c < b. Then ∫b c f (x)dx. ∫b a f (x)dx = ∫c a f (x)dx + – This is basically saying the area under the graph from a to b is equal to the area under the graph from a to c plus the area under the graph from c to b, so long as c is some point between a and b. • Power rule of integration: As long as n ̸= 1 and 0 ∈ / [a, b], or n > 0, bn+1 −an+1 n+1 4.2.3 a x n dx = x n+1 b n+1 |a = Mean Value Theorem for Integration ∫b Suppose f (x) is continuous on [a, b]. Then 4.2.4 ∫b a f (x)dx b−a = f (c) for some c in [a, b]. Antiderivatives If we have a function f which is the derivative of another function F , i.e. f = F ′ , then F is an antiderivative of f . Generally, a function f has many antiderivatives because of how constants work in derivatives. So we usually include a +C term, i.e. F (x) + C, to indicate that any constant can be added and still derive to f . Thus F often refers to a set of functions rather than a unique function. We say that the integral of f is equal to this set of functions: ∫ f (x)dx = F (x) + C This is the indefinite integral since we are not specifying a range the integral is computed over. Thus, we are not given an explicit value but rather the function(s) that results (this typically includes the ambiguous C term). Here f is known as the integrand. In a definite integral, we specify the upper and lower limits: ∫ b f (x)dx = F (x) + C a 96 CHAPTER 4. CALCULUS 97 4.2.5 4.2. INTEGRATION The fundamental theorem of calculus The fundamental theorem of calculus connects the concept of a derivative to that of an integral. Suppose that f is continuous on [a, b]. We can define a function F like so: ∫ x F (x) = a f (t)dt for x ∈ [a, b] Suppose f is continuous on [a, b] and F is defined by F (x) = ∫x a f (t)dt. Then F is differentiable on (a, b) for all x ∈ (a, b), i.e.: F ′ (x) = f (x) Thus F is the set of antiderivatives for f . Suppose f is continuous on [a, b] and F is any antiderivative of f . Then: ∫ b a f (x)dx = F (b) − F (a) Note that F (b) − F (a) may be notated as F (x)|ba . To understand why this is so, consider: F and f Say that F (X) gives the area under f (x) from 0 to x. If we want to compute the area under f (x) between x and x + h, we can do so like: CHAPTER 4. CALCULUS 97 4.2. INTEGRATION 98 F (x + h) − F (x) h Note that this is also how the derivative is calculated. So as we take the limit of h → 0: F (x + h) − F (x) = f (x) h→0 h lim Thus we have shown that F ′ (x) = f (x), that is, that F is the antiderivative of f . Given some arbitrary: F (b) − F (a) We know that this is also equal to: ∫ b f (x)dx a Therefore: ∫ b a f (x)dx = F (b) − F (a) Basic properties of indefinite integrals The integral rules defined above still apply. 98 CHAPTER 4. CALCULUS 99 4.2. INTEGRATION ∫ 1 • Power rule for indefinite integrals: For all n ̸= −1, x n dx = n+1 x n+1 + C ∫ d • Integral of the inverse function: For f (x) = x1 , remember that dx ln x = x1 , so dx x = ln |x| + C ∫ d x • Integral of the exponential function: Because dx e = e x , e x dx = e x + C • The substitution rule for indefinite integrals: Assume u is differentiable with a continuous ∫ ∫ derivative and that f is continuous on the range of u. Then f (u(x)) du dx dx = f (u)du. – Remember that du dx is not a fraction, so you’re not just “canceling” things out here. Integration by parts Suppose f and g are differentiable and their derivatives are continuous. 
Then: ( ∫ ) ∫ g(x)dx − f (x)g(x)dx = f (x) ∫ ( ′ f (x) ∫ ) g(x)dx dx You set f (x) in the following order, called ILATE: • • • • • I for inverse trigonometric functions L for log functions A for algebraic functions T for trigonometric functions E for expontential functions 4.2.6 Improper integrals There are two types of improper integrals: 1. Those on an unbounded function, e.g.: ∫ b f (x)dx a 2. Those on an unbounded interval, e.g.: ∫ +∞ f (x)dx a The integral on an unbounded function depicted above is known as an “improper integral with infinite integrand at b”. To compute such an integral, we just consider a point infinitesimally previous to b: CHAPTER 4. CALCULUS 99 4.2. INTEGRATION 100 An unbounded function ∫ lim+ ϵ→0 b−ϵ f (x)dx a The integral on an unbounded interval is known as an “improper integral on an infinite interval”. Here we just consider the limit: ∫ N lim N→+∞ a f (x)dx If the interval is unbounded in both directions, we consider instead two separate intervals: ∫ ∫ +∞ −∞ = ∫ 0 −∞ +∞ f (x)dx + f (x)dx 0 Say we have the integral ∫ ∞ 1 dx x2 If we set the upper bound to be a finite value b and have it approach infinity, we get: ∫ lim b b→∞ 1 100 ( ) dx 1 1 = lim − 2 b→∞ 1 x b ( ) 1 = lim 1 − b→∞ b =1 CHAPTER 4. CALCULUS 101 4.2. INTEGRATION The formal definition: 1. Suppose ∫b a f (x)dx exists for all b ≥ a. Then we define ∫ ∞ ∫ b f (x)dx = lim b→∞ a a f (x)dx as long as this limit exists and is finite. If it does exist we say the integral is convergent and otherwise we say it is divergent. 2. Similarly if ∫b a f (x)dx exists for all a ≤ b we define ∫ ∫ b −∞ f (x)dx = lim a→−∞ a 3. Finally, suppose c is a fixed real number and that Then we define ∫ ∞ −∞ ∫ f (x)dx = ∫c −∞ b f (x)dx f (x)dx and ∫ c −∞ f (x)dx + ∞ ∫∞ c f (x)dx are both convergent. f (x)dx c . Improper integrals with a finite number of discontinuities Suppose f is continuous on [a, b] except at points c1 < c2 < · · · < cn in [a, b]. We define ∫ ∫ b f (x)dx = a ∫ c1 c2 f (x)dx + a c1 f (x)dx + · · · + ∫ b f (x)dx cn as long as each integral on the right converges. Improper integral with one discontinuity As a simpler example, say we have an improper integral with a single discontinuity. If f is continuous on the interval [a, b) and is discontinuous at b, we define ∫ ∫ b a f (x)dx = lim− c→b c f (x)dx a If this limit exists, the integral we say it converges and otherwise we say it diverges. Similarly, if f is continuous on the interval (a, b] and is discontinuous at a, we define CHAPTER 4. CALCULUS 101 4.3. MULTIVARIABLE CALCULUS 102 ∫ ∫ b a f (x)dx = lim+ c→a c f (x)dx a Finally, if f has a discontnuity at a point c in (a, b) and is continuous at all other points in [a, b], if ∫ ∫ both ac f (x)dx and cb f (x)dx converge, we define ∫ ∫ b f (x)dx = b f (x)dx + a 4.3 ∫ c a f (x)dx c Multivariable Calculus We are frequently dealing with data in many dimensions, so we must expand the previous concepts of derivatives and integrals to higher-dimensional spaces. 4.3.1 Integration Double integrals A definite integral for y = f (x) is the area under the curve of f (x), which is the sum of the areas of infinitely small rectangles assembled in the shape of the curve. But say we are working with three dimensions, i.e. we have z = f (x, y ). Then the volume under the surface of f (x, y ) is the sum of the volumes of infinitely small chunks in the shape of the surface. The area of one face of that chunk is the area under the curve, with respect to x, from x = 0 to x = b (in the illustration below), i.e. 
the integral: ∫ b f (x, y )dx 0 Because this is with respect to x, this integral will be some function of y , e.g. g(y ). To get the volume of this chunk, we multiply that area by some depth dy , so the volume of a chunk is: (∫ ) b f (x, y )dx dy 0 So if we want to get the volume in the bounds of y = 0, y = a, then we integrate again: ∫ a (∫ b ) f (x, y )dx dy 0 102 0 CHAPTER 4. CALCULUS 103 4.3. MULTIVARIABLE CALCULUS A double integral! It is also written without the parentheses: ∫ a ∫ b f (x, y )dxdy 0 0 Illustration of a double integral. Note that here we first integrated wrt to x and then y , but you can do it the other way around as well (integrate wrt y first, then x). Note: the lower bounds here were 0 but that’s just an example. Another way of conceptualizing double integrals You could instead conceptualize the double integral as the sum of the volumes of infinitely small columns: The area of each column’s base, dx · dy , is sometimes notated as dA. Variable boundaries In the previous example we had fixed boundaries (see accompanying illustration, on the left). What if instead we have a variable boundary (see accompanying illustration, on the right. The lower x boundary varies now). Well you express variable boundaries as a functions. As is the case in the example above, the lower x boundary is some function of y , g(y ). So the volume would be: ∫ a ∫ b f (x, y )dxdy 0 CHAPTER 4. CALCULUS g(y ) 103 4.3. MULTIVARIABLE CALCULUS 104 Another illustration of a double integral. Illustration of variable boundaries. 104 CHAPTER 4. CALCULUS 105 4.3. MULTIVARIABLE CALCULUS That’s if you first integrate wrt to x. If you first integrate wrt to y , instead the upper y boundary is varying and that would be some function of x, h(x), i.e.: ∫ b ∫ h(x) f (x, y )dy dx 0 0 Triple integrals Triple integrals also involve infinitely small volumes and in many cases are no different than double integrals. Illustration of a triple integral. So why use triple integrals? Well they are good for calculating the mass of something - if the density under the surface is not uniform. The density at a given point is expressed as f (x, y , z), so the mass of a variably dense volume can be expressed as: ∫ xf inal ∫ yf inal ∫ zf inal f (x, y , z )dzdy dx x0 4.3.2 y0 z0 Partial derivatives Say you have a function z = f (x, y ). With two variables, we are now working in three dimensions. How does differential calculus work in 3 (or more) dimensions? In three dimensions, what is the slope at a given point? Any given point has an infinite number of tangent lines (only one tangent plane though). So when you take a derivative in three dimensions, you have to specify what direction that derivative is in. CHAPTER 4. CALCULUS 105 4.3. MULTIVARIABLE CALCULUS 106 Say we have z = x 2 +xy +y 2 . If we want to take a derivative of this function, we have to hold one variable constant and derive with respect to the other variable. This derivative is called a partial derivative. If we were doing it wrt to x, then it would be notated as: ∂z ∂x or fx (x, y ) So we could work this out as: y =C z = x 2 + xC + C 2 ∂z = 2x + C ∂x ∂z = 2x + y ∂x A 3d function (the sphere function, z = x 2 + y 2 ) Then you could get the partial derivative wrt to y , i.e.: x =C z = C 2 + Cy + y 2 ∂z = C + 2y ∂y ∂z = x + 2y ∂y The plane that these two functions define together for a given point (x, y ) is the tangent plane at that point. 
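These partial derivatives can be checked numerically with finite differences: hold one variable fixed, nudge the other slightly, and compare against the analytic results 2x + y and x + 2y above. A minimal sketch in plain Python, evaluated at the arbitrary point (1, 2):

def f(x, y):
    return x**2 + x*y + y**2

x0, y0, h = 1.0, 2.0, 1e-6

dz_dx = (f(x0 + h, y0) - f(x0, y0)) / h   # y held constant
dz_dy = (f(x0, y0 + h) - f(x0, y0)) / h   # x held constant

print(dz_dx, 2*x0 + y0)   # ~4.0 vs the analytic 2x + y = 4
print(dz_dy, x0 + 2*y0)   # ~5.0 vs the analytic x + 2y = 5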
More generally, for a function f (x, y ), the partial derivatives would be: ∂ f (x + ∆, y ) − f (x, y ) f (x, y ) = lim ∆→0 ∂x ∆ ∂ f (x, y + ∆) − f (x, y ) f (x, y ) = lim ∆→0 ∂y ∆ The partial derivative tells us how much the output of a function f changes with the given variable. As alluded to earlier, this is important for machine learning because it tells us how a change in each weight in a multidimensional problem will affect f . 4.3.3 Directional derivatives Partial derivatives can be generalized into directional derivatives, which are derivatives with respect to to any arbitrary line (it does not have to be, for example, with respect to the x or y axis). That 106 CHAPTER 4. CALCULUS 107 4.3. MULTIVARIABLE CALCULUS is, with respect to any arbitrary direction. We represent a direction as a unit vector. 4.3.4 Gradients A gradient is a vector of all the partial derivatives at a given point, which is to say it is a generalization of the derivative from two-dimensions to higher dimensions. The gradient of a function f , with a vector input w = [w1 , . . . , wd ] is notated: ∂f ∂w1 . ∇w f = .. ∂f ∂wd Sometimes it is just notated as ∇. The gradient of some function f (x, y ), i.e. z = f (x, y ) is: ∇f = fx î + fy ĵ That is, the partial derivative of f wrt to x times the unit vector in the x direction, î , plus the partial of f wrt to y times the unit vector in the y direction, ĵ. It can also be written (this is just different notation): ∇f = ∂ ∂ f (x, y )î + f (x, y )ĵ ∂x ∂y It’s worth noting that this can be thought of in terms of matrices, i.e. given some function f : Rm×n → R (that is, it takes a matrix A ∈ Rm×n and returns a real value), then the gradient of f , with respect to the matrix A, is the matrix of partial derivatives: ∂f (A) ∂A11 ∂f (A) ∂A21 ∂f (A) ∂A12 ∂f (A) ∂A22 .. . ··· ··· .. . ∂f (A) ∂Am1 ∂f (A) ∂Am2 ··· ∇A f (A) ∈ Rm×n = ... Which is to say that (∇A f (A))ij = ∂f (A) ∂A1n ∂f (A) ∂A2n .. . ∂f (A) ∂Amn ∂f (A) ∂Aij . Properties Some properties, taken from equivalent properties of partial derivatives, are: • ∇x (f (x) + g(x)) = ∇x f (x) + ∇x g(x) • For t ∈ R, ∇x (tf (x)) = t∇x f (x) CHAPTER 4. CALCULUS 107 4.3. MULTIVARIABLE CALCULUS 108 Example Say we have the function f (x, y ) = x 2 + xy + y 2 . Using the partials we calculated previously, the gradient is: ∇f = (2x + y )î + (2y + x)ĵ So what we’re really calculating here is a vector field, which gives an x and a y vector, with the magnitude of the partial derivative of f wrt to x and the partial derivative of f wrt to y , respectively, then getting the vector which is the sum of those two vectors. What the gradient tells us is, for a given point, what direction to travel to get the maximum slope for z. 4.3.5 The Jacobian For the vector F (x) = [f (x)1 , . . . , f (x)k ]T , the Jacobian, notated ∇x F (X) or as just J, is: (x)1 · · · . .. .. ∇x F (x) = . ∂ ∂x1 f (x)k · · · ∂ f ∂x1 ∂ ∂xd f (x)1 .. . ∂ ∂xd f (x)k That is, it is an m × n matrix of the first-order partial derivatives for a function f : Rn → Rm , (i.e. for a function that defines a vector field). To clarify, the difference between the gradient and the Jacobian is that the gradient is for a single function, thus yielding a vector, whereas the Jacobian is for multiple functions, thus yielding a matrix. 4.3.6 The Hessian Say we have a function f : Rn → R, which takes as input a some vector x ∈ Rn and returns a real number (that is, it defines a scalar field). 
The Hessian matrix with respect to x, written ∇2x f (x) or as just H, is the n×n matrix of second-order partial derivatives: ∂ 2 f (x) ∂x12 ∂ 2 f (x) ∇2x f (x) ∈ Rn×n ∂x2 ∂x1 = . .. ∂ 2 f (x) ∂xn ∂x1 Which is to say (∇2x f (x))ij = 108 ∂ 2 f (x) ∂x1 ∂x2 ∂ 2 f (x) ∂x22 ··· .. . ··· .. . ∂ 2 f (x) ∂xn ∂x2 ··· ∂ 2 f (x) ∂x1 ∂xn ∂ 2 f (x) ∂x2 ∂xn .. . ∂ 2 f (x) ∂xn2 ∂ 2 f (x) ∂xi ∂xj . CHAPTER 4. CALCULUS 109 4.3. MULTIVARIABLE CALCULUS Wherever the second partial derivatives are continuous, the Hessian is symmetric, i.e. In machine learning, the Hessian is typically completely symmetric. ∂ 2 f (x) ∂xi ∂xj = ∂ 2 f (x) ∂xj ∂xi . Just as the second derivative test is used to check if a critical point is a maximum, a minimum, or still ambiguous, as the Hessian is composed of second-order partial derivatives, it does the same for multiple dimensions. This is accomplished as follows. If the Hessian matrix is real and symmetric, it can be decomposed into a set of real eigenvalues and an orthogonal basis of eigenvectors. At critical points of the function we can look at the Hessian’s eigenvalues: • If the Hessian is positive definite, we have a local minimum (because movement in any direction is positive) • If the Hessian is negative definite, we have a local maximum (because movement in any direction is negative) • When at least one eigenvalue is positive and at least one is negative, we have a saddle point • When all non-zero eigenvalues are of the same sign, but at least one is zero, we still have an ambiguous critical point The Jacobian and the Hessian are related by: H(f )(x) = J(∇f )(x) Intuitively, the i , jth element of the Hessian tells how the i, jth dimension accelerate together. For example, if the element is negative, then as one dimension accelerates, the other decelerates. 4.3.7 Scalar and vector fields A scalar field just means a space where, for any point, you can get a scalar value. For example, with f (x, y ) = x 2 + xy + y 2 , for any (x, y ) you get a scalar value. A vector field is similar but instead of just a scalar value, you get a value and a direction. For example, V⃗ = 2x î + 5y ĵ or V⃗ = x 2 y î + y ĵ. 4.3.8 Divergence Say we have a vector field V⃗ = x 2 y î + 3y ĵ. The divergence of that vector field is: di v (V⃗ ) = ∇ · V⃗ CHAPTER 4. CALCULUS 109 4.3. MULTIVARIABLE CALCULUS 110 An example of a 2D vector field That is, it is the dot product (which tells us how much two vectors move together) of the gradient and the vector field. So for our example: ∂ ∂ î + ĵ ∂x ∂y ∂ 2 ∂ ∇·⃗ v= (x y ) + (3y ) ∂x ∂y = 2xy + 3 ∇= The divergence, which is scalar number for any point in a vector field, represents the change in volume density from an infinitesimal volume around a given point in that field. A positive divergence means the volume density is decreasing (more going out than coming in); a negative divergence means the volume density is increasing (more is coming in than going on, this is also called convergence). A divergence of 0 means the volume density is not changing. Using our previously calculated divergence, say we want to look at the point (4, 3). We get the divergence 2 · 4 · 3 + 3 = 27 This means that, in an infinitesimal volume around the point (4, 3), the volume is decreasing. 110 CHAPTER 4. CALCULUS 111 4.3.9 4.4. DIFFERENTIAL EQUATIONS Curl The curl measures the rotational effect of a vector field at a given point. 
Unlike divergence, where we are seeing how much the gradient and the vector field move together, we are interested in seeing how they move against each other. So we use their cross product: curl(V⃗ ) = ∇ × V⃗ 4.3.10 Optimization with eigenvalues Consider, for a symmetric matrix A ∈ Sn the following equality-constrained optimization problem: maxn x T Ax, subject to ||x||22 = 1 x∈R Optimization problems with equality constraints are typically solved by forming the Lagrangian, an objective function which includes the equality constraints. For this particular problem (i.e. with the quadratic form), the Lagrangian is: L(x, λ) = X T Ax − λx T x Where λ is the Lagrange multiplier associated with the equality constraint. For x ∗ to be the optimal point to the problem, the gradient of the Lagrangian has to be zero at x ∗ (among other conditions), i.e.: ∇x L(x, λ) = ∇x (x T Ax − λx T x) = 2AT x − 2λx = 0 This is just the linear equation Ax = λx, so the only points which can maximize (or minimize) x T Ax, assuming x T x = 1, are the eigenvectors of A. 4.4 Differential Equations Differential equations are simply just equations that contain derivatives. Ordinary differential equations (ODEs) involve equations containing: • variables • functions • their derivatives CHAPTER 4. CALCULUS 111 4.4. DIFFERENTIAL EQUATIONS 112 • their solutions This is contrasted to partial differential equations (PDEs), which contain partial derivatives instead of ordinary derivatives. 4.4.1 Solving simple differential equations Say we have: f ′′ (x) = 2 First we can integrate both sides: ∫ ′′ ∫ f (x)dx = 2dx f ′ (x) = 2x + C1 Then we can integrate once more: ∫ ′ f (x)dx = ∫ 2x + C1 dxf (x) = x 2 + C1 x + C2 So our solution is f (x) = x 2 + C1 x + C2 . For all values of C1 and C2 , we will get f ′′ = 2. The values C1 and C2 represent initial conditions, e.g. the starting conditions of a model. 4.4.2 Basic first order differential equations There are four main types (though there are many others) of differential equations: • • • • separable homogenous linear exact Separable differential equations A separable equation is in the form: dy f (x) = dx g(x) You can group the terms together like so: 112 CHAPTER 4. CALCULUS 113 4.4. DIFFERENTIAL EQUATIONS g(y )dy = f (x)dx And then integrate both sides to obtain the solution: ∫ ∫ g(y )dy = f (x)dx + C Example Say we want to solve dy = 3x 2 y dx Separate the terms: dy = (3x 2 )dx y Then integrate: ∫ ∫ dy = 3x 2 dx y ln y = x 3 + C y = ex 3 +C If we let k = e C so k is a constant, we can write the solution as: y = ke x 3 Homogenous differential equations A homogenous equation is in the form: dy = f (y /x) dx To make things easier, we can use the substitution v= CHAPTER 4. CALCULUS y x 113 4.4. DIFFERENTIAL EQUATIONS 114 so dy = f (v ) dx Then we can set y = xv and use the product rule, so that we get: dy dx dv v +x dx dv x dx dv dx =v +x dv dx = f (v ) = f (v ) − v = f (v ) − v x so now the equation is in separable form and be solved as a separable equation. Linear differential equations A linear first order differential equation is a differential equation in the form: dy + f (x)y = g(x) dx To solve, you multiply both sides by I = e ∫ f (x)dx and integrate. I is known as the integrating factor. Example y ′ − 2xy = x So in this case, f (x) = −2x, and g(x) = x, so the equation could be written: y ′ + f (x)y = g(x) So, we calculate the integrating factor: ∫ I=e −2xdx = e −x 2 and multiply both sides by I, i.e.: 114 CHAPTER 4. CALCULUS 115 4.5. 
REFERENCES e −x (y ′ − 2xy ) = xe −x 2 (e −x · y ′ ) − 2xe −x y = xe −x 2 2 2 ∫ ((e −x 2 ′ 2 · y ) − 2xe −x 2 ∫ y )dx = xe −x dx 2 and work out the integration. Exact differential equations An exact equation is in the form: f (x, y ) + g(x, y ) such that df dx = dy =0 dx dg dx . There exists some function h(x, y ) where dh = f (x, y ) dx dh = g(x, y ) dy so long as f , g, 4.5 • • • • • df dy and dg dx are continuous on a connected region. References Calculus. Revised 14 October 2013. Wikibooks. Multivariable Calculus. Khan Academy. Linear Algebra Review and Reference. Zico Kolter. October 16, 2007. Deep Learning. Yoshua Bengio, Ian Goodfellow, Aaron Courville. Math for Machine Learning. Hal Daumé III. August 28, 2009. CHAPTER 4. CALCULUS 115 4.5. REFERENCES 116 116 CHAPTER 4. CALCULUS 117 5 Probability Probability theory is the study of uncertainty. 5.1 Probability space We typically talk about the probability of an event. The probability space defines the possible outcomes for the event, and is defined by the triple (Ω, F, P ), where • Ω is the space of possible outcomes, i.e. the outcome space (sometimes called the sample space). • F ⊆ 2Ω , where 2Ω is the power set of Ω (i.e. the set of all subsets of Ω, including the empty set ∅ and Ω itself, the latter of which is called the trivial event), is the space of measurable events or the event space. • P is the probability measure, i.e. the probability distribution, that maps an event E ∈ F to a real value between 0 and 1 (that is, P is a function that outputs a probability for the input event). For example, we have a six-sided dice, so the space of possible outcomes Ω = {1, 2, 3, 4, 5, 6}. We are interested in whether or not the dice roll is odd or even, so the event space is F = {∅, {1, 3, 5}, {2, 4, 6}, Ω}. The outcome space Ω may be finite, in which the event space F is typically taken to be 2Ω , or it may be infinite, in which the probability measure P must satisfy the following axioms: • non-negativity: for all α ∈ F, P (α) ≥ 0. • trivial event: P (Ω) = 1. • additivity: For all α, β ∈ F and α ∩ β = ∅, P (α ∪ β) = P (α) + P (β). Other axioms include: CHAPTER 5. PROBABILITY 117 5.2. RANDOM VARIABLES 118 • 0 ≤ P (a) ≤ 1 • P (True) = 1 and P (False) = 0 We refer to an event whose outcome is unknown as a trial, or an experiment, or an observation. An event is a trial which has resolved (we know the outcome), and we say “the event has occurred” or that the trial has “satisfied the event”. The compliment of an event is everything in the outcome space that is not the event, and may be notated in a few ways: ¬E, E C , Ē. If two events cannot occur together, they are mutually exclusive. More concisely (from Probability, Paradox, and the Reasonable Person Principle): • Experiment: An occurrence with an uncertain outcome that we can observe. – For example, rolling a die. • Outcome: The result of an experiment; one particular state of the world. Synonym for “case.” – For example: 6. • Sample Space: The set of all possible outcomes for the experiment. (For now, assume each outcome is equally likely.) – For example, {1, 2, 3, 4, 5, 6}. • Event: A subset of possible outcomes that together have some property we are interested in. – For example, the event “even die roll” is the set of outcomes {2, 4, 6}. • Probability: The number of possible outcomes in the event divided by the number in the sample space. – For example, the probability of an even outcome from a six-sided die is |{2, 4, 6}| / |{1, 2, 3, 4, 5, 6}| = 3/6 = 1/2. 
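These definitions translate directly into a few lines of code. A minimal sketch in plain Python (standard library only) that enumerates the outcome space for a single die roll and computes the probability of the event "even roll" under equally likely outcomes:

from fractions import Fraction

outcomes = {1, 2, 3, 4, 5, 6}               # the outcome (sample) space, Omega
even = {o for o in outcomes if o % 2 == 0}  # the event "even die roll"

p_even = Fraction(len(even), len(outcomes))  # |event| / |sample space|
print(p_even)                                # 1/2

# The trivial event Omega has probability 1; the empty event has probability 0.
print(Fraction(len(outcomes), len(outcomes)), Fraction(0, len(outcomes)))   # 1 and 0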
5.2 Random Variables

A random variable (sometimes called a stochastic variable) is a function which maps outcomes to real values (that is, they are technically not variables but rather functions), dependent on some other probabilistic factor. Random variables represent uncertain events we are interested in with a numerical value.

Random variables are typically denoted by a capital letter, e.g. $X$. Values they may take are typically represented with a lowercase letter, e.g. $x$.

When we use $P(X)$ we are referring to the distribution of the random variable $X$, which describes the probabilities that $X$ takes on each of its possible values. Contrast this with $P(x)$, which describes the probability of some arbitrary single value $x$; this is shorthand for the probability that some random variable (e.g. $X$) takes on the value $x$, which is more explicitly denoted $P(X = x)$ or $P_X(x)$. This represents some real value.

For example, we may be flipping a coin and have a random variable $X$ which takes on the value 0 if the flip results in heads and 1 if it results in tails. If it's a fair coin, then we could say that $P(X = \text{heads}) = P(X = \text{tails}) = 0.5$.

Random variables may be:

• Discrete: the variable can only have specific values, e.g. on a 5 star rating system, the random variable could only be one of the values [0, 1, 2, 3, 4, 5]. Another way of describing this is that the space of the variable's possible values (i.e. the outcome space) is countable. For discrete random variables which are not numeric, e.g. gender (male, female, etc), we use an indicator function $I$ to map non-numeric values to numbers, e.g. male = 0, female = 1, . . . ; we call variables from such functions indicator variables.
• Continuous: the variable can have arbitrarily exact values, e.g. time, speed, distance. That is, the outcome space is uncountably infinite.
• Mixed: these variables assign probabilities to both discrete and continuous random variables.

5.3 Joint and disjoint probabilities

• Joint probability: $P(a \cap b) = P(a \wedge b) = P(a, b) = P(a)P(b|a)$, the probability of both $a$ and $b$ occurring.
• Disjoint probability: $P(a \cup b) = P(a \vee b) = P(a) + P(b) - P(a, b)$, the probability of $a$ or $b$ occurring.

Probabilities can be visualized as a Venn diagram, where the overlap is where both $a$ and $b$ occur ($P(a, b)$). The rule above for $P(a \cup b)$ adds the regions where $a$ and $b$ each occur, but this counts their overlap twice, so we subtract it once.

5.4 Conditional Probability

The conditional probability is the probability of $A$ given $B$, notated $P(A|B)$.

Formally, this is:

$P(A|B) = \frac{P(A \cap B)}{P(B)}$

where $P(B) > 0$.

For example, say you have two dice. Say $A = \{\text{snake eyes}\}$ and $B = \{\text{doubles}\}$. $P(A) = \frac{1}{36}$, since out of the 36 possible dice pairings, only one is snake eyes. $P(B) = \frac{1}{6}$, since 6 of the 36 possible dice pairings are doubles.

Now what is $P(A|B)$? Intuitively, if $B$ has happened, we have reduced our possible event space to just the 6 doubles, one of which is snake eyes, so $P(A|B) = \frac{1}{6}$.

Conditional probability can be thought of as the probability of $a$ given a universe where $b$ occurs: we ignore the part of the world in which $b$ does not occur, and the overlapping region is $P(a|b)$.

This can be re-written as:

$P(a, b) = P(a|b)P(b)$

5.5 Independence

Events $X$ and $Y$ are independent if:

$P(X \cap Y) = P(X)P(Y)$

That is, their outcomes are unrelated. Another way of saying this is that:
$P(X|Y) = P(X)$

Knowing something about $Y$ tells us nothing more about $X$. The independence of $X$ and $Y$ can also be notated as $X \perp Y$.

From this we can infer that:

$P(X, Y) = P(X)P(Y)$

More generally, we can say that events $A_1, \ldots, A_n$ are mutually independent if $P(\bigcap_{i \in S} A_i) = \prod_{i \in S} P(A_i)$ for any $S \subseteq \{1, \ldots, n\}$. That is, the joint probability of any subset of these events is just equal to the product of their individual probabilities.

Mutual independence implies pairwise independence, but note that the converse is not true (that is, pairwise independence does not imply mutual independence).

5.5.1 Conditional Independence

Conditional independence is defined as:

$P(X|Y, Z) = P(X|Z)$

if $X$ is independent of $Y$ conditioned on $Z$, which is to say $X$ is independent of $Y$ if $Z$ is true or known.

From this we can infer that:

$P(X, Y|Z) = P(X|Z)P(Y|Z)$

Note that mutual independence does not imply conditional independence.

Similarly, we can say that events $A_1, \ldots, A_n$ are conditionally independent given $C$ if $P(\bigcap_{i \in S} A_i | C) = \prod_{i \in S} P(A_i|C)$ for any $S \subseteq \{1, \ldots, n\}$.

5.6 The Chain Rule of Probability

Say we have the joint probability $P(a, b, c)$. How do we turn this into conditional probabilities?

We can set $y = b, c$ (that is, the intersection of $b$ and $c$), then we have $P(a, b, c) = P(a, y)$, and we can just apply the previous equation:

$P(a, y) = P(a|y)P(y)$
$P(a, b, c) = P(a|b, c)P(b, c)$

And we can again apply the previous equation to $P(b, c)$, which gets us:

$P(a, b, c) = P(a|b, c)P(b|c)P(c)$

This is generalized as the chain rule of probability:

$P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i | x_{i-1}, \ldots, x_1)$

5.7 Combinations and Permutations

5.7.1 Permutations

With permutations, order matters. For instance, AB and BA are different permutations (though they are the same combination, see below).

Permutations are notated:

$_x P_y = P_y^x = P(x, y)$

where:

• $x$ = total number of "items"
• $y$ = "spots" or "spaces" or "positions" available for the items.

A permutation can be expanded like so:

$_n P_k = n \times (n-1) \times (n-2) \times \cdots \times (n - (k - 1))$

and generalized to the following formula:

$_n P_k = \frac{n!}{(n-k)!}$

For example, consider $_7 P_3$. $7! = 7 \times 6 \times 5 \times 4 \times 3 \times 2 \times 1$, but we only have 3 positions, i.e. $7 \times 6 \times 5$, so we divide by $4!$ to get our final answer of 210.

5.7.2 Combinations

With combinations, the order doesn't matter. The notation is basically the same as for permutations, except with a C instead of a P:

$_n C_k = \frac{_n P_k}{k!}$

or, expanded:

$_n C_k = \binom{n}{k} = \frac{n!}{k!(n-k)!}$

The $n$ and $k$ pairing together is known as the binomial coefficient, and read as "n choose k".

5.7.3 Combinations, permutations, and probability

Example

Say you have a coin and flip it 8 times. What is the probability of flipping exactly 3 heads?

The number of ways of getting exactly 3 heads in 8 flips is the combination $\binom{8}{3}$; that is, we are basically trying to find all the combinations of 3 head flips out of the total 8 flips.

$_8 C_3 = 56$

So there are 56 possible outcomes that result in exactly 3 heads. Because a coin has two possible outcomes, and we're flipping 8 times, we know there are $2^8 = 256$ total possible outcomes. So to figure out the probability, we can just take the ratio of these outcomes:

$P(3\text{ heads}) = \frac{56}{2^8} = \frac{7}{32}$

Example

Given outcome probabilities $P(A) = 0.8$ and $P(B) = 0.2$, what is the probability of exactly 3 out of 5 trials being A?
Basically, like before, we're looking for the possible combinations of 3 A's out of 5 trials, that is 5 choose 3, i.e. $_5 C_3 = 10$. So we know there are 10 possible orderings resulting in exactly 3 A's.

But what is the probability of a single ordering that results in 3 A's out of 5 trials? We were given the probabilities, so it's just multiplication:

$P(A)P(A)P(A)P(B)P(B) = 0.8^3 \times 0.2^2$

So then we just multiply the number of these orderings, 10, by this resulting probability to get the final answer.

5.8 Probability Distributions

For some random variable $X$, there is a probability distribution function $P(X)$ (usually just called the probability distribution); the particular kind depends on what kind of random variable $X$ is (i.e. discrete or continuous).

A probability distribution describes the probability of a random variable taking on its possible values. If the random variable $X$ is distributed according to, say, a Poisson distribution, we say that "$X$ is Poisson-distributed".

Distributions themselves are described by parameters - variables which determine the specifics of a distribution. Different kinds of distributions are described by different sets of parameters. For instance, the particular shape of a normal distribution is determined by $\mu$ (the mean) and $\sigma$ (the standard deviation); we say the normal distribution is parameterized by $\mu$ and $\sigma$.

Often we are given a set of data and assume it comes from a particular kind of distribution, such as a normal distribution, but we don't know the specific parameterization of the distribution; that is, with the normal distribution example, we don't know what $\mu$ and $\sigma$ are, so we use the data we have to try and infer these unknown parameters.

5.8.1 Probability Mass Functions (PMF)

For discrete random variables, the distribution is a probability mass function. It is called a "mass function" because it divides a unit mass (the total probability) across the different values the random variable can take. For example, a random variable might take on one of three discrete values, {1, 3, 7}, with a corresponding probability assigned to each.

5.8.2 Probability Density Functions (PDF)

For continuous random variables we have a probability density function. A probability density function $f$ is a non-negative, integrable function such that:

$\int_{\text{Val}(X)} f(x)\,dx = 1$

where $\text{Val}(X)$ denotes the range of the random variable $X$, so this is the integral over the range of possible values for $X$. The total area under the curve sums to 1 (which is to say that the aggregate probability of all possible values for $X$ sums to 1).

The probability of a random variable $X$ distributed according to a PDF $f$ is computed:

$P(a \leq X \leq b) = \int_a^b f(x)\,dx$

It's worth noting that this implies that, for a continuous random variable $X$, the probability of taking on any given single value is zero (when dealing with continuous random variables there are infinitely precise values). Rather, we compute probabilities for a range of values of $X$; the probability is given by the area under the curve over this range. (A numerical sketch of these PMF and PDF computations follows the list below.)

5.8.3 Distribution Patterns

There are a few ways we can describe distributions.

• Unimodal: The distribution has one main peak.
• Bimodal: The distribution has two (approximately) equivalent main peaks.
• Multimodal: The distribution has more than two (approximately) equivalent main peaks.
• Symmetrical: The distribution falls in equal numbers on both sides of the middle.
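As a concrete illustration of the PMF/PDF distinction, here is a minimal sketch (assuming NumPy and SciPy are available; the specific probability values for the discrete example are made up for illustration):

```python
import numpy as np
from scipy import stats
from scipy.integrate import quad

# PMF: a discrete random variable taking values {1, 3, 7}
values = np.array([1, 3, 7])
probs = np.array([0.2, 0.5, 0.3])   # must sum to 1 (the "unit mass")
assert np.isclose(probs.sum(), 1.0)
print("P(X = 3) =", probs[values == 3][0])

# PDF: a continuous random variable, e.g. the standard normal
f = stats.norm(loc=0, scale=1).pdf

# The density integrates to 1 over the whole range...
total, _ = quad(f, -np.inf, np.inf)
print("total area =", total)        # ~1.0

# ...and probabilities come from integrating over a range,
# P(a <= X <= b) = integral of f from a to b
p, _ = quad(f, -1, 1)
print("P(-1 <= X <= 1) =", p)       # ~0.68 for the standard normal
```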
Skewness

Skewness describes distributions that have greater density on one side of the distribution. The side with less density is the direction of the skew. Skewness is defined:

$\text{skewness} = E\left[\left(\frac{X - \mu}{\sigma}\right)^3\right]$

where $\sigma$ is just the standard deviation. The normal distribution has a skewness of 0.

Kurtosis

Kurtosis describes how the shape differs from a normal curve (if the tails are lighter or heavier). Kurtosis is defined:

$\text{kurtosis} = \frac{E[(X - \mu)^4]}{(E[(X - \mu)^2])^2}$

The standard normal distribution has a kurtosis of 3, so sometimes kurtosis is standardized by subtracting 3; this standardized kurtosis measure is called the excess kurtosis.

5.9 Cumulative Distribution Functions (CDF)

A cumulative distribution function $\text{CDF}(x)$, often also denoted $F(x)$, describes the cumulative probability up to where the random variable takes the value $x$, which is to say it tells you $P(X \leq x)$.

The complementary distribution (CCDF) of a distribution is $1 - \text{CDF}(x)$.

5.9.1 Discrete random variables

The cumulative distribution function of a discrete random variable is just the sum of the probabilities for the values up to $x$. Say our discrete random variable $X$ can take values from $\{x_1, \ldots, x_n\}$. We can define the CDF for $X$ as:

$\text{CDF}(x) = \sum_{\{x_i | x_i \leq x\}} P(X = x_i)$

The complete discrete CDF is a step function, as you might expect, because the CDF is constant between discrete values.

5.9.2 Continuous random variables

The cumulative distribution function of a continuous random variable is:

$\text{CDF}(x) = \int_{-\infty}^{x} f(t)\,dt$

That is, it is the integral of the PDF up to the value in question.

Probability values for a specific range $x_1 \to x_2$ can be calculated with:

$\text{CDF}(x_2) - \text{CDF}(x_1)$

or, equivalently:

$\int_{x_1}^{x_2} f(x)\,dx$

5.9.3 Using CDFs

Visually, there are a few tricks you can do with CDFs:

• You can estimate the median by looking at where $\text{CDF}(x) = 0.5$.
• You can estimate the probability that $x$ falls between two values.
• You can estimate a confidence interval as well; for example, the 90% confidence interval, by looking at the $x$ values in the range which produces $\text{CDF}(x) = 0.05$ and $\text{CDF}(x) = 0.95$.

5.9.4 Survival function

The survival function of a random variable $X$ is the complement of the CDF; that is, it is the probability that the random variable is greater than some value $x$, i.e. $P(X > x)$. So the survival function is:

$S(x) = P(X > x) = 1 - \text{CDF}(x)$

5.10 Expected Values

The expected value of a random variable $X$ is:

$E[X] = \mu$

That is, it is the average (mean) value. It can be thought of as a way of "summarizing" a random variable to a single value. A random variable can be thought of as a sample from a potentially infinite population, and a sample from that population is expected to be the mean of that population.
The value of that mean depends on the distribution of that population.

5.10.1 Discrete random variables

For a discrete random variable $X$ with a sample space $X(\Omega)$ (i.e. all possible values $X$ can take) and with a PMF $p$, the expected value is:

$E[X] = \sum_{x \in X(\Omega)} x P(X = x)$

The expected value exists only if this sum is well-defined, which basically means it has to aggregate in some clear way, as either a finite value or positive or negative infinity. It can't, for instance, contain a positive infinity term and a negative infinity term simultaneously, because it's undefined how those combine. For example, consider the infinite sum $1 - 1 + 1 - 1 + \ldots$: the $-1$ terms go to negative infinity and the $+1$ terms go to positive infinity - this is an undefined sum.

5.10.2 Continuous random variables

For a continuous random variable $X$ and a PDF $f$, the expected value is:

$E[X] = \int_{-\infty}^{\infty} x f(x)\,dx$

The expected value exists only when this integral is well-defined.

5.10.3 The expectation rule

A function $g(X)$ of a random variable $X$ is itself a random variable. The expected value of that function can be expressed based on the random variable $X$, like so:

$E[g(X)] = \sum_{x \in X(\Omega)} g(x)p(x)$

$E[g(X)] = \int_{-\infty}^{\infty} g(x)f(x)\,dx$

using whichever is appropriate, depending on whether $X$ is discrete or continuous.

5.10.4 Jensen's Inequality

Jensen's Inequality states that given a convex function $g$, then $E[g(X)] \geq g(E[X])$.

5.10.5 Properties of expectations

For random variables $X$ and $Y$ where $E[|X|] < \infty$ and $E[|Y|] < \infty$ (that is, $E[X]$, $E[Y]$ are finite), we have the following properties:

1. $E(a) = a$ for all $a \in \mathbb{R}$. That is, the expected value of a constant is just the constant. This is called the normalization property.
2. $E(aX) = aE(X)$ for all $a \in \mathbb{R}$.
3. $E(X + Y) = E(X) + E(Y)$.
4. If $X \geq 0$, that is, all possible values of $X$ are greater than or equal to 0, then $E[X] \geq 0$.
5. If $X \leq Y$, that is, each possible value of $X$ is less than or equal to the corresponding value of $Y$, then $E[X] \leq E[Y]$. This is called the order property.
6. If $X$ and $Y$ are independent, then $E[XY] = E[X]E[Y]$. Note that the converse is not true; that is, if $E[XY] = E[X]E[Y]$, this does not necessarily mean that $X$ and $Y$ are independent.
7. $E[I_A(X)] = P(X \in A)$; that is, the expected value of an indicator function

$I_A(x) = \begin{cases} 1 & \text{if } x \in A \\ 0 & \text{otherwise} \end{cases}$

is the probability that the random variable $X$ is in $A$.

Properties 2 and 3 are called linearity. To put linearity another way: let $X_1, X_2, \ldots, X_n$ be random variables, which may be dependent or independent; then:

$E(X_1 + X_2 + \cdots + X_n) = E(X_1) + E(X_2) + \cdots + E(X_n)$

5.11 Variance

The variance of a distribution is the "spread" of a distribution. The variance of a random variable $X$ tells us how spread out the data is along that variable's axis/dimension.

It can be defined in a couple ways:

$\text{Var}(X) = E[(X - E[X])^2] = E[X^2] - E[X]^2$

Variance is not a linear function of $X$; for instance:

$\text{Var}(aX + b) = a^2 \text{Var}(X)$

If random variables $X$ and $Y$ are independent, then:

$\text{Var}(X + Y) = \text{Var}(X) + \text{Var}(Y)$

5.11.1 Covariance

The covariance of two random variables is a measure of how "closely related" they are:

$\text{Cov}(X, Y) = E[(X - E[X])(Y - E[Y])]$

With more than two variables, a covariance matrix is used. Covariance matrices show two things:
• the variance of a variable $i$, located at the $i, i$ element
• the covariance of variables $i, j$, located at the $i, j$ and $j, i$ elements

If the covariance between two variables is negative, then we have a downward slope; if it is positive, then we have an upward slope. So the covariance matrix tells us a lot about the shape of the data.

5.12 Common Probability Distributions

Here a few distributions you are likely to encounter are described in more detail.

5.12.1 Probability mass functions

Bernoulli Distribution

A random variable distributed according to the Bernoulli distribution can take on two possible values, 0 and 1, typically described as "failure" and "success" respectively. It has one parameter $p$, the probability of success, which is taken to be $P(X = 1)$. Such a random variable is sometimes called a Bernoulli random variable.

The distribution is described as:

$P(X = x) = p^x (1 - p)^{1-x}$

And for a Bernoulli random variable $X$, it is notated $X \sim \text{Ber}(p)$. $X$ is 1 with probability $p$ and $X$ is 0 with probability $1 - p$.

The mean of a Bernoulli distribution is $\mu = p$, and the standard deviation is $\sigma = \sqrt{p(1-p)}$.

A Bernoulli distribution describes a single trial, though often you may consider multiple trials, each with its own random variable.

Geometric Distribution

Say we have a set of iid Bernoulli random variables, each representing a trial. What is the probability of finding the first success on the $n$th trial? This can be described with a geometric distribution, which is a distribution where the probabilities decrease exponentially fast. It is formalized as:

$P(n) = (1 - p)^{n-1} p$

with mean $\mu = \frac{1}{p}$ and standard deviation $\sigma = \sqrt{\frac{1-p}{p^2}}$.

Binomial Distribution

Suppose you have a binomial experiment (i.e. one with two mutually exclusive outcomes, such as "success" or "failure") of $n$ trials (that is, $n$ Bernoulli trials), where $p$ is the probability of success on an individual trial and is the same across all trials (that is, the trials are $n$ iid Bernoulli random variables). You want to determine the probability of $k$ successes in those $n$ trials. The resulting distribution is a binomial distribution.

Note that binomial is in contrast to multinomial, in which a random variable can take on more than just two discrete values. This shouldn't be confused with multivariate, which refers to a situation where there are multiple variables.

The binomial distribution has the following properties:

$\mu = np$
$\sigma = \sqrt{np(1-p)}$

The binomial distribution is expressed as:

$P(X = k) = \binom{n}{k} p^k (1 - p)^{n-k}$

A binomial random variable $X$ is denoted:

$X \sim \text{Bin}(n, p)$

Here $X$ ends up being the number of successes that occurred over our $n$ trials. Its expected value is:

$E[X|n, p] = np$

The binomial distribution has two parameters:

• $n$ - a positive integer representing the number of trials
• $p$ - the probability of success in a single trial

The special case $n = 1$ corresponds to the Bernoulli distribution. If we have $Z_1, Z_2, \ldots, Z_n$ Bernoulli random variables with the same $p$, then $X = Z_1 + Z_2 + \cdots + Z_n \sim \text{Bin}(n, p)$. Thus the expected value of a Bernoulli random variable is $p$ (because $n = 1$).

Some example questions that can be answered with a binomial distribution:

• Out of ten tosses, how many times will this coin be heads?
• From the children born in a given hospital on a given day, how many of them will be girls?
• How many students in a given classroom will have green eyes?
• How many mosquitoes, out of a swarm, will die when sprayed with insecticide?

(Source)

When the number of trials $n$ gets large, the shape of the binomial distribution starts to approximate a normal distribution with the parameters $\mu = np$ and $\sigma = \sqrt{np(1-p)}$.

Negative Binomial Distribution

The negative binomial distribution is a more general form of the geometric distribution; instead of giving the probability of the first success on the $n$th trial, it gives the probability of an arbitrary $k$th success on the $n$th trial. Like the geometric and binomial distributions, it is expected that the trials are iid Bernoulli random variables. The other requirement is that the last trial is a success. This distribution is described as:

$P(k|n) = \binom{n-1}{k-1} p^k (1 - p)^{n-k}$

Poisson Distribution

The Poisson distribution is useful for describing the number of rare (independent) events in a large population (of independent individuals) during some time span. It looks at how many times a discrete event occurs over a period of continuous space or time, without a fixed number of trials.

$P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}, \quad k = 0, 1, 2, \ldots$

If $X$ is Poisson-distributed, we notate it:

$X \sim \text{Poisson}(\lambda)$

For the Poisson distribution, $\lambda$ is any positive real number. Its size is proportional to the probability of larger values in the distribution. That is, increasing $\lambda$ assigns more probability to large values; decreasing it assigns more probability to small values. It is sometimes called the intensity of the distribution; for the Poisson distribution, $\lambda$ is known as the "(average) arrival rate" or sometimes just the "rate". $k$ must be a non-negative integer (e.g. $0, 1, 2, \ldots$).

For Poisson distributions, the expected value of our random variable is equal to the parameter $\lambda$, that is:

$E[X|\lambda] = \mu = \lambda$

Although a plot of the Poisson PMF looks like the values fall off at some point, it actually has an infinite tail, so every non-negative integer has some positive probability.

Example

On average, 9 cars pass this intersection every hour. What is the probability that exactly two cars pass the intersection this hour? Assume a Poisson distribution.

This problem can be framed as: what is $P(X = 2)$? We know the expected value is 9 and that we have a Poisson distribution, so $\lambda = 9$ and:

$P(X = 2) = \frac{9^2}{2!} e^{-9}$

Some example questions that can be answered with a Poisson distribution:

• How many pennies will I encounter on my walk home?
• How many children will be delivered at the hospital today?
• How many products will I sell after airing a new television commercial?
• How many mosquito bites did you get today after having sprayed with insecticide?
• How many defects will there be per 100 metres of rope sold?

(Source)

(A quick numerical check of the binomial and Poisson examples above is sketched below, after the uniform distribution.)

5.12.2 Probability density functions

Uniform distribution

With the uniform distribution, every value is equally likely. It may be constrained to a range of values as well.
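Here is that check - a minimal sketch assuming SciPy is available:

```python
from scipy import stats

# Binomial: probability of exactly 3 heads in 8 fair coin flips
# (should match the combinatorial answer 56 / 2**8 = 7/32)
print(stats.binom.pmf(k=3, n=8, p=0.5))      # ~0.21875

# Binomial: probability of exactly 3 A's in 5 trials with P(A) = 0.8
# (should match 10 * 0.8**3 * 0.2**2)
print(stats.binom.pmf(k=3, n=5, p=0.8))      # ~0.2048

# Poisson: 9 cars/hour on average, probability exactly 2 pass this hour
# (should match (9**2 / 2!) * e**-9)
print(stats.poisson.pmf(k=2, mu=9))          # ~0.0050
```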
Exponential Distribution

A continuous random variable may have an exponential density; such a variable is often described as an exponential random variable:

$f_X(x|\lambda) = \lambda e^{-\lambda x}, \quad x \geq 0$

Here we say $X$ is exponential:

$X \sim \text{Exp}(\lambda)$

Like the Poisson random variable, the exponential random variable can only take positive values. But because it is continuous, it can also take on non-integral values such as 4.25.

For exponential distributions, the expected value of our random variable is equal to the inverse of the parameter $\lambda$, that is:

$E[X|\lambda] = \frac{1}{\lambda}$

Example

Say we have the random variable $Y$, which is the exact amount of rain we will get tomorrow, in inches. What is the probability that $Y = 2 \pm 0.1$? Assume you have the probability density function $f$ for $Y$.

We'd notate the probability we're looking for like so:

$P(|Y - 2| < 0.1)$

which is the probability that $Y \approx 2$ within a tolerance (acceptable deviance) of 0.1 (i.e. 1.9 to 2.1). Then we would just find the integral (area under the curve) of the PDF from 1.9 to 2.1, i.e.

$\int_{1.9}^{2.1} f(y)\,dy$

Gamma Distribution

$X \sim \text{Gam}(\alpha, \beta)$

This is over positive real numbers. It is a generalization of the exponential random variable:

$\text{Exp}(\beta) \sim \text{Gam}(1, \beta)$

The PDF is:

$f(x|\alpha, \beta) = \frac{\beta^\alpha x^{\alpha-1} e^{-\beta x}}{\Gamma(\alpha)}$

where $\Gamma(\alpha)$ is the Gamma function.

Normal (Gaussian) Distribution

The normal distribution is perhaps the most common probability distribution, occurring very often in nature. For a random variable $x$, the normal probability density function is (writing $\exp(x) = e^x$):

$P(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$

The (univariate) Gaussian distribution is parameterized by $\mu$ and $\sigma$ (for multivariate Gaussian distributions, see below). The peak of the distribution is where $x = \mu$.

The height and width of the distribution vary according to $\sigma$. The lower $\sigma$ is, the taller and thinner the distribution is; the higher $\sigma$ is, the lower and wider the distribution is. The standard normal distribution is just $N(0, 1)$.

The Gaussian distribution can be used to approximate other distributions, such as the binomial distribution when the number of experiments is large, or the Poisson distribution when the average arrival rate is high.

A normal random variable $X$ is denoted:

$X \sim N(\mu, \sigma)$

where the parameters are:

• $\mu$ = the mean
• $\sigma$ = the standard deviation

The expected value is:

$E[X|\mu, \sigma] = \mu$

t Distribution

For small sample sizes ($n < 30$), the distribution of the sample mean deviates slightly from the normal distribution, since the sample mean doesn't exactly match the population's mean. This distribution is the t-distribution, which, for large enough sample sizes ($\geq 30$), converges to the normal distribution, so it may be used for large sample sizes too.

The t-distribution has thicker tails than the normal distribution, so observations are more likely to fall beyond two standard deviations of the mean. This allows for more accurate estimations of the standard error for small sample sizes.

The t-distribution is always centered around zero and is described by one parameter: the degrees of freedom. The higher the degrees of freedom, the closer the t-distribution is to the standard normal distribution.

The confidence interval is computed slightly differently for a t-distribution.
Instead of the Z score we use a cutoff, $t_{df}$, determined by the degrees of freedom for the distribution. For a single sample with $n$ observations, the degrees of freedom is $df = n - 1$. For two samples, you can use a computer to calculate the degrees of freedom, or you can choose the smaller sample size minus 1.

The t-distribution's corresponding test is the t-test, sometimes called the "Student t-test", which is used to compare the means of two groups.

From the t-distribution we can calculate a t value:

$t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$

Then we can use this t value with the t-distribution with the degrees of freedom for the sample and use that to compute a p-value.

Beta Distribution

For an event with two outcomes, the beta distribution is the probability distribution of the probability of the outcome being positive. The beta distribution's domain is $[0, 1]$, which makes it appropriate for this use.

That is, in a beta distribution both the y and the x axes represent probabilities: the x-axis is the possible probabilities for the event in question, and the y-axis is the probability that that possible probability is the true probability.

It is notated:

$\text{Beta}(\alpha, \beta)$

where $\alpha$ is the number of positive examples and $\beta$ is the number of negative examples.

Its PDF is:

$f(x|\alpha, \beta) = \frac{x^{\alpha-1}(1-x)^{\beta-1}}{B(\alpha, \beta)}$

where $B(\alpha, \beta)$ is the Beta function.

The Beta distribution is a generalization of the uniform distribution:

$\text{Uniform}() \sim \text{Beta}(1, 1)$

The mean of a beta distribution is just $\frac{\alpha}{\alpha + \beta}$, which is pretty straightforward if you think about it.

If you need to estimate the probability of something happening, the beta distribution can be a good prior since it is quite easy to calculate its posterior distribution:

$\text{Beta}(\alpha_{\text{posterior}}, \beta_{\text{posterior}}) = \text{Beta}(\alpha_{\text{likelihood}} + \alpha_{\text{prior}}, \beta_{\text{likelihood}} + \beta_{\text{prior}})$

That is, you just use some plausible prior values for $\alpha$ and $\beta$ such that you have a plausible mean, then add your new positive and negative examples to update the beta distribution.

Weibull Distribution

The Weibull distribution is used for modeling reliability or "survival" data, e.g. for dealing with failure rates. It is defined as:

$f_X(x) = \begin{cases} \frac{k}{\lambda}\left(\frac{x}{\lambda}\right)^{k-1} e^{-(x/\lambda)^k} & x \geq 0 \\ 0 & x < 0 \end{cases}$

The $k$ parameter is the shape parameter and the $\lambda$ parameter is the scale parameter of the distribution.

If $x$ is the "time-to-failure", the Weibull distribution describes the failure rate over time. In this case, the parameter $k$ influences how the failure rate changes over time: if $k < 1$, the failure rate decreases over time (for instance, defective products fail early and are weeded out); if $k = 1$, the failure rate is constant; and if $k > 1$, the failure rate increases with time (e.g. parts degrade over time).

Chi-square ($\chi^2$) distribution

The $\chi^2$ distribution is closely related to the normal distribution and often used as a sampling distribution. The $\chi^2$ distribution with $f$ degrees of freedom, sometimes notated $\chi^2_{[f]}$, is the sum of the squares of $f$ independent standard normal (i.e. $\mu = 0$, $\sigma = 1$) variates, i.e.:

$Y = X_1^2 + X_2^2 + \cdots + X_f^2$

This distribution has a mean of $f$ and a variance of $2f$ (this comes from the additive property of the mean and the variance). The skewness of the distribution follows the same additive property and is $\sqrt{\frac{8}{f}}$. So when $f$ is small, the distribution skews to the right, and the skewness decreases as $f$ increases.
When $f$ is very large, the distribution approaches the standard normal distribution (by the central limit theorem).

5.13 Pareto distributions

A Pareto distribution has a CDF of the form:

$\text{CDF}(x) = 1 - \left(\frac{x}{x_m}\right)^{-\alpha}$

Pareto distributions are characterized as having a long tail (i.e. many small values, few large ones), but the large values are large enough that they still make up a disproportionate share of the total (e.g. the large values take up 80% of the distribution, the rest 20%). Such a distribution is described as scale-free, since it is not centered around any particular value; compare this to Gaussian distributions, which are centered around some value.

Such a distribution is said to obey the power law. A distribution $P(k)$ obeys the power law if, as $k$ gets large, $P(k)$ is asymptotic to $k^{-\gamma}$, where $\gamma$ is a parameter that describes the rate of decay. Such distributions are (confusingly) sometimes called scaling distributions because they are invariant to changes of scale, which is to say that you can change the units the quantities are expressed in and $\gamma$ doesn't change.

5.14 Multiple random variables

In the real world you often work with multiple random variables simultaneously - that is, you are working in higher dimensions.

You could describe a group of random variables as a random vector, i.e. a random vector $X \in \mathbb{R}^d$, where $d$ is the number of dimensions (the number of random variables) you are working in, i.e. $X = [X_1, \ldots, X_d]$.

A distribution over multiple random variables is called a joint distribution. For a joint distribution $P(a, b)$, the distribution of a subset of the variables is called a marginal distribution (or just marginal) of the joint distribution, and is computed:

$P(a) = \sum_b P(a, b)$

That is, fix $b$ to each of its possible outcomes and sum those probabilities.

Generally, you can compute the marginal like so:

$P(x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n) = \sum_{x_i} P(x_1, \ldots, x_n)$

So you take the variable you want to remove and sum over the probabilities with it fixed to each of its possible outcomes.

The probability density function for a joint distribution just takes more arguments, i.e.:

$P(a_1 \leq X_1 \leq b_1, \ldots, a_n \leq X_n \leq b_n) = \int_{a_1}^{b_1} \int_{a_2}^{b_2} \cdots \int_{a_n}^{b_n} f(x_1, x_2, \ldots, x_n)\,dx_1\,dx_2 \ldots dx_n$

5.14.1 Conditional distributions

Conditional distributions are distributions in which the values of one or more other random variables are known. For random variables $X, Y$, the conditional probability of $X = a$ given $Y = b$ is:

$P(X = a|Y = b) = \frac{P(X = a, Y = b)}{P(Y = b)}$

which is undefined if $P(Y = b) = 0$. This can be expanded to multiple given random variables:

$P(X = a|Y = b, Z = c) = \frac{P(X = a, Y = b, Z = c)}{P(Y = b, Z = c)}$

The conditional distribution of $X$, conditioned on $Y = b$, is notated $P(X|Y = b)$. More generally, we can describe the conditional distribution of $X$ conditioned on all the values of $Y$ as $P(X|Y)$.

For continuous random variables, the probability of the random variable being any given specific value is 0 (see the section on probability density functions), so here we would have a denominator of 0, which won't do.
However, it can be shown that the probability density function $f(y|x)$ underlying the distribution $P(Y|X)$ is given by:

$f(y|x) = \frac{f(x, y)}{f(x)}$

And thus:

$P(a \leq Y \leq b | X = c) = \int_a^b f(y|c)\,dy = \int_a^b \frac{f(c, y)}{f(c)}\,dy$

5.14.2 Multivariate Gaussian

A random vector $X$ is (multivariate) Gaussian (or "normal") if any linear combination of its components is (univariate) Gaussian. That is, the dot product of some vector $a$ transposed with $X$, which is:

$a^T X = \sum_{i=1}^n a_i X_i$

is Gaussian for every $a \in \mathbb{R}^n$. Note that "Gaussian" often implies "multivariate Gaussian".

We say $X$ is (multivariate) Gaussian distributed with mean $\mu$ (where $\mu \in \mathbb{R}^n$, that is, $\mu$ is a vector as well) and covariance matrix $C$, notated:

$X \sim N(\mu, C)$

which means $X$ is Gaussian with $E[X_i] = \mu_i$, $C_{ij} = \text{Cov}(X_i, X_j)$ and $C_{ii} = \text{Var}(X_i)$. $\mu$ and $C$ are the parameters of the distribution.

If $X$ is a random vector, $X \in \mathbb{R}^n$, i.e. $[X_1, \ldots, X_n]$, and Gaussian, i.e. $X \sim N(\mu, C)$ where $\mu$ is a vector $\mu = [\mu_1, \ldots, \mu_n]$, and if the covariance matrix $C$ has the variances on its diagonal and 0 everywhere else, as below, then the components of $X$ are independent and individually normally distributed, i.e. $X_i \sim N(\mu_i, \sigma_i^2)$:

$C = \begin{bmatrix} \sigma_1^2 & 0 & \cdots & 0 \\ 0 & \ddots & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \cdots & 0 & \sigma_n^2 \end{bmatrix} = \text{diag}(\sigma_1^2, \ldots, \sigma_n^2)$

Caveat: a random vector's individual components being Gaussian but not independent does not necessarily imply that the vector itself is Gaussian.

Intuitively this makes sense because if $X_i$ and $X_j$ are independent, then their covariance $\text{Cov}(X_i, X_j) = 0$, so all the $i, j$ entries in $C$ where $i \neq j$ are 0. This does not necessarily hold for non-Gaussians; this is a property particular to Gaussians.

Degenerate univariate Gaussian

A degenerate univariate Gaussian distribution is one where $X \equiv \mu$, that is: $X \sim N(\mu, 0)$.

Degenerate multivariate Gaussian

A multivariate Gaussian can also be degenerate, which is when the determinant of its covariance matrix $C$ is 0.

Examples of Gaussians (and non-examples)

(The original figure here showed some example Gaussians, drawn with their level sets - curves of constant density, like a topographical map - plus a degenerate Gaussian and some examples that are not Gaussians.)

Probability density function

A multivariate Gaussian random variable $X \sim N(\mu, C)$ only has a density function if it is non-degenerate (i.e. $\det(C) \neq 0$). The PDF is:

$f(x) = \frac{1}{\sqrt{|2\pi C|}} \exp\left(-\frac{1}{2}(x - \mu)^T C^{-1}(x - \mu)\right)$

Note that $|A| = \det(A)$, and the term $\sqrt{|2\pi C|}$ can also be written as $\sqrt{(2\pi)^n \det(C)}$.

Affine property

An affine transformation is just some function of the form $f(x) = Ax + b$. Any affine transformation of a Gaussian random variable is itself Gaussian: if $X \sim N(\mu, C)$, then $AX + b \sim N(A\mu + b, ACA^T)$.

Marginal distributions of a Gaussian

The marginal distributions of a Gaussian are also Gaussian. More formally, if you have a Gaussian random vector $X \in \mathbb{R}^n$, $X \sim N(\mu, C)$, which you decompose into $X_a = [X_1, \ldots, X_k]$, $X_b = [X_{k+1}, \ldots, X_n]$, where $1 \leq k \leq n$, then $X_a \sim N(\mu_a, C_{aa})$ and $X_b \sim N(\mu_b, C_{bb})$.

Conditional distributions of a Gaussian

The conditional distributions of a Gaussian are also Gaussian. More formally, if you have a Gaussian random vector $X \in \mathbb{R}^n$, $X \sim N(\mu, C)$, which you decompose into $X_a = [X_1, \ldots, X_k]$, $X_b = [X_{k+1}, \ldots, X_n]$
, where $1 \leq k \leq n$, then:

$(X_a | X_b = x_b) \sim N(m, D)$

where $m = \mu_a + C_{ab}C_{bb}^{-1}(x_b - \mu_b)$ and $D = C_{aa} - C_{ab}C_{bb}^{-1}C_{ba}$.

Sum of independent Gaussians

The sum of independent Gaussians is also Gaussian. More formally, if you have Gaussian random vectors $X \in \mathbb{R}^n$, $X \sim N(\mu_x, C_x)$, and $Y \in \mathbb{R}^n$, $Y \sim N(\mu_y, C_y)$, which are independent, then:

$X + Y \sim N(\mu_x + \mu_y, C_x + C_y)$

5.15 Bayes' Theorem

5.15.1 Intuition

The probability of both $a$ and $b$ occurring is $P(a \cap b)$. This is the same as the probability of $a$ given $b$ times the probability of $b$, and vice versa:

$P(a \cap b) = P(a|b)P(b) = P(b|a)P(a)$

This can be rearranged to form Bayes' Theorem:

$P(b|a) = \frac{P(a|b)P(b)}{P(a)}$

Bayes' Theorem is useful for answering questions such as, "How likely is A given B?". For example, "How likely is my hypothesis true given the data I have?"

5.15.2 A Visual Explanation

This explanation is adapted from Count Bayesie. Imagine a 6x10 area (60 pegs total) of lego bricks representing a probability space with the following probabilities:

$P(\text{blue}) = 40/60 = 2/3$
$P(\text{red}) = 20/60 = 1/3$

Red and blue alone describe the entire set of possible events. Yellow pegs are conditional upon the red and blue bricks; that is, their probabilities are conditional upon what color brick is underneath. So the following probability properties of yellow should be straightforward:

$P(\text{yellow}) = 6/60 = 1/10$
$P(\text{yellow}|\text{red}) = 4/20 = 1/5$
$P(\text{yellow}|\text{blue}) = 2/40 = 1/20$

But say you want to figure out $P(\text{red}|\text{yellow})$. This is intuitive visually in this example: you'd reason that there are 6 yellow pegs total, 4 of which are on the red space, so there's a 4/6 probability that we are in the red space for a given yellow peg.

This intuition is Bayes' Theorem, and can be written more formally as:

$P(\text{red}|\text{yellow}) = \frac{P(\text{yellow}|\text{red})P(\text{red})}{P(\text{yellow})}$

Step by step, what we did was:

$n_{\text{yellow}} = P(\text{yellow}) \times n_{\text{total}} = 1/10 \times 60 = 6$
$n_{\text{red}} = P(\text{red}) \times n_{\text{total}} = 1/3 \times 60 = 20$
$n_{\text{yellow}|\text{red}} = P(\text{yellow}|\text{red}) \times n_{\text{red}} = 1/5 \times 20 = 4$
$P(\text{red}|\text{yellow}) = \frac{n_{\text{yellow}|\text{red}}}{n_{\text{yellow}}} = 4/6 = 2/3$

If you expand out the last equation, you'll find Bayes' Theorem:

$P(\text{red}|\text{yellow}) = \frac{n_{\text{yellow}|\text{red}}}{n_{\text{yellow}}} = \frac{P(\text{yellow}|\text{red}) \times n_{\text{red}}}{P(\text{yellow}) \times n_{\text{total}}} = \frac{P(\text{yellow}|\text{red}) \times P(\text{red}) \times n_{\text{total}}}{P(\text{yellow}) \times n_{\text{total}}} = \frac{P(\text{yellow}|\text{red})P(\text{red})}{P(\text{yellow})}$

5.15.3 An Example Bayes' Problem

Consider the following problem:

1% of women at age forty who participate in routine screening have breast cancer. 80% of women with breast cancer will get positive mammographies. 9.6% of women without breast cancer will also get positive mammographies. A woman in this age group had a positive mammography in a routine screening. What is the probability that she actually has breast cancer?

Intuitively it's difficult to get the correct answer. Generally, only ~15% of doctors get it right (Casscells, Schoenberger, and Grayboys 1978; Eddy 1982; Gigerenzer and Hoffrage 1995; and many other studies).

You can work through the problem like so:

• 1% of women at age forty have breast cancer. To simplify the problem, assume there are 1000 women total, so 10/1000 have breast cancer.
• 80% of women with breast cancer will get positive mammographies. So of the 10 women that have breast cancer, 8/1000 of them will get positive mammographies.
• 9.6% of women without breast cancer will also get positive mammographies.
We have 10/1000 women with breast cancer, which means there are 990 without breast cancer. Of those 990, 9.6% will also get positive mammographies, so ~95/1000 women are false positives.

We can rephrase the problem like so: what is the probability that a woman in this age group has breast cancer, if she gets a positive mammography?

In total, the number of positives we have is 95 + 8 = 103. Then we can just use simple probability: there's an 8/103 chance (7.8%) that she has breast cancer, and a 95/103 chance (92.2%) that she's a false positive.

One way to interpret these results is that, in general, women of age forty have a 1% chance of having breast cancer. Getting a positive mammography does not indicate that you have breast cancer; it just "slides up" your probability of having it to 7.8%.

We could break up the group of 1000 women into:

• True positives: 8
• False positives: 95
• True negatives: 990 - 95 = 895
• False negatives: 10 - 8 = 2

Which totals to 1000, so everyone is accounted for.

5.15.4 Solving the problem with Bayes' Theorem

The original proportion of patients with breast cancer is the prior probability. The probability of a true positive and the probability of a false positive are the conditional probabilities. Collectively, this information is known as the priors. The priors are required to solve a Bayesian problem.

The final answer - the estimated probability that a patient has breast cancer given a positive mammography - is the revised probability, better known as the posterior probability.

If the two conditional probabilities are equal, the posterior probability equals the prior probability (i.e. if there's an equal chance of a positive result with and without breast cancer, then the test really tells you nothing).

5.15.5 Another Example

Your friend reads you a study which found that only 10% of happy people are rich, and concludes that money can't buy happiness. How could you show them otherwise?

Rather than asking "What percent of happy people are rich?", it is probably better to ask "What percent of rich people are happy?" to determine if money buys happiness. With the statistic from the study, statistics about the overall rate of happy people (say 40% of people are happy) and rich people (say 5% of people are rich), and Bayes' Theorem, you can calculate this value:

$P(\text{happy}|\text{rich}) = \frac{10\% \times 40\%}{5\%} = 80\%$

So it seems like a lot of rich people are happy.

5.15.6 Naive Bayes

Bayes' rule:

$P(a|b) = \frac{P(b|a)P(a)}{P(b)}$

Say $a$ is a class and $b$ is some evidence. We'll notate the class as $c$ and the evidence as $e$. We are interested in: what's the probability of a class $c$ given some evidence $e$? We can write this question out as Bayes' rule:

$P(c|e) = \frac{P(e|c)P(c)}{P(e)}$

Our evidence may actually be multiple pieces of evidence: $e_1, \ldots, e_n$. So instead we can re-write the equation as:

$P(c|e_1, \ldots, e_n) = \frac{P(e_1, \ldots, e_n|c)P(c)}{P(e_1, \ldots, e_n)}$

If we can assume that each piece of evidence is independent given the class $c$, then we can further write this as:

$P(c|e_1, \ldots, e_n) = \frac{\left[\prod_i^n P(e_i|c)\right]P(c)}{P(e_1, \ldots, e_n)}$

Example

In practice: say I have two coins. One is a fair coin ($P(\text{heads}) = 0.5$) and one is a trick coin ($P(\text{heads}) = 0.8$). I pick one of the coins at random and flip it twice, getting heads and then tails. Which coin did I pick?

The head and tail outcomes are our evidence. So we can take the product of the probabilities of these outcomes given a particular class.
The probability of picking either coin was uniform, i.e. there was a 50% chance of picking either, so we can ignore that prior.

For the fair coin, the probability of getting heads and then tails is $P(H|\text{fair}) \times P(T|\text{fair}) = 0.5 \times 0.5 = 0.25$. For the trick coin, the probability is $P(H|\text{trick}) \times P(T|\text{trick}) = 0.8 \times 0.2 = 0.16$. So it's more likely that I picked the fair coin.

If we flip again and get heads, things change a bit:

For the fair coin: $P(H|\text{fair}) \times P(T|\text{fair}) \times P(H|\text{fair}) = 0.5 \times 0.5 \times 0.5 = 0.125$.
For the trick coin: $P(H|\text{trick}) \times P(T|\text{trick}) \times P(H|\text{trick}) = 0.8 \times 0.2 \times 0.8 = 0.128$.

So now it's slightly more likely that I picked the trick coin.

5.16 The log trick

When working with many independent probabilities, which is often the case in machine learning, you have to multiply many probabilities together, which can result in underflow. So it's often easier to work with the logarithm of probability functions, which is fine because when optimizing, the max (or min) will be at the same location in the logarithm form (though the actual values will be different). Using logarithms allows us to sum terms instead of multiplying them.

5.17 Information Theory

Information, measured in bits, answers questions - the more initial uncertainty there is about the answer, the more information the answer contains.

The number of bits needed to encode an answer depends on the distribution over the possible answers (i.e., the uncertainty about the answer). Examples:

• the answer to a boolean question with a prior (0.5, 0.5) requires 1 bit to encode (i.e. just 0 or 1)
• the answer to a 4-way question with a prior (0.25, 0.25, 0.25, 0.25) requires 2 bits to encode
• the answer to a 4-way question with a prior (0, 0, 0, 1) requires 0 bits to encode, since the answer is already known (no uncertainty)
• the answer to a 3-way question with prior (0.5, 0.25, 0.25) requires, on average, 1.5 bits to encode

More formally, we can compute the average number of bits required to encode uncertain information as follows:

$\sum_i p_i \log_2 \frac{1}{p_i}$

This quantity is called the entropy of the distribution ($H$), and is sometimes written in the equivalent form:

$H(p_1, \ldots, p_n) = \sum_i -p_i \log_2 p_i$

If you do something such that the answer distribution changes (e.g. observe new evidence), the difference between the entropy of the new distribution and the entropy of the old distribution is called the information gain.

The basic intuition behind information theory is that learning that an unlikely event has occurred is more informative than learning that a likely event has occurred. A message saying "the sun rose this morning" is so uninformative as to be unnecessary to send, but a message saying "there was a solar eclipse this morning" is very informative.

The self-information of an event is:

$I(X) = -\ln P(X)$

and, when using the natural log, is measured in nats (when using $\log_2$, it is measured in bits or shannons). One nat is the information gained by observing an event of probability $\frac{1}{e}$.

Self-information only measures a single event; to measure the amount of uncertainty in a complete probability distribution we can instead use the Shannon entropy, which tells us the expected information of an event drawn from that distribution:

$H(X) = E_{X \sim P}[I(X)] = -E_{X \sim P}[\ln P(X)]$

When $X$ is continuous, the Shannon entropy is called the differential entropy.

5.17.1 Entropy

Broadly, entropy is the measure of disorder in a system.
In the case of probability, it is the measure of uncertainty associated with the distribution of a random variable.

If there are a few outcomes which are fairly certain, the system has low entropy. A point-mass distribution has the lowest entropy - we know exactly what value we'll get from it. If there are many outcomes which are equiprobable, the system has high entropy. A uniform distribution has the highest entropy - we don't really have any idea of what value we'll draw from it.

To put it another way: with high entropy, it is very hard to guess the value of the random variable (because all values are equally or similarly likely); with low entropy it is easy to guess its value (because there are some values which are much more likely than the others).

The entropy of a random variable $X$ is notated $H(X)$, must satisfy $H(X) \geq 0$, and is calculated:

$H(X) = -E[\lg(P(X))]$
$= -\sum_x P(x)\lg(P(x))$ (discrete)
$= -\int_{-\infty}^{\infty} P(x)\lg(P(x))\,dx$ (continuous)

where $\lg(x) = \log_2(x)$.

This does not say anything about the value of the random variable, only the spread of its distribution.

For example: what is the entropy of a roll of a six-sided die?

$-6 \times \frac{1}{6}\lg\left(\frac{1}{6}\right) = \lg(6) \approx 2.58$

The Maximum Entropy Principle says that, all else being equal, we should prefer distributions that maximize the entropy. That is, you should be conservative in your confidence about how much you know - if you don't have any good reason for something to be more likely than something else, err on the side of them being equiprobable.

5.17.2 Specific Conditional Entropy

The specific conditional entropy $H(Y|X = v)$ is the entropy of some random variable conditioned on another random variable taking some value.

5.17.3 Conditional Entropy

The conditional entropy $H(Y|X)$ is the entropy of some random variable conditioned on another random variable, i.e. it is the average specific conditional entropy of $Y$, that is:

$H(Y|X) = \sum_v P(X = v) H(Y|X = v)$

5.17.4 Information Gain

Say you must transmit the random variable $Y$. How many bits on average would be saved if both the sender and the recipient knew $X$? To put it more concretely:

$IG(Y|X) = H(Y) - H(Y|X)$

The bigger the difference, the more $X$ tells us about $Y$ (because it decreases the entropy, i.e. it makes it easier to guess $Y$).

5.17.5 Kullback-Leibler (KL) divergence

We can measure the difference between two probability distributions $P(X), Q(X)$ over the same random variable $X$ with the KL divergence:

$D_{KL}(P||Q) = E_{X \sim P}\left[\ln\frac{P(X)}{Q(X)}\right] = E_{X \sim P}[\ln P(X) - \ln Q(X)]$

The KL divergence has the following properties:

• It is non-negative.
• It is 0 if and only if:
  – $P$ and $Q$ are the same distribution (for discrete variables)
  – $P$ and $Q$ are equal "almost everywhere" (for continuous variables)
• It is not symmetric, i.e. $D_{KL}(P||Q) \neq D_{KL}(Q||P)$, so it is not a true distance metric.

The KL divergence is related to the cross entropy $H(P, Q)$:

$H(P, Q) = H(P) + D_{KL}(P||Q) = -E_{X \sim P}[\log Q(X)]$

Given some discrete random variable $X$ with possible outcomes $\{x_1, x_2, \ldots, x_m\}$, and the probability distribution of these outcomes, $P(X)$, we can compute the entropy $H(X)$:

$H(X) = -\sum_i^m P(x_i) \log_b P(x_i)$

Usually $b = 2$ (i.e. we use bits).
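A small numerical sketch of these quantities (entropy, KL divergence, and cross entropy for discrete distributions), assuming NumPy is available; the loaded-die probabilities below are just made-up illustration values:

```python
import numpy as np

def entropy(p, base=2):
    """H(P) = -sum_i p_i log(p_i), ignoring zero-probability outcomes."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p)) / np.log(base)

def kl_divergence(p, q, base=2):
    """D_KL(P||Q) = sum_i p_i log(p_i / q_i)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask])) / np.log(base)

def cross_entropy(p, q, base=2):
    """H(P, Q) = H(P) + D_KL(P||Q) = -sum_i p_i log(q_i)."""
    return entropy(p, base) + kl_divergence(p, q, base)

# A fair six-sided die has entropy log2(6) ~ 2.58 bits
print(entropy([1/6] * 6))

# KL divergence between a fair die and a loaded die; note the asymmetry
p = [1/6] * 6
q = [0.5, 0.1, 0.1, 0.1, 0.1, 0.1]
print(kl_divergence(p, q), kl_divergence(q, p))
print(cross_entropy(p, q))
```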
The mutual information between two discrete random variables $X$ and $Y$ can be computed:

$I(X; Y) = \sum_{y \in Y}\sum_{x \in X} p(x, y) \log\left(\frac{p(x, y)}{p(x)p(y)}\right)$

For continuous random variables, it is instead computed:

$I(X; Y) = \int_Y \int_X p(x, y) \log\left(\frac{p(x, y)}{p(x)p(y)}\right)dx\,dy$

The variation of information between two random variables is computed:

$VI(X; Y) = H(X) + H(Y) - 2I(X; Y)$

The Kullback-Leibler divergence tells us the difference between two probability distributions $P$ and $Q$. It is non-symmetric. For discrete probability distributions, it is calculated:

$D_{KL}(P||Q) = \sum_i P(i) \log\frac{P(i)}{Q(i)}$

For continuous probability distributions, it is computed:

$D_{KL}(P||Q) = \int_{-\infty}^{\infty} p(x) \log\frac{p(x)}{q(x)}\,dx$

5.18 References

• Probabilistic Programming and Bayesian Methods for Hackers. Cam Davidson-Pilon.
• Parameter Estimation - The PDF, CDF and Quantile Function. Count Bayesie. Will Kurt.
• What is the intuition behind beta distribution?. David Robinson, KerrBer.
• Distributions of One Variable. An Introduction to Statistics with Python. Thomas Haslwanter.
• Probability Theory Review for Machine Learning. Samuel Ieong. November 6, 2006.
• MIT 6.034 (Fall 2010): Artificial Intelligence. Patrick H. Winston. MIT.
• Principles of Statistics. M.G. Bulmer. 1979.
• OpenIntro Statistics, Second Edition. David M Diez, Christopher D Barr, Mine Çetinkaya-Rundel.
• A Beginner's Guide to Eigenvectors, PCA, Covariance and Entropy. Deeplearning4j. Skymind.
• An Intuitive Explanation of Bayes' Theorem. Eliezer S. Yudkowsky.
• Why so Square? Jensen's Inequality and Moments of a Random Variable. Count Bayesie. Will Kurt.
• What is an intuitive explanation of Bayes' Rule?. Mike Kayser.
• Bayes' Theorem with Lego. Count Bayesie. Will Kurt.
• Probability, Paradox, and the Reasonable Person Principle. Peter Norvig. October 3, 2015.
• Deep Learning. Yoshua Bengio, Ian Goodfellow, Aaron Courville.
• CS188: Artificial Intelligence. Dan Klein, Pieter Abbeel. University of California, Berkeley (edX).
• Think Complexity. Version 1.2.3. Allen B. Downey. 2012.
• Mining Massive Datasets (Coursera & Stanford, 2014). Jure Leskovec, Anand Rajaraman, Jeff Ullman.

6 Statistics

Broadly, statistics is concerned with collecting and analyzing data. It seeks to describe rigorous methods for collecting data (samples), for describing the data, and for inferring conclusions from the data. There are processes out there in the world that generate observable data, but these processes are often black boxes and we want to gain some insight into how they work.

We can crudely cleave statistics into two main practices: descriptive statistics, which provides tools for describing data, and inferential statistics, which provides tools for learning (inferring or estimating) from data.

This section is focused on frequentist (or classical) statistics, which is distinguished from Bayesian statistics (covered in another chapter).

6.0.1 Notation

• Regular letters, e.g. $X$, $Y$, typically denote observed (known) variables.
• Greek letters, e.g. $\mu$, $\sigma$, typically denote unknown variables which we are trying to estimate.
• Hats over letters, e.g. $\hat{\theta}$, denote estimators (an estimator is a rule for calculating an estimate given some observed data), e.g. an estimated value for a parameter.

6.1 Descriptive Statistics

Descriptive statistics involves computing values which summarize a set of data.
This typically includes statistics like the mean, standard deviation, median, min, and max, which are called summary statistics.

6.1.1 Scales of Measurement

In statistics, numbers and variables are categorized in certain ways.

Variables may be categorical (also called qualitative), in which case they represent discrete values (numbers here are arbitrarily assigned to represent categories of qualities), or numerical (also called quantitative), in which case they represent continuous values.

These variables are further categorized into scales of measurement.

• Nominal: Includes qualitative variables that can only be counted; they have no order or intervals.
  – Example: Gender, marital status
• Ordinal: Includes qualitative variables that have a concept of order, so they can be arranged into some sequence accordingly and meaningfully ranked, but without any measure of magnitude between items in that sequence. So some object A may come after some object B, but there is no measurement of the interval between the two (we can't, for instance, say that A is 10 more than B).
  – Example: Education level (some high school, high school, college, etc)
• Interval: Interval variables are quantitative variables; in some sense they are like ordinal variables that do have a measure of interval between items. But they do not have an absolute zero point, so we can't compare values as ratios (we can't, for instance, say A is twice B).
  – Example: Dates (we can say how many days there are between two dates, but, for example, we can't say one date is twice another)
• Ratio: Ratio variables are like interval variables (also quantitative variables) but have a fixed and meaningful zero point, so they can be compared as ratios.
  – Example: Age, length

6.1.2 Averages

The average of a set of data can be described as its central tendency, which gives some sense of a typical or common value for a variable. There are three types:

Arithmetic mean

Often just called the "mean" and notated $\mu$ (mu). For a dataset $\{x_1, \ldots, x_n\}$, the arithmetic mean is:

$\frac{\sum_{i=1}^n x_i}{n}$

The mean can be sensitive to extreme values (outliers), which is one reason the median is sometimes used instead. Which is to say, the median is a more robust statistic (meaning that it is less sensitive to outliers).

Note that there are other types of means, but the arithmetic mean is by far the most common.

Median

The central value in the dataset, e.g. for the data 1, 1, 2, 3, 4 the median is 2.

If there is an even number of values, you take the value halfway between the two central values: for the data 1, 1, 2, 3, 4, 4 the median is (2 + 3)/2 = 2.5.

Mode

The most frequently occurring value in the dataset, e.g. for the data 1, 2, 3, 3, 2, 3, 4, 3 the mode is 3.

6.1.3 Population vs Sample

With statistics we take a sample of a broader population, or already have data which is a sample from a population. We use this limited sample in order to learn things about the whole population.

The mean of the population is denoted $\mu$ and consists of $N$ items, whereas the mean of the sample (i.e. the sample mean, sometimes called the empirical mean) is notated $\bar{x}$ or $\hat{\mu}$ and consists of $n$ items.
The sample mean is:

$$\hat{\mu} = \frac{1}{n} \sum_i x^{(i)}$$

The sample variance is:

$$\hat{\sigma}^2 = \frac{1}{n-1} \sum_i (x^{(i)} - \hat{\mu})^2$$

The sample covariance matrix is:

$$\hat{\Sigma} = \frac{1}{n-1} \sum_i (x^{(i)} - \hat{\mu})(x^{(i)} - \hat{\mu})^T$$

These estimators are unbiased, i.e.:

$$E[\hat{\mu}] = \mu \qquad E[\hat{\sigma}^2] = \sigma^2 \qquad E[\hat{\Sigma}] = \Sigma$$

6.1.4 Independent and Identically Distributed

Often in statistics we assume that a sample is independent and identically distributed (iid); that is, that the data points are independent from one another (the outcome of one has no influence over the outcome of any of the others) and that they share the same distribution.

We say that X1, ..., Xn are iid if they are independent and drawn from the same distribution, that is P(X1) = ... = P(Xn). This can also be stated:

$$P(X_1, \dots, X_n) = \prod_i P(X_i)$$

In this case, they all share the same mean (expected value) and variance.

This assumption makes computing statistics for the sample much easier. For instance, if a sample were not identically distributed, each datapoint might come from a different distribution, in which case there are different means and variances for each datapoint which must be computed from each of those datapoints alone. They can't really be treated as a group since the datapoints aren't quite equivalent to each other (in a statistical sense).

Or, if the sample were not independent, then we lose all the conveniences that come with independence.

The iid assumption doesn't always hold (i.e. it may be violated), of course, so there are other ways of approaching such situations while minimizing complexity, such as Hidden Markov Models.

6.1.5 The Law of Large Numbers (LLN)

Let X1, ..., Xn be iid with mean µ. The law of large numbers essentially states that as a sample size approaches infinity, its mean will approach the population ("true") mean:

$$\lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^n X_i = \mu$$

6.1.6 Regression to the mean

P(Y < x | X = x) gets bigger as x approaches very large values. That is, given a very large X (an extreme), it is unlikely that Y's value is as large as or larger than X.

P(Y > x | X = x) gets bigger as x approaches very small values. That is, given a very small X (an extreme), it is unlikely that Y's value is as small as or smaller than X.

6.1.7 Central Limit Theorem (CLT)

Say you have a set of data. Even if the distribution of that data is not normal, you can divide the data into groups (samples) and then average the values of those groups. Those averages will approach the form of a normal curve as you increase the size of those groups (i.e. increase the sample size).

Let X1, ..., Xn be iid with mean µ and variance σ². Then the central limit theorem can be formalized as:

$$\frac{\sqrt{n}}{\sigma} \left( \left( \frac{1}{n} \sum_{i=1}^n X_i \right) - \mu \right) \xrightarrow{D} N(0, 1)$$

That is, the left side converges in distribution to a normal distribution with mean 0 and variance 1 as n increases.

6.1.8 Dispersion (Variance and Standard Deviation)

Dispersion is the "spread" of a distribution - how spread out its values are. The main measures of dispersion are the variance and the standard deviation.

Standard Deviation

Standard deviation is represented by σ (sigma) and describes the variation from the mean (µ, i.e. the expected value), calculated:

$$\sigma = \sqrt{E[(X - \mu)^2]} = \sqrt{E[X^2] - (E[X])^2}$$

Variance

The square of the standard deviation, that is, E[(X − µ)²], is the variance of X, usually notated σ². It can also be written:

$$\text{Var}(X) = \sigma^2 = E[X^2] - (E[X])^2 = \frac{\sum_{i=1}^N (x_i - \mu)^2}{N}$$

for a population of size N of datapoints x. That is, the variance is the expected value of the square minus the square of the expected value.
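As a quick sanity check of the two equivalent forms of the variance formula above, here is a small illustrative snippet (the data values are made up); both forms give the same population variance, and its square root is the standard deviation.

```python
import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
mu = x.mean()

var_definition = np.mean((x - mu) ** 2)      # E[(X - mu)^2]
var_shortcut = np.mean(x ** 2) - mu ** 2     # E[X^2] - (E[X])^2

print(var_definition, var_shortcut)          # identical values
print(np.sqrt(var_definition))               # the standard deviation sigma
```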
Coefficient of variation (CV)

Variance depends on the units of measurement, but this can be controlled for by computing the coefficient of variation:

$$CV = \frac{\sigma}{\bar{x}} \times 100$$

This allows us to compare variability across variables measured in different units.

Variance of a linear combination of random variables

The variance of a linear combination of (independent) random variables, e.g. aX + bY, can be computed:

$$\text{Var}(\beta_1 X_1 + \dots + \beta_n X_n) = \beta_1^2 \text{Var}(X_1) + \dots + \beta_n^2 \text{Var}(X_n)$$

Range

The range can also be used to get a sense of dispersion. The range is the difference between the highest and lowest values, but it is very sensitive to outliers.

As an alternative to the range, you can look at the interquartile range, which is the range of the middle 50% of the values (that is, the difference of the 75th and 25th percentile values). This is less sensitive to outliers.

Z score

A Z score is just the number of standard deviations a value is from the mean. It is defined:

$$Z = \frac{x - \mu}{\sigma}$$

The Empirical Rule

The empirical rule describes that, for a normal distribution, there is:

• a 68% chance that a value falls within one standard deviation
• a 95% chance that a value falls within two standard deviations
• a 99.7% chance that a value falls within three standard deviations

[Figure: the empirical rule]

Pooled standard deviation estimates

If you have reason to expect that the standard deviations of two populations are practically identical, you can use the pooled standard deviation of the two groups to obtain a more accurate estimate of the standard deviation and standard error:

$$s^2_{\text{pooled}} = \frac{s_1^2 (n_1 - 1) + s_2^2 (n_2 - 1)}{n_1 + n_2 - 2}$$

where n1, n2, s1, s2 are the sample sizes and standard deviations of the sample groups.

We must update the degrees of freedom as well, df = n1 + n2 − 2, which we can use for a new t-distribution.

6.1.9 Moments

The kth moment, mk(X), where k ∈ 1, 2, 3, ... (i.e. it is a positive integer), is E[X^k]. So the first moment is just the mean, E[X].

The kth central moment is E[(X − E[X])^k]. So the second central moment is just the variance, E[(X − E[X])²].

The third moment is the skewness, and the fourth moment is the kurtosis; they all share the same form (with different normalization terms):

$$\text{skewness} = \frac{E[(X - \mu)^3]}{\sigma^3} \qquad \text{kurtosis} = \frac{E[(X - \mu)^4]}{\sigma^4}$$

Moments have different units, e.g. the first moment might be in meters (m), the second moment would be in m², and so on, so it is typical to standardize moments by taking their kth root, e.g. √m2.

6.1.10 Covariance

The covariance describes the variance between two random variables. For random variables x and y, the covariance is (remember, E(x) denotes the expected value of random variable x):

$$\text{Cov}(x, y) = E[(x - E(x))(y - E(y))] = \frac{1}{n} \sum_i (x_i - \bar{x})(y_i - \bar{y})$$

There must be the same number of values n for each. This is simplified to:

$$\text{Cov}(x, y) = E[xy] - E[x]E[y] \approx \overline{xy} - \bar{x}\bar{y}$$

A positive covariance means that as x goes up, y goes up. A negative covariance means that as x goes up, y goes down.

Note that variance is just the covariance of a random variable with itself:

$$\text{Var}(X) = E[XX] - E[X]E[X] = \text{Cov}(X, X)$$
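Here is a small illustrative sketch of the covariance formulas above (the x and y values are made up); note that `np.cov(..., bias=True)` uses the same 1/n normalization as the definition written here.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# From the definition: E[(x - E[x])(y - E[y])]
cov_manual = np.mean((x - x.mean()) * (y - y.mean()))

# Shortcut form: E[xy] - E[x]E[y]
cov_shortcut = np.mean(x * y) - x.mean() * y.mean()

# NumPy's version (bias=True normalizes by n, matching the formulas above)
cov_numpy = np.cov(x, y, bias=True)[0, 1]

print(cov_manual, cov_shortcut, cov_numpy)  # all three agree
```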
6.1.11 Correlation

Correlation gives us a measure of relatedness between two variables. Alone it does not imply causation, but it can help guide more formal inquiries (e.g. experiments) into causal relationships. A good way to visually intuit correlation is through scatterplots.

We can measure correlation with correlation coefficients. These measure the strength and sign of a relationship (but not the slope; linear regression, detailed later, does that). Some of the more common correlation coefficients include:

• Pearson product-moment (used where both variables are on an interval or ratio scale)
• Spearman rank-order (where both variables are on ordinal scales)
• Phi (where both variables are on nominal/categorical/dichotomous/binary scales)
• Point biserial (where one variable is on a nominal/categorical/dichotomous/binary scale and the other is on an interval or ratio scale)

The Pearson and Spearman coefficients are the most commonly used ones, but sometimes the latter two are used in special cases (e.g. with categorical data).

Pearson product-moment correlation coefficient (Pearson's correlation)

$$r = \frac{\sum_{i=1}^n \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)}{n - 1}$$

Note: this is sometimes denoted as a capital R.

You may recognize this as:

$$\frac{\text{Cov}(X, Y)}{s_X s_Y}$$

Here we convert our values to standard scores, i.e. (x_i − x̄)/s_x. This standardizes the values such that their mean is 0 and their variance is 1 (and so they are unitless).

For a population, r is notated ρ (rho).

This value can range over [−1, 1], where 1 and −1 mean complete correlation and 0 means no correlation.

To test the statistical significance of the Pearson correlation coefficient, you can use the t statistic. For instance, if you believe there is a relationship between two variables, you set your null hypothesis as ρ = 0 and then, with your estimate of r, calculate the t statistic:

$$t = \frac{r}{\sqrt{\frac{1 - r^2}{n - 2}}}$$

Then look up the value in a t table.

The Pearson correlation coefficient tells you the strength and direction of a relationship, but it doesn't tell you how much variance of one variable is explained by the other. For that, you can use the coefficient of determination, which is just r². So for instance, if you have r = 0.9, then r² = 0.81, which means 81% of the variation of one variable is explained by the other.

Note that Pearson's correlation only accurately measures linear relationships; so even if you have a Pearson correlation near 0, it is still possible that there may be a strong nonlinear relationship. It's worthwhile to look at a scatter plot to verify. It is also not robust in the presence of outliers.

Spearman rank-order correlation coefficient (Spearman's rank correlation)

Here you compute ranks (i.e. the indices in the sorted sample) rather than standard scores. For example, for the dataset [1, 10, 100], the rank of the value 10 is 2 because it is second in the sorted list. Then you can compute the Spearman correlation:

$$r_s = 1 - \frac{6 \sum d^2}{n(n^2 - 1)}$$

where d is the difference in ranks for each datapoint.

Generally, you can interpret r_s in the following ways:

• 0.9 ≤ r_s ≤ 1 - very strong correlation
• 0.7 ≤ r_s < 0.9 - strong correlation
• 0.5 ≤ r_s < 0.7 - moderate correlation

You can test its statistical significance using a z test, where the null hypothesis is that r_s = 0:

$$z = r_s \sqrt{n - 1}$$

Spearman's correlation is more robust to outliers and skewed distributions.
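Here is a minimal sketch of computing both coefficients with SciPy on a small made-up dataset; the deliberate outlier in x shows why Spearman's rank-based measure tends to be more robust than Pearson's.

```python
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])   # note the outlier
y = np.array([2.0, 4.0, 5.0, 4.0, 6.0, 9.0])

r, r_p = stats.pearsonr(x, y)        # linear correlation, sensitive to the outlier
rho, rho_p = stats.spearmanr(x, y)   # rank correlation, more robust

print(f"Pearson r = {r:.3f} (p = {r_p:.3f})")
print(f"Spearman rho = {rho:.3f} (p = {rho_p:.3f})")
print(f"Coefficient of determination r^2 = {r**2:.3f}")
```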
Point-Biserial correlation coefficient This correlation coefficient is useful when comparing a categorical (binary) variable with an interval or ratio scale variable: rpbi = Mp − Mq √ pq St Where Mp is the mean for the datapoints categorized as 1 and Mq is the mean for the datapoints categorized as 0. St is the standard deviation for the interval/ratio variable, p is the proportion of datapoints categorized as 1, and q is the proportion of datapoints categorized as 0. Phi correlation coefficient This allows you to measure the correlation between two categorical (binary) variables. It is calculated like so: A = f (0, 1) B = f (1, 1) C = f (0, 0) D = f (1, 0) AD − BC rϕ = √ (A + B)(C + D)(A + C)(B + D) Where f (a, b) is the frequency of label a and label b occurring together in the data. 6.1.12 Degrees of Freedom Degrees of freedom describes the number of variables that are “free” in what value they can take. Often a given variable must be a particular value because of the values the other variables take on and some constraint(s). For example: say we have four unknown quantities x1 , x2 , x3 , x4 . We know that their mean is 5. In this case we have three degrees of freedom - this is because three of the variables are free to take arbitrary values, but once those three are set, the fourth value must be equal to x4 = 20−x1 −x2 −x3 in order for the mean to be 5 (that is, in order to satisfy the constraint). So for instance, if x1 = 2, x2 = 4, x3 = 6, then x4 must equal 8. It is not “free” to take on any other value. CHAPTER 6. STATISTICS 171 6.1. DESCRIPTIVE STATISTICS 6.1.13 172 Time Series Analysis Often data has a temporal component; e.g. you are looking for patterns over time. Generally, time series data may have the following parts: a trend, which is some function reflecting persistent changes, seasonality; that is, periodic variation, and of course there is going to be some noise - random variation - as well. Moving averages To extract a trend from a series, you can use regression, but sometimes you will be better off with some kind of moving average. This divides the series into overlapping regions, windows, of some size, and takes the averages of each window. The rolling mean just takes the mean of each window. There is also the exponentially-weighted moving average (EWMA) which gives a weighted average, such that more recent values have the highest weight, and values before that have weights which drop off exponentially. The EWMA takes an additional span parameter which determines how fast the weights drop off. Serial correlation (autocorrelation) In time series data you may expect to see patterns. For example, if a value is low, it may stay low for a bit, if it’s high, it may stay high for a bit. These types of patterns are serial correlations, also called autocorrelation (so-called because it is correlated a dataset with itself, in some sense), because the values correlate in their sequence. You can compute serial correlation by shifting the time series by some interval, called a lag, and then compute the correlation of the shifted series with the original, unshifted series. 6.1.14 Survival Analysis Survival analysis describes how long something lasts. It can refer to the survival of, for instance, a person - in the context of disease, a 5-year survival rate is the probability of surviving 5 years after diagnosis, for example - or a mechanical component, and so on. 
More broadly it can be seen as looking at how long something lasts until something happens - for instance, how long until someone gets married. A survival curve is a function S(t) which computes the probability of surviving longer than duration t. Such a duration is called a lifetime. The survival curve ends up just being the complement of the CDF: S(t) = 1 − CDF(t) Looking at it this way, the CDF is the probability of a lifetime less than or equal to t. 172 CHAPTER 6. STATISTICS 173 6.2. INFERENTIAL STATISTICS Hazard function A hazard function tells you the fraction of cases that continue until t and then end at t. It can be computed from the survival curve: λ(t) = S(t) − S(t + 1) S(t) Hazard functions are also used for estimating survival curves. Estimating survival curves: Kaplan-Meier estimation Often we do not have the CDF of lifetimes so we can’t easily compute the survival curve. We often have non-survival cases alongside have survival cases, where we don’t yet know what their final lifetime will be. Often, as is the case in the medical context, we don’t want to wait to learn what these unknown lifetimes will be. So we need to estimate the survival curve with the data we do have. The Kaplan-Meier estimation allows us to do this. We can use the data we have to estimate the hazard function, and then convert that into a survival curve. We can convert a hazard function into an estimate of the survival curve, where each point at time t is computed by taking the product of complementary hazard functions through that time t, like so: ∏ (1 − λ(t)) t 6.2 Inferential Statistics Statistical inference is the practice of using statistics to infer some conclusion about a population based on only a sample of that population. This can be the population’s distribution - we want to infer from the sample data what the “true” distribution (the population distribution) is and the unknown parameters that define it. Generally, data is generated by some process; this data-generating process is also noisy; that is, there is a relatively small degree of imprecision or fluctuation in values due to randomness. In inferential statistics, we try to uncover the particular function that describes this process as closely as possible. We do so by choosing a model (e.g. if we believe it can be modeled linearly, we might choose linear regression, otherwise we might choose a different kind of model such as a probability distribution; modeling is covered in greater detail in the machine learning part). Once we have chosen the model, then we need to determine the parameters (linear coefficients, for example, or mean and variance for a probability distribution) for that model. Broadly, the two paradigms of inference are frequentist, which relies on long-run repetitions of an event, that is, it is empirical (and could be termed the “conventional” or “traditional” framework, though there’s a lot of focus on Bayesian inference now) and Bayesian, which is about generating a CHAPTER 6. STATISTICS 173 6.2. INFERENTIAL STATISTICS 174 hypothesis distribution (the prior) and updating it as more evidence is acquired. Bayesian inference is valuable because there are many events which we cannot repeat, but we still want to learn something about. The frequentist believes these unknown parameters have precise “true” values which can be (approximately) uncovered. In frequentist statistics, we can estimate these exact values. When we estimate a single value for an unknown, that estimation is called a point estimate. 
This is in contrast to describing a value estimate as a probability distribution, which is the Bayesian method. The Bayesian believes that we cannot express these parameters as single values and should rather describe them as distributions of possible values to be explicit about their uncertainty. Here we focus on frequentist inference; Bayesian inference is covered in a later chapter.

In frequentist statistics, the factor of noise means that we may see relationships (and thus come up with non-zero parameters) where they don't exist, just because of the random noise. This is what p-values are meant to compensate for - if the relationship truly did not exist, what's the probability, given the data, that we'd see the non-zero parameter estimate that we computed? Generally if this probability is less than 0.05 (i.e. p < 0.05) then we accept the result.

Often with statistical inference you are trying to quantify some difference between groups (which can be framed as measuring an effect size) or testing if some data supports or refutes some hypothesis, and then trying to determine whether or not this difference or effect can be attributed to chance (this is covered in the section on experimental statistics).

A word of caution: many statistical tools work only under certain conditions, e.g. assumptions of independence, or for a particular distribution, or a large enough sample size, or lack of skew, and so on - so before applying statistical methods and drawing conclusions, make sure the tools are appropriate for the data. And of course you must always be cautious of potential biases involved in the data collection process.

6.2.1 Error

Dealing with error is a big part of statistics and some error is unavoidable (noise is natural). There are three kinds of error:

• Systemic error (systemic flaws in the data collection, e.g. sampling bias)
• Measurement error (due to imprecise instruments, for instance)
• Random error (natural noise, due to chance, uncontrollable, but in theory its effect is minimized if many measurements are taken)

We never know the true value of something, only what we observe by imprecise means, so we always must grapple with error.

6.2.2 Estimates and estimators

We can think of the population as representing the underlying data generating process and consider these parameters as functions of the population. To estimate these parameters from the sample data, we use estimators, which are functions of the sample data that return an estimate for some unknown value. Essentially, any statistic is an estimator.

For instance, we may estimate the population mean by using the sample mean as our estimator. Or we may estimate the population variance as the sample variance. And so on.

Bias

Estimators may be biased for small sample sizes; that is, they tend to have more error for small sample sizes. Say we are estimating a parameter θ. The bias of an estimator θ̂_m is:

$$\text{bias}(\hat{\theta}_m) = E[\hat{\theta}_m] - \theta$$

where m is the number of samples.

There are unbiased estimators as well, which have an expected mean error (against the population parameter) of 0. That is, bias(θ̂) = 0, which can also be stated as E[θ̂] = θ. For example, an unbiased estimator for the population variance σ² is:

$$\frac{1}{n-1} \sum_i (x_i - \bar{x})^2$$

An estimator is asymptotically unbiased if lim_{m→∞} bias(θ̂_m) = 0, that is, if lim_{m→∞} E[θ̂_m] = θ.

Generally, unbiased estimators are preferred, but sometimes biased estimators have other properties which make them useful.
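A quick simulation can make the bias concrete. The sketch below (assuming a normal population with σ² = 4 and deliberately small samples) compares the 1/n variance estimator to the unbiased 1/(n − 1) version; the numbers are illustrative, not part of the original notes.

```python
import numpy as np

rng = np.random.default_rng(0)
true_var = 4.0           # population variance (sigma = 2)
n, trials = 5, 100_000   # small samples make the bias visible

samples = rng.normal(loc=0.0, scale=2.0, size=(trials, n))

biased = samples.var(axis=1, ddof=0).mean()    # divides by n
unbiased = samples.var(axis=1, ddof=1).mean()  # divides by n - 1

print(f"true variance: {true_var}")
print(f"mean of biased estimator   (1/n):     {biased:.3f}")   # ~3.2, underestimates
print(f"mean of unbiased estimator (1/(n-1)): {unbiased:.3f}")  # ~4.0
```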
For an estimate, we can measure its standard error (SE), which describes how much we expect the estimate to be off by, on average. It can also be stated as:

$$SE(\hat{\theta}) = \sqrt{\text{Var}(\hat{\theta})}$$

"Standard error" sometimes refers to the standard error of the mean, which is the standard deviation of the mean:

$$SE(\hat{\mu}_m) = \sqrt{\text{Var}\left(\frac{1}{m} \sum_{i=1}^m x^{(i)}\right)} = \frac{\sigma}{\sqrt{m}}$$

Much of statistical inference is concerned with measuring the quality of these estimates.

6.2.3 Consistency

Even when we use a biased estimator, we generally still want our point estimates to converge to the true value of the parameter. This property is called consistency. For some error ϵ > 0, we want lim_{m→∞} P(|θ̂_m − θ| > ϵ) = 0. That is, as we increase our sample size, we want the probability that the absolute difference between the estimate and the true value is greater than ϵ to approach 0.

6.2.4 Point Estimation

Given an unknown population parameter, we may want to estimate a single value for it - this estimate is called a point estimate. Ideally, the estimate is as close to the true value as possible.

The estimation formula (the function which yields an estimate) is called an estimator and is a random variable (so there is some underlying distribution). A particular value of the estimator is the estimate.

A simple example: we have a series of trials with some number of successes. We want an estimate for the probability of success of the event we looked at. Here an obvious estimate is the number of successes over the total number of trials, so our estimator would be x/N and - say we had 40 successes out of 100 trials - our estimate would be 0.4.

We consider a "good" estimator one whose distribution is concentrated as closely as possible around the parameter's true value (that is, it has a small variance). Generally this becomes the case as more data is collected.

We can take multiple samples (of a fixed size) from a population and compute a point estimate (e.g. for the mean) from each. Then we can consider the distribution of these point estimates - this distribution is called a sampling distribution. The standard deviation of the sampling distribution describes the typical error of a point estimate, so this standard deviation is known as the standard error (SE) of the estimate.

Alternatively, if you have only one sample, the standard error of the sample mean x̄ can be computed (where n is the size of the sample):

$$SE_{\bar{x}} = \sigma_{\bar{x}} = \frac{\sigma_x}{\sqrt{n}}$$

This however requires the population standard deviation, σ_x, which probably isn't known - but we can use a point estimate for that as well; that is, you can just use s, the sample standard deviation, instead (provided that the sample size is at least 30, as a rule of thumb, and the population distribution is not strongly skewed).

Also remember that the distribution of sample means approximates a normal distribution, with better approximation as sample size increases, as described by the central limit theorem. Some other point estimates' sampling distributions also approximate a normal distribution. Such point estimates are called normal point estimates. There are other such computations for the standard error of other estimates as well.

We say a point estimate is unbiased if the sampling distribution of the estimate is centered at the parameter it estimates.
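To illustrate the sampling distribution and its standard error, here is a minimal simulation sketch (the population parameters are made up): it draws many samples, records each sample mean, and compares the spread of those means to the σ/√n formula above.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, n, trials = 10.0, 50, 20_000

# Draw many samples of size n and record each sample mean
sample_means = rng.normal(loc=100.0, scale=sigma, size=(trials, n)).mean(axis=1)

empirical_se = sample_means.std(ddof=1)  # spread of the sampling distribution
theoretical_se = sigma / np.sqrt(n)      # SE of the mean: sigma / sqrt(n)

print(f"empirical SE:   {empirical_se:.3f}")
print(f"theoretical SE: {theoretical_se:.3f}")  # both ~1.414
```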
6.2.5 Nuisance Parameters Nuisance parameters are values we are not directly interested in, but still need to be dealt with in order to get at what we are interested in. 6.2.6 Confidence Intervals Rather than provide a single value estimate of a population parameter, that is, a point estimate, it can be better to provide a range of values for the estimate instead. This range of values is a confidence interval. The confidence interval is the range of values where an estimate is likely to fall with some percent probability. Confidence intervals are expressed in percentages, e.g. the “95% confidence interval”, which describes the plausibility that the parameter is in that interval. It does not imply a probability (that is, it does not mean that the true parameter has a 95% chance of being in that interval), however. Rather, the 95% confidence interval is the range of values in which, over repeated experimentation, in 95% of the experiments, that confidence interval will contain the true value. To put it another way, for the 95% confidence interval, out of every 100 experiments, at least 95 of their confidence intervals will contain the true parameter value. You would say “We are 95% confident the population parameter is in this interval”. Confidence intervals are a tool for frequentist statistics, and in frequentist statistics, unknown parameters are considered fixed (we don’t express them in terms of probability as we do in Bayesian statistics). So we do not associate a probability with the parameter. Rather, the confidence interval itself is the random variable, not the parameter. To put it another way, we are saying that 95% of the intervals we would generate from repeated experimentation would contain the real parameter but we aren’t saying anything about the parameter’s value changing, just that the intervals will vary across experiments. The mathematical definition of the 95% confidence interval is (where θ is the unknown parameter): P (a(Y ) < θ < b(Y )|θ) = 0.95 Where a, b are the endpoints of the interval, calculated according to the sampling distribution of Y . We condition on θ because, as just mentioned, in frequentist statistics, the parameters are fixed and the data Y is random. We can compute the 95% confidence interval by taking the point estimate (which is the best estimate for the value) and ±2 SE, that is build an interval within two standard errors of the point estimate. The interval we add or subtract to the point estimate (here it is 2 SE) is called the margin of error. The value we multiply the SE with is essentially a Z score, so we can more generally describe the margin of error as z SE. CHAPTER 6. STATISTICS 177 6.3. EXPERIMENTAL STATISTICS 178 For the confidence interval of the mean, we can be more precise and look within ±1.96 SE (that is, z = 1.96) of the point estimate x̄ for the 95% confidence interval (that is, our 95% confidence interval would be x̄ ± 1.96 SE). This is because we know to the sampling distribution for the sample means approximates a normal distribution (for sufficiently large sample sizes, n ≥ 30 is a rule of thumb) according to the central limit theorem. 6.2.7 Kernel Density Estimates Sometimes we don’t want the parameters of our data’s distribution, but just a smoothed representation of it. Kernel density estimation allows us to get this representation. It is a nonparametric method because it makes no assumptions about the form of the underlying distribution (i.e. no assumptions about its parameters). 
Kernel Density Estimation example Some kernel function (which generates symmetric densities) is applied to each data point, then the density estimate is formed by summing the densities. The kernel function determines the shape of these densities and the bandwidth parameter, h > 0, determines their spread and the smoothing of the estimate. Typically, a Gaussian kernel function is used, so the bandwidth is equivalent to the variance. In this figure, the grey curve is the true density, the red curve is the KDE with h = 0.05, the black curve is the KDE with h = 0.337, and the green curve is the KDE with h = 2. 6.3 Experimental Statistics Experimental statistics is concerned with hypothesis testing, where you have a hypothesis and want to learn if your data supports it. That is, you have some sample data and an apparent effect, and you want to know if there is any reason to believe that the effect is genuine and not just by chance. Often you are comparing two or more groups; more specifically, you are typically comparing statistics across these groups, such as their means. For example, you want to see if the difference of their means is statistically significant; which is to say, likely that it is a real effect and not just chance. 178 CHAPTER 6. STATISTICS 179 6.3. EXPERIMENTAL STATISTICS KDE bandwidth comparisons The “classical” approach to hypothesis testing, null hypothesis significance testing (NHST), follows this general structure: 1. Quantify the size of the apparent effect by choosing some test statistic, which is just a summary statistic which is useful for hypothesis testing or identifying p-values. For example, if you have two populations you’re looking at, this could be the difference in means (of whatever you are measuring) between the two groups. 2. Define a null hypothesis, which is usually that the apparent effect is not real. 3. Compute a p-value, which is the probability of seeing the effect if the null hypothesis is true. 4. Determine the statistical significance of the result. The lower the p-value, the more significant the result is, since the less likely it is to have just occurred by chance. Broadly speaking, there are two types of scientific studies: observational and experimental. In observational studies, the research cannot interfere while recording data; as the name implies, the involvement is merely as an observer. Experimental studies, however, are deliberately structured and executed. They must be designed to minimize error, both at a low level (e.g. imprecise instruments or measurements) and at a high-level (e.g. researcher biases). 6.3.1 Statistical Power The power of a study is the likelihood that it will distinguish an effect of a certain size from pure luck. - Statistical power and underpowered statistics, Alex Reinhart Statistical power, sometimes called sensitivity, can be defined as the probability of rejecting the null hypothesis when it is false. If β is the probability of a type II error (i.e. failing to reject the null hypothesis when it’s false), then power = 1 − β. Power… CHAPTER 6. STATISTICS 179 6.3. 
EXPERIMENTAL STATISTICS 180 • Increases as n (sample size) increases • Increases as σ decreases (less variability) • Is higher for a one-sided test than for its associated two-sided test 6.3.2 Sample Selection Bias can enter studies primarily in two ways: • in the process of selecting the objects to study (sampling and retention) • in the process of collection information about the objects To prevent selection bias (selecting samples in such a way that it encourages a particular outcome, whether done consciously or not), sample selection may be random. In the case of medical trials and similar studies, random allocation is ideally double blind, so that neither the patient nor the researchers know which treatment a patient is receiving. Another sample selection technique is stratified sampling, in which the population is divided into categories (e.g. male and female) and samples are selected from those subgroups. If the variable used for stratification is strongly related to the variable being studied, there may be better accuracy from the sample size. You need large sample sizes because with small sample sizes, you’re more sensitive to the effects of chance. e.g. if I flip a coin 10 times, it’s feasible that I get heads 6/10 times (60% of the time). With that result I couldn’t conclusively say whether or not that coin is rigged. If I flip that coin 1000 times, it’s extremely unlikely that I will get heads 60% of the time (600/1000 times) if it were a fair coin. Sometimes to increase sample size, a researcher may use a technique called “replication”, which is simply repeating the measurements with new samples. but some researchers really only “pseudoreplicate”. samples should be as independent from each other as possible - otherwise you have too many confounding factors. in medical research, researchers may sample a single patient multiple times, every week for instance, and treat each week’s sample as a distinct sample. this is pseudoreplication - you begin to inflate other factors particular to that patient in your results. another example is - say you wanted to measure pH levels in soil samples across the US. well, you cant sample soil 15ft from each other because they are too dependent on each other: Operationalization Operationalization is the practice of coming up with some way of measuring something which cannot be directly measured, such as intelligence. This may be accomplished via proxy measurements. 6.3.3 The Null Hypothesis In an experiment, the null hypothesis, notated H0 , is the “status quo”. For example, in testing whether or not a drug has an impact on a disease, the null hypothesis would be that the drug has no effect. 180 CHAPTER 6. STATISTICS 181 6.3. EXPERIMENTAL STATISTICS When running an experiment, you do it under the assumption that the null hypothesis is true. Then you ask: what’s the probability of getting the results you got, assuming the null hypothesis is true? If that probability is very small, the null hypothesis is likely false. This probability - of getting your results if the null hypothesis were true - is called the P value. 6.3.4 Type 1 Errors A type 1 error is one where the null hypothesis is rejected, even though it is true. Type 1 errors are usually presented as a probability of them occurring, e.g. a “0.5% chance of a type 1 error” or a “type 1 error with probability of 0.01”. 6.3.5 P Values P values are central to null hypothesis significance testing (NHST), but they are commonly misunderstood. 
P values do not:

• tell you the probability of the null hypothesis being true
• tell you the probability of any hypothesis being true
• prove or disprove hypotheses

There's no mathematical tool to tell you if your hypothesis is true; you can only see whether it is consistent with the data, and if the data is sparse or unclear, your conclusions are uncertain. - Statistics Done Wrong, Alex Reinhart

So what is it then? The P value is the probability of seeing your results or data if the null hypothesis were true. That is, given data D and a hypothesis H, where H0 is the null hypothesis, the P value is merely:

$$P(D|H_0)$$

If instead we want to find the probability of our hypothesis given the data, that is, P(H|D), we have to use Bayesian inference instead:

$$P(H|D) = \frac{P(D|H)P(H)}{P(D|H)P(H) + P(D|\neg H)P(\neg H)}$$

Note that P values are problematic when testing multiple hypotheses (multiple testing or multiple comparisons) because any "significant" results (as determined by P value comparisons, e.g. p < 0.05) may be deceptively so, since that result may still have just been chance, as the following comic illustrates. That is, the more significance tests you conduct, the more likely you will make a Type 1 Error.

[Figure: xkcd, "Significant"]

In this comic, 20 hypotheses are tested, so with a significance level at 5%, it's expected that at least one of those tests will come out significant by chance. In the real world this may be problematic in that multiple research groups may be testing the same hypothesis and chances may be such that one of them gets significant results.

6.3.6 The Base Rate Fallacy

A very important shortcoming to be aware of is the base rate fallacy. A P value cannot be considered in isolation. The base rate of whatever occurrence you are looking at must also be taken into account.

Say you are testing 100 treatments for a disease, and it's a very difficult disease to treat, so there's a low chance (say 1%) that a treatment will actually be successful. This is your base rate. A low base rate means a higher probability of false positives - treatments which, during the course of your testing, may appear to be successful but are in reality not (i.e. their success was a fluke). A good example is the mammogram test example (see The p value and the base rate fallacy).

A p value is calculated under the assumption that the medication does not work and tells us the probability of obtaining the data we did, or data more extreme than it. It does not tell us the chance the medication is effective. (The p value and the base rate fallacy, Alex Reinhart)

6.3.7 False Discovery Rate

The false discovery rate is the expected proportion of false positives (Type 1 errors) amongst hypothesis tests. For example, if we have a maximum FDR of 0.10 and we have 1000 observations which seem to indicate a significant hypothesis, then we can expect 100 of those observations to be false positives.

The q value for an individual hypothesis is the minimum FDR at which the test may be called significant.
Say you run multiple comparisons and have the following values:

• m = the total number of hypotheses tested (number of comparisons)
• m0 = the number of true null hypotheses (H0)
• m − m0 = the number of true alternative hypotheses (Hi)
• V = the number of false positives (Type 1 errors)
• S = the number of true positives
• T = the number of false negatives (Type 2 errors)
• U = the number of true negatives
• R = V + S = the number of hypotheses declared significant

We can calculate the FDR as:

$$FDR = E\left[\frac{V}{V + S}\right] = E\left[\frac{V}{R}\right]$$

Note that V/R = 0 if R = 0.

6.3.8 Alpha Level

The value that you select to compare the p-value to, e.g. 0.05 in the comic, is the alpha level ᾱ, also called the significance level, of an experiment. Your alpha level should be selected according to the number of tests you'll be conducting in an experiment. There are some approaches to help adjust the alpha level.

The Bonferroni Correction

The highly conservative Bonferroni Correction can be used as a safeguard. You divide whatever your significance level ᾱ is by the number of statistical tests t you're doing:

$$\alpha_p = \frac{\bar{\alpha}}{t}$$

α_p is the per-comparison significance level which you apply for each individual test, and ᾱ is the maximum experiment-wide significance level, called the maximum familywise error rate (FWER).

The Sidak Correction

A more sensitive correction, the Sidak Correction, can also be used:

$$\alpha_p = 1 - (1 - \bar{\alpha})^{\frac{1}{n}}$$

For n independent comparisons, α, the experiment-wide significance level (the FWER), is:

$$\alpha = 1 - (1 - \alpha_p)^n$$

For n dependent comparisons, use:

$$\alpha \leq n \alpha_p$$

6.3.9 The Benjamini-Hochberg Procedure

Approaches like the Bonferroni correction lower the alpha level, which ends up decreasing your statistical power - that is, you fail to detect true effects as well as false ones. And with such an approach, you are still susceptible to the base rate fallacy, and may still have false positives.

So how can you calculate the false discovery rate? That is, what fraction of the statistically significant results are false positives?

You can use the Benjamini-Hochberg procedure, which tells you which P values to consider statistically significant:

1. Perform your statistical tests and get the P value for each. Make a list and sort it in ascending order.
2. Choose a false-discovery rate q. The number of statistical tests is m.
3. Find the largest p value such that p ≤ iq/m, where i is the P value's place in the sorted list.
4. Call that P value and all smaller than it statistically significant.

The procedure guarantees that out of all statistically significant results, no more than q percent will be false positives.

The Benjamini-Hochberg procedure is fast and effective, and it has been widely adopted by statisticians and scientists in certain fields. It usually provides better statistical power than the Bonferroni correction and friends while giving more intuitive results. It can be applied in many different situations, and variations on the procedure provide better statistical power when testing certain kinds of data.

Of course, it's not perfect. In certain strange situations, the Benjamini-Hochberg procedure gives silly results, and it has been mathematically shown that it is always possible to beat it in controlling the false discovery rate. But it's a start, and it's much better than nothing. - (Controlling the false discovery rate, Alex Reinhart)
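As a concrete sketch of the four steps above, here is a minimal implementation of the procedure; the p-values are made up purely for illustration.

```python
import numpy as np

def benjamini_hochberg(p_values, q=0.10):
    """Return a boolean mask of which p-values are declared significant
    at false discovery rate q, following the Benjamini-Hochberg procedure."""
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)                       # sort p-values ascending
    thresholds = (np.arange(1, m + 1) / m) * q  # i*q/m for each rank i
    below = p[order] <= thresholds
    significant = np.zeros(m, dtype=bool)
    if below.any():
        cutoff = np.max(np.where(below)[0])     # largest rank i with p_(i) <= i*q/m
        significant[order[: cutoff + 1]] = True # that p-value and all smaller ones
    return significant

p_vals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(p_vals, q=0.10))
```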
6.3.10 Sum of Squares

The sum of squares within (SSW)

$$SSW = \sum_{i=1}^m \sum_{j=1}^n (x_{ij} - \bar{x}_i)^2$$

• This shows how much of SST is due to variation within each group, i.e. variation from that group's mean.
• The degrees of freedom here is calculated m(n − 1).

The sum of squares between (SSB)

$$SSB = \sum_{i=1}^m n (\bar{x}_i - \bar{\bar{x}})^2$$

• This shows how much of SST is due to variation between the group means.
• The degrees of freedom here is calculated m − 1.

The total sum of squares (SST)

$$SST = \sum_{i=1}^m \sum_{j=1}^n (x_{ij} - \bar{\bar{x}})^2$$

$$SST = SSW + SSB$$

• Note: x̄̄ is the mean of means, or the "grand mean".
• This is the total variation for the groups.
• The degrees of freedom here is calculated mn − 1.

6.3.11 Statistical Tests

Two-sided tests

Asks "What is the chance of seeing an effect as big as the observed effect, without regard to its sign?" That is, you are looking for any effect, increase or decrease.

One-sided tests

Asks "What is the chance of seeing an effect as big as the observed effect, with the same sign?" That is, you are looking for either only an increase or only a decrease.

Unpaired t-test

The most basic statistical test, used when comparing the means from two groups. Used for small sample sizes. The t-test returns a p-value.

Paired t-test

The paired t-test is a t-test used when each datapoint in one group corresponds to one datapoint in the other group.

Chi-squared test

When comparing proportions of two populations, it is common to use the chi-squared statistic:

$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$$

where O_i is the observed frequencies and E_i is the expected frequencies.

Say for example you want to test if a coin is fair. You expect that, if it is fair, you should see about 50/50 heads and tails - this describes your expected frequencies. You flip the coin and observe the actual resulting frequencies - these are your observed frequencies. The chi-squared test allows you to determine if these frequencies differ significantly.

ANOVA (Analysis of Variance)

ANOVA, ANCOVA, MANOVA, and MANCOVA are various ways of comparing different groups.

• ANOVA - group A is given a placebo and group B is given the actual medication, and the outcome variable to compare is how many pounds were lost
• ANCOVA - same as ANOVA but now there is an additional covariate we consider, e.g. hours of exercise per day
• MANOVA and MANCOVA are multivariate counterparts to the above; for instance we may consider cholesterol levels in addition to weight loss

ANOVA is used to compare three or more groups. It uses a single test to compare the means across multiple groups simultaneously, which avoids using multiple tests to make multiple comparisons (which can lead to differences across groups resulting from chance). There are a few requirements:

• the observations are independent within and across groups
• the data within each group are nearly normal
• the variance in the groups is about equal across groups

ANOVA tests the null hypothesis that the means across groups are the same (that is, that µ1 = · · · = µk, if there are k groups), with the alternate hypothesis being that at least one mean is different. We look at the variability in the sample means and see if it is so large that it is unlikely to have been due to chance. The variability we use is the mean square between groups (MSG), which has degrees of freedom df_G = k − 1.
The MSG is calculated: k 1 ∑ 1 SSG = MSG = ni (x̄i − x̄)2 df G k − 1 i=1 CHAPTER 6. STATISTICS 187 6.3. EXPERIMENTAL STATISTICS 188 Where the SSG is the sum of squares between groups and ni is the sample size of group i out of k total groups. x̄ is the mean of outcomes across all groups. We need a value to compare the MSG to, which is the mean square error (MSE), which measures the variability within groups and has degrees of freedom df E = n − k. The MSE is calculated: MSE = 1 SSE df E Where the SSE is the sum of squared errors and is computed as: SSE = SST − SSG Where the SSG is same as before and the SST is the sum of squares total: SST = n ∑ (xi − x̄)2 i=1 SSG = k ∑ ni (x̄i − x̄)2 i=1 ANOVA uses a test statistic called F , which is computed: F = MSG MSE When the null hypothesis is true, difference in variability across sample means should be due only to chance, so we expect MSG and MSE to be about equal (and thus F to be close to 1). We take this F statistic and use it with a test called the F test, where we compute a p-value from the F statistic, using the F distribution, which has the parameters df 1 and df 2 . We expect ANOVA’s F statistic to follow an F distribution with parameters df 1 = df G , df 2 = df E if the null hypothesis is true. One-Way ANOVA Similar to a t-test but used to compare three or more groups. With ANOVA, you calculate the F statistic, assuming the null hypothesis1 : F = 1 188 SSB m−1 SSW m(n−1) Remember that SSB is the “sum of squares between” and SSW is the “sum of squares within”. CHAPTER 6. STATISTICS 189 6.3. EXPERIMENTAL STATISTICS Two-Way ANOVA Allows you to compare the means of two or more groups when there are multiple variables or factors to be considered. One-tailed & two-tailed tests In a two-tailed test, both tails of a distribution are considered. For example, with a drug where you’re looking for any effect, positive or negative. In a one-tailed, only one tail is considered. For example, you may be looking only for a positive or only for a negative effect. 6.3.12 Effect Size A big part of statistical inference is measuring effect size, which more generally is trying to quantify differences between groups, but typically just referred to as “effect size”. There are a few ways of measuring effect size: Difference in means The difference in means, e.g. µ1 − µ2 But this has a few problems: • Must be expressed in the units of measure of the mean (e.g. ft, kg, etc), so it can be difficult to compare to other studies • Needs more context about the distributions (e.g. standard deviation) to understand if the difference is large or not Distribution overlap The overlap between the two distributions: Choose some threshold between the two means, e.g. • The midpoint between the means: 2 µ2 • Where the PDFs cross: σ1 µσ11 +σ +σ2 µ1 +µ2 2 Count how many in the first group are below the threshold, call it m1 Count how many in the second group are above the threshold, call it m2 . The overlap then is: m1 m2 + n1 n2 CHAPTER 6. STATISTICS 189 6.3. EXPERIMENTAL STATISTICS 190 Where n1 , n2 are the sample sizes of the first and second groups, respectively. This overlap can also be framed as a misclassification rate, which is just overlap 2 . These measures are unitless, which makes them easy to compare across studies. Probability of superiority The “probability of superiority” is the probability that a randomly chosen datapoint from group 1 is greater than a randomly chosen datapoint from group 2. This measure is also unitless. 
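Here is a small illustrative sketch of these two unitless measures, computed from two simulated groups (the group means and sizes are assumptions made purely for the example); the overlap uses the midpoint threshold described above.

```python
import numpy as np

rng = np.random.default_rng(2)
group1 = rng.normal(loc=2.0, scale=1.0, size=1000)  # simulated "higher" group
group2 = rng.normal(loc=0.0, scale=1.0, size=1000)  # simulated "lower" group

# Overlap, using the midpoint between the means as the threshold
threshold = (group1.mean() + group2.mean()) / 2
overlap = np.mean(group1 < threshold) + np.mean(group2 > threshold)
misclassification_rate = overlap / 2

# Probability of superiority: P(random point from group 1 > random point from group 2)
prob_superiority = np.mean(rng.choice(group1, 10_000) > rng.choice(group2, 10_000))

print(f"overlap: {overlap:.3f}")
print(f"misclassification rate: {misclassification_rate:.3f}")
print(f"probability of superiority: {prob_superiority:.3f}")
```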
Cohen's d

Cohen's d is the difference in means, divided by the standard deviation, which is computed from the pooled variance, σp², of the groups:

$$\sigma_p^2 = \frac{n_1 \sigma_1^2 + n_2 \sigma_2^2}{n_1 + n_2}$$

$$d = \frac{\mu_1 - \mu_2}{\sqrt{\sigma_p^2}}$$

This measure is also unitless. Different fields have different intuitions about how big a d value is; it's something you have to learn.

6.3.13 Reliability

Reliability refers to how consistent or repeatable a measurement is (for continuous data). There are three main approaches:

Multiple-occasions reliability

Aka test-retest reliability. This is how a test holds up over repeated testing, e.g. "temporal stability". This assumes the underlying metric does not change.

Multiple-forms reliability

Aka parallel-forms reliability. This asks: how consistent are different tests at measuring the same thing?

Internal consistency reliability

This asks: do the items on a test all measure the same thing?

6.3.14 Agreement

Agreement is similar to reliability, but used more for discrete data.

Percent agreement

$$\frac{\text{number of cases where tests agreed}}{\text{all cases}}$$

Note that a high percent agreement may be obtained by chance.

Cohen's kappa

Often just called kappa, this corrects for the possibility of chance agreement:

$$\kappa = \frac{p_o - p_e}{1 - p_e}$$

where p_o is the observed agreement, that is, (num. agreements)/(total cases), and p_e is the expected agreement. Kappa ranges from -1 to 1, where 1 is perfect agreement.

6.4 Handling Data

6.4.1 Transforming data

Occasionally you may find data easier to work with if you apply a transformation to it; that is, rescale it in some way. For instance, you might take the natural log of your values, or the square root, or the inverse. This can reduce skew and the effect of outliers or make linear modeling easier. The function which applies this transformation is called a link function.

6.4.2 Dealing with missing data

Data can be missing for a few reasons:

• Missing completely at random (MCAR) - missing cases are identical to non-missing cases, on average.
• Missing at random (MAR) - missing data depends on measured values, so it can be modeled by other observed variables.
• Missing not at random (MNAR) - missing data depends on unmeasured/unknown variables, so there is no way to account for them.

There are a few strategies for dealing with missing data. The worst you can do is to ignore the missing data and try to run your analysis, missing data and all (it likely won't and probably shouldn't work). Alternatively, you can delete all datapoints which have missing data, leaving only complete datapoints - this is called complete case analysis. Complete case analysis makes the most sense with MCAR missing data - you will have a reduction in sample size, and thus a reduction in statistical power, as a result, but your inference will not be biased. The possibly systemic nature of missing data in MAR and MNAR means that complete case analysis may overlook important details for your model.

You also have the option of filling in missing values - this is called imputation (you "impute" the missing values). You can, for instance, fill in missing values with the mean of that variable. You don't gain any of the information that was missing, and you end up ignoring the uncertainty associated with the fill-in value (and the resulting variances will be artificially reduced), but you at least get to maintain your sample size.
Again, bias may be introduced in MAR and MNAR situations since the missing data may be due to some systemic cause. One of the better approaches is multiple imputation, which produces unbiased parameter estimates and accounts for the uncertainty of imputed values. A regression model is used to generated the imputed values, and does well especially under MAR conditions - the regression model may be able to exploit info in the dataset about the missing data. If some known values correlate with the missing values, they can be of use in this way. Then, instead of using the regression model to produce one value for each missing value, multiple values are produced, so that the end result is multiple copies of your dataset, each with different imputed values for the missing values. Your perform your analysis across all datasets and average the produced estimates. 6.4.3 Resampling Resampling involves repeatedly drawing subsamples from an existing sample. Resampling is useful for assessing and selecting models and for estimating the precision of parameter estimates. A common resampling method is bootstrapping. Bootstrapping Bootstrapping is a resampling method to approximate the true sampling distribution of a dataset, which can then be used to estimate the mean and the variance of the distribution. The advantage with bootstrapping is that there is no need to compute derivatives or make assumptions about the distribution’s form. You take R samples Si∗ , with replacement, each of size n (i.e. each resample is the same size as the original sample), from your dataset. These samples, S ∗ = S1∗ , . . . , SR∗ are called replicate bootstrap samples. Then you can compute an estimate of the t statistic for each of the bootstrap samples, Ti∗ = t(Si∗ ). 192 CHAPTER 6. STATISTICS 193 6.5. REFERENCES Then you can estimate the mean and variance: ∑ ∗ Ti∗ R ∑ (T ∗ − T̄ ∗ )2 ∗ ˆ Var(T )= i i R−1 ∗ T̄ = Ê[T ] = i With bootstrap estimates, there are two possible sources of error. You may have the sampling error from your original sample S in addition to the bootstrap error, from failing to be comprehensive in your sampling of bootstrap samples. To avoid the latter, you should try to choose a large R, such as R = 1000. 6.5 • • • • • • • • • • • • • • • • • • • • • • • • References Review of fundamentals, IFT725. Hugo Larochelle. 2012. Statistical Inference Course Notes, Xing Su Regression Models Course Notes, Xing Su Statistics in a Nutshell. Second Edition. Sarah Boslaugh. What is the difference between descriptive and inferential statistics?. Jeromy Anglim. Understanding Variance, Co-Variance, and Correlation. Count Bayesie. Will Kurt. Think Stats: Exploratory Data Analysis in Python. Version 2.0.27. Allen B Downey. Principles of Statistics, M.G. Bulmer. 1979. OpenIntro Statistics. Second Edition. David M Diez, Christopher D Barr, Mine ÇetinkayaRundel. Computational Statistics I. Allen Downey. SciPy 2015. Computational Statistics II (code). Chris Fonnesbeck. SciPy 2015. Bayesian Statistical Analysis. Chris Fonnesbeck. SciPy 2014. Lecture Notes from CS229 (Stanford). Data Analysis Using Regression and Multilevel/Hierarchical Models. First edition. Andrew Gelman and Jennifer Hill. Frequentism and Bayesianism: A Practical Introduction. Jake Vanderplas Machine Learning. 2014. Andrew Ng. Stanford University/Coursera. Introduction to Artificial Intelligence (Udacity CS271). Peter Norvig and Sebastian Thrun. Deep Learning. Yoshua Bengio, Ian Goodfellow, Aaron Courville. Controlling the false discovery rate. Alex Reinhart. 
The p value and the base rate fallacy. Alex Reinhart. Toward Evidence-Based Medical Statistics. 1: The P Value Fallacy, Steven N. Goodman, MD, PhD Misinterpretations of Significance: A Problem Students Share with Their Teachers?, Heiko Haller & Stefan Krauss Statistics Done Wrong, Alex Reinhart Stevens, S. S. (1946). On the theory of scales of measurement. CHAPTER 6. STATISTICS 193 6.5. REFERENCES 194 194 CHAPTER 6. STATISTICS 195 7 Bayesian Statistics Bayesian statistics is an approach to statistics contrasted with frequentist approaches. As is with frequentist statistical inference, Bayesian inference is concerned with estimating parameters from some observed data. However, whereas frequentist inference returns point estimates - that is, single values - for these parameters, Bayesian inference instead expresses these parameters themselves as probability distributions. This is intuitively appealing as we are uncertain about the parameters we’ve inferred; with Bayesian inference we can represent this uncertainty. This is to say that in Bayesian inference, we don’t assign an explicit value to an unknown parameter. Rather, we define it over a probability distribution as well: what values is the parameter likely to take on? That is, we treat the parameter itself as a random variable. We may say for instance that an unknown parameter θ is drawn from an exponential distribution: θ ∼ Exp(α) Here α is a hyperparameter, that is, it is a parameter for our parameter θ. Fundamentally, this is Bayesian inference: P (θ|X) Where the parameters θ are the unknown, so we express them as a probability distribution, given the observations X. This probability distribution is the posterior distribution. So we must decide (specify) probability distributions for both the data sample and for the unknown parameters. These decisions involve making a lot of assumptions. Then you must compute a posterior distribution, which often cannot be calculated analytically - so other methods are used (such as simulations, described later). CHAPTER 7. BAYESIAN STATISTICS 195 7.1. BAYES’ THEOREM 196 From the posterior distribution, you can calculate point estimates, credible intervals, quantiles, and make predictions. Finally, because of the assumptions which go into specifying the initial distributions, you must test your model and see if it fits the data and seems reasonable. Thus Bayesian inference amounts to: 1. Specifying a sampling model for the observed data X, conditioned on the unknown parameter θ (which we treat as a random variable), such that X ∼ f (X|θ), where f (X|θ) is either the PDF or the PMF (as appropriate). 2. Specifying a marginal or distribution π(θ) for θ, which is the prior distribution (“prior” for short): θ ∼ π(θ) 3. From this we wish to compute the posterior, that is, uncover the distribution for θ given the π(θ)L(θ|X) observed data X, like so: π(θ|X) = ∫ π(θ)L(θ|X)dθ , where L(θ|X) ∝ f (θ|X) in θ, called the likelihood of θ given X. More often than not, the posterior must be approximated through Markov Chain Monte Carlo (detailed later). 7.0.1 Frequentist vs Bayesian approaches For frequentists, probability is thought of in terms of frequencies, i.e. the probability of the event is the amount of times it happened over the total amount of times it could have happened. In frequentist statistics, the observed data is considered random; if you gathered more observations they would be different according to the underlying distribution. The parameters of the model, however, are considered fixed. 
For Bayesians, probability is belief or certainty about an event. Observed data is considered fixed, but the model parameters are random (uncertain) and are considered to be drawn from some probability distribution. Another way of phrasing this is that frequentists are concerned with uncertainty in the data, whereas Bayesians are concerned with uncertainty in the parameters.

7.1 Bayes' Theorem

In frequentist statistics, many different estimators may be used, but in Bayesian statistics the only estimator is Bayes' Formula (aka Bayes' Rule or Bayes' Theorem).

Bayes' Theorem, aka Bayes' Rule:

$$P(H|D) = \frac{P(H)P(D|H)}{P(D)}$$

• H is the hypothesis (more commonly represented as the parameters θ)
• D is the data
• P(H) = the probability of the hypothesis before seeing the data. The prior.
• P(H|D) = the probability of the hypothesis, given the data. The posterior.
• P(D|H) = the probability of the data under the hypothesis. The likelihood.
• P(D) = the probability of the data under any hypothesis. The normalizing constant.

For an example of likelihood: if I want to predict the number of sides of the die I rolled, and I rolled an 8, then P(D|a six-sided die) = 0. That is, it is impossible to have observed my data under the hypothesis of a six-sided die.

A key insight to draw from Bayes' Rule is that P(H|D) ∝ P(H)P(D|H); that is, the posterior is proportional to the product of the prior and the likelihood.

Note that the normalizing constant P(D) usually cannot be directly computed. It is equivalent to ∫P(D|H)P(H)dH, which is usually intractable since there are usually multiple parameters of interest, resulting in a multidimensional integration problem (if θ, the parameters, is one-dimensional, then you could integrate it rather easily). One workaround is to do approximate inference with non-normalized posteriors, since we know that the posterior is proportional to the numerator term:

P(H|D) ∝ P(H)P(D|H)

Another workaround is to approximate the posterior using simulation methods such as Monte Carlo.

Given a set of hypotheses H0, H1, . . . , Hn, the distribution of the priors of these hypotheses is the prior distribution, i.e. P(H0), P(H1), . . . , P(Hn). The distribution of the posterior probabilities is the posterior distribution, i.e. P(H0|D), P(H1|D), . . . , P(Hn|D).

7.1.1 Likelihood

Likelihood is not the same as probability (thus it does not have to sum to 1), but it is proportional to probability. More specifically, the likelihood of a hypothesis H given some data D is proportional to the probability of D given that H is true:

L(H|D) = kP(D|H)

where k is a constant such that k > 0.

With the probability P(D|H), we fix H and allow D to vary. In the case of likelihood, this is reversed: we fix D and allow the hypotheses to vary.

The Law of Likelihood states that the hypothesis for which the probability of the data is greater is the more likely hypothesis. For example, H1 is a better hypothesis than H2 if P(D|H1) > P(D|H2). We can also quantify how much better H1 is than H2 with the ratio of their likelihoods, L(H1|D)/L(H2|D), which is proportional to P(D|H1)/P(D|H2).

Likelihoods are meaningless in isolation (because of the constant k); they must be compared to other likelihoods, such that the constants cancel out (i.e. as ratios like the example above), to be meaningful.
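A small illustrative sketch of such a likelihood ratio (the coin-flip data and the two candidate biases are made up for this example; scipy.stats is assumed to be available):

from scipy.stats import binom

# Observed data: 7 heads out of 10 flips.
n, k = 10, 7

# Two hypotheses about the probability of heads.
p_fair, p_biased = 0.5, 0.7

# The binomial PMF serves as P(D|H) here, so the ratio of the two
# values is (proportional to) the likelihood ratio L(H_biased|D) / L(H_fair|D).
L_fair = binom.pmf(k, n, p_fair)
L_biased = binom.pmf(k, n, p_biased)
print("likelihood ratio:", L_biased / L_fair)  # > 1 favors the biased-coin hypothesis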
A Bayes factor is an extension of likelihood ratios: it is a weighted average likelihood ratio based on the prior distribution of hypotheses. So we have some prior bias as to what hypotheses we expect, i.e. how probable we expect some hypotheses to be, and we weigh the likelihood ratios by these expected probabilities. 7.2 Choosing a prior distribution With Bayesian inference, we must choose a prior distribution, then apply data to get our posterior distribution. The prior is chosen based on domain knowledge or intuition or perhaps from the results of previous analysis; that is, it is chosen subjectively - there is no prescribed formula for picking a prior. If you have no idea what to pick, you can just pick a uniform distribution as your prior. Your choice of prior will affect the posterior that you get, and the subjectivity of this choice is what makes Bayesian statistics controversial - but it’s worth noting that all of statistics, whether or frequentist or Bayesian, involves many subjective decisions (e.g. frequentists must decide on an estimator to use, what data to collect and how, and so on) - what matters most is that you are explicit about your decisions and why you made them. Say we perform an Bayesian analysis and get a posterior. Then we get some new data for the same problem. We can re-use the posterior from before as our prior, and when we run Bayesian analysis on the new data, we will get a new posterior which reflects the additional data. We don’t have to re-do any analysis on the data from before, all we need is the posterior generated from it. For any unknown quantity we want to model, we say it is drawn from some prior of our choosing. This is usually some parameter describing a probability distribution, but it could be other values as well. This is central to Bayesian statistics - all unknowns are represented as distributions of possible values. In Bayesian statistics: if there’s a value and you don’t know what it is, come up with a prior for it and add it to your model! If you think of distributions as landscapes or surfaces, then the data deforms the prior surface to mold it into the posterior distribution. The surface’s “resistance” to this shaping process depends on the selected prior distribution. When it comes to selecting Bayesian priors, there are two broad categories: • objective priors - these let the data influence the posterior the most • subjective priors - these allow the practitioner to asset their own views in to the prior. This prior can be the posterior from another problem or just come from domain knowledge. An example objective prior is a uniform (flat) prior where every value has equal weighting. Using a uniform prior is called The Principle of Indifference. Note that a uniform prior restricted within a range is not objective - it has to be over all possibilities. Note that the more data you have (as N increases), the choice of prior becomes less important. 198 CHAPTER 7. BAYESIAN STATISTICS 199 7.2.1 7.2. CHOOSING A PRIOR DISTRIBUTION Conjugate priors Conjugate priors are priors which, when combined with the likelihood, result in a posterior which is in the same family. These are very convenient because the posterior can be calculated analytically, so there is no need to use approximation such as Markov Chain Monte Carlo (see below). For example, a binomial likelihood is a conjugate with a beta prior - their combination results in a beta-binomial posterior. 
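As a quick numerical sanity check of that beta-binomial conjugacy (a sketch, not from the notes; the prior parameters and data are arbitrary), multiplying a Beta prior by a binomial likelihood on a grid and normalizing should reproduce the closed-form Beta posterior:

import numpy as np
from scipy.stats import beta, binom

a, b = 2.0, 3.0   # Beta prior parameters (arbitrary for this check)
N, k = 20, 6      # 6 successes in 20 trials

p = np.linspace(0.001, 0.999, 999)
dp = p[1] - p[0]
unnorm = beta.pdf(p, a, b) * binom.pmf(k, N, p)   # prior x likelihood
numeric_post = unnorm / (unnorm.sum() * dp)       # normalize on the grid
analytic_post = beta.pdf(p, a + k, b + N - k)     # conjugate closed form

print(np.allclose(numeric_post, analytic_post, atol=1e-3))  # True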
For example, the Gaussian family of distributions is conjugate to itself (self-conjugate) - a Gaussian likelihood with a Gaussian prior results in a Gaussian posterior. For example, when working with count data you will probably use the Poisson distribution for your likelihood, which is conjugate with gamma distribution priors, resulting in a gamma posterior.

Unfortunately, conjugate priors only really show up in simple one-dimensional models.

More generally, we can define a conjugate prior like so: say the random variable X comes from a well-known distribution fα, where α are the possibly unknown parameters of f. It could be a normal, binomial, etc. distribution. For the given distribution fα, there may exist a prior distribution pβ such that

$$\underbrace{p_\beta}_{\text{prior}} \cdot \underbrace{f_\alpha(X)}_{\text{data}} = \underbrace{p_{\beta'}}_{\text{posterior}}$$

Beta-Binomial Model

The Beta-Binomial model is a useful Bayesian model because it provides values between 0 and 1, which is useful for estimating probabilities or percentages. It involves, as you might expect, a beta and a binomial distribution.

So say we have N trials and observe n successes. We describe these observations by a binomial distribution, n ∼ Bin(N, p), for which p is unknown. So we want to come up with some distribution for p (remember, with Bayesian inference, you do not produce point estimates, that is, a single value, but a distribution for your unknown value to describe the uncertainty of its true value). For frequentist inference we'd estimate p̂ = n/N, which isn't very good for low numbers of N.

This being Bayesian inference, we first must select a prior. p is a probability and therefore is bound to [0, 1]. So we could choose a uniform prior over that interval; that is, p ∼ Uniform(0, 1). However, Uniform(0, 1) is equivalent to a beta distribution where α = 1, β = 1, i.e. Beta(1, 1). The beta distribution is bound between 0 and 1, so it's a good choice for estimating probabilities. We prefer a beta prior over a uniform prior because, given binomial observations, the posterior will also be a beta distribution. It works out nicely mathematically:

p ∼ Beta(α, β)
n ∼ Bin(N, p)
p | n, N ∼ Beta(α + n, β + N − n)

So with these two distributions, we can directly compute the posterior with no need for simulation (e.g. MCMC).

How do you choose the parameters for a Beta prior? Well, it depends on the particular problem, but a conservative one, for when you don't have a whole lot of information to go on, is Beta(1/2, 1/2), known as Jeffreys' prior.

Example

We run 100 trials and observe 10 successes. What is the probability p of a successful trial? Our knowns are N = 100, n = 10. A binomial distribution describes these observations, but we have the unknown parameter p. For our prior for p we choose Beta(1, 1) since it is equivalent to a uniform prior over [0, 1] (i.e. it is an objective prior). We can directly compute the posterior now:

p | n, N ∼ Beta(α + n, β + N − n)
p | n, N ∼ Beta(11, 91)

Then we can draw samples from the distribution and compute its mean or other descriptive statistics such as the credible interval.

7.2.2 Sensitivity Analysis

The strength of the prior affects the posterior - the stronger your prior beliefs, the more difficult it is to change those beliefs (it requires more data/evidence). You can conduct sensitivity analysis, trying your approach with various different priors, to get an idea of how different priors affect your resulting posterior.
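As a sketch of both the worked example and a small sensitivity analysis (assuming scipy.stats is available; the "strong" prior below is just an illustrative choice), we can compute the conjugate posterior under a few different Beta priors and compare:

from scipy.stats import beta

N, n = 100, 10  # trials, successes

priors = {"flat Beta(1,1)": (1, 1),
          "Jeffreys Beta(0.5,0.5)": (0.5, 0.5),
          "strong Beta(50,50)": (50, 50)}

for name, (a, b) in priors.items():
    post = beta(a + n, b + N - n)   # conjugate update: Beta(a + n, b + N - n)
    lo, hi = post.interval(0.95)    # central 95% credible interval
    print(f"{name}: mean={post.mean():.3f}, 95% interval=({lo:.3f}, {hi:.3f})")

With the flat prior this reproduces the Beta(11, 91) posterior above; the strong Beta(50, 50) prior pulls the posterior mean noticeably toward 0.5, which is exactly what sensitivity analysis is meant to reveal.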
7.2.3 Empirical Bayes

Empirical Bayes is a method which combines frequentist and Bayesian approaches by using frequentist methods to select the hyperparameters. For instance, say you want to estimate the µ parameter for a normal distribution. You could use the empirical sample mean from the observed data:

$$\mu_p = \frac{1}{N}\sum_{i=0}^{N} X_i$$

where µp denotes the prior µ. Though if you are working with not much data, this ends up being a bit like double-counting your data.

7.3 Markov Chain Monte Carlo (MCMC)

With Bayesian inference, in order to describe your posterior, you often must evaluate complex multidimensional integrals (i.e. from very complex, multidimensional probability distributions), which can be computationally intractable. Instead you can generate sample points from the posterior distribution and use those samples to compute whatever descriptions you need. This technique is called Monte Carlo integration, and the process of drawing repeated random samples in this way is called Monte Carlo simulation. In particular, we can use a family of techniques known as Markov Chain Monte Carlo, which combine Monte Carlo integration and simulation with Markov chains, to generate samples for us.

7.3.1 Monte Carlo Integration

Monte Carlo integration is a way to approximate complex integrals using random number generation. Say we have a complex integral:

$$\int h(x)\,dx$$

If we can decompose h(x) into the product of a function f(x) and a probability density function P(x) describing the probabilities of the inputs x, then:

$$\int h(x)\,dx = \int f(x)P(x)\,dx = E_{P(x)}[f(x)]$$

That is, the result of this integral is the expected value of f(x) over the density P(x). We can approximate this expected value by taking the mean of many, many samples (n samples):

$$\int h(x)\,dx = E_{P(x)}[f(x)] \approx \frac{1}{n}\sum_{i=1}^{n} f(x_i)$$

This process of approximating the integral is Monte Carlo integration. For very simple cases of known distributions, we can sample directly, e.g.

import numpy as np

# Say we think the distribution is a Poisson distribution
# and the parameter of our distribution, lambda,
# is unknown and what we want to discover.
lam = 5

# Collect 100000 samples
sim_vals = np.random.poisson(lam, size=100000)

# Get whatever descriptions we want, e.g. the mean
mean = sim_vals.mean()

# For a Poisson distribution, the mean is lambda, so we expect
# them to be approximately equal (given a large enough sample size)
abs(lam - mean) < 0.1

7.3.2 Markov Chains

Markov chains are a stochastic process in which the next state depends only on the current state. Consider a random variable X and a time index t. The state of X at time t is notated Xt. For a Markov chain, the state Xt+1 depends only on the current state Xt, that is:

P(Xt+1 = xt+1 | Xt = xt, Xt−1 = xt−1, . . . , X0 = x0) = P(Xt+1 = xt+1 | Xt = xt)

where P(Xt+1 = xt+1) is the transition probability of Xt+1 = xt+1. The collection of transition probabilities is called a transition matrix (for discrete states); more generally it is called a transition kernel.

If we consider t going to infinity, the Markov chain settles on a stationary distribution, where P(Xt) = P(Xt−1). The stationary distribution does not depend on the initial state of the chain. Markov chains are ergodic, i.e. they "mix", which means that the influence of the initial state weakens with time (the rate at which it mixes is its mixing speed).
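A small sketch of this mixing behavior (the two-state transition matrix below is arbitrary, chosen only for illustration): a chain started from either state ends up visiting the states in the same long-run proportions.

import numpy as np

# Transition probabilities for a two-state chain (states 0 and 1).
# P[i][j] = probability of moving from state i to state j.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

def simulate(start, steps=50000, seed=0):
    rng = np.random.default_rng(seed)
    state, visits = start, np.zeros(2)
    for _ in range(steps):
        state = rng.choice(2, p=P[state])
        visits[state] += 1
    return visits / steps

# Regardless of the starting state, the fraction of time spent in each state
# approaches the same stationary distribution (about [0.833, 0.167] here).
print(simulate(start=0))
print(simulate(start=1))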
If we call the k × k transition matrix P and the marginal probability of a state at time t is a k × 1 vector π, then the distribution of the state at time t + 1 is π ′ P . If π ′ P = π ′ , then pi is the stationary distribution of the Markov chain. 7.3.3 Markov Chain Monte Carlo MCMC is useful because often we may encounter distributions which aren’t easily expressed mathematically (e.g. their functions may have very strange shapes), but we still want to compute some 202 CHAPTER 7. BAYESIAN STATISTICS 203 7.3. MARKOV CHAIN MONTE CARLO (MCMC) descriptive statistics (or make other computations) from them. MCMC allows us to work with such distributions without needing precise mathematical formulations of them. More generally, MCMC is really useful if you don’t want to (or can’t) find the underlying function describing something. As long as you can simulate that process in some way, you don’t need to know the exact function - you can just generate enough sample data to work with in its stead. So MCMC is a brute force but effective method. Rather than directly compute the integral for posterior distributions in Bayesian analysis, we can instead use MCMC to draw several (thousands, millions, etc) samples from the probability distribution, then use these samples to compute whatever descriptions we’d like about the distribution (often this is some expected value of a function, E[f (x)], where its inputs are drawn from distribution, i.e. x ∼ p, where p is some probability distribution). You start with some random initial sample and, based on that sample, you pick a new sample. This is the Markov Chain aspect of MCMC - the next sample you choose depends only on the current sample. This works out so that you spend most your time with high probability samples (b/c they have higher transition probabilities) but occasionally jump out to lower probability samples. Eventually the MCMC chain will converge on a random sample. So we can take all these N samples and, for example, compute the expected value: E[f (x)] ≈ N 1∑ f (xi ) N i=1 Because of the random initialization, there is a “burn-in” phase in which the sampling model needs to be “warmed up” until it reaches an equilibrium sampling state, the stationary distribution. So you discard the first hundred or thousand or so samples as part of this burn-in phase. You can (eventually) arrive at this stationary distribution independent of where you started which is why the random initialization is ok - this is an important feature of Markov Chains. MCMC is a general technique of which there are several algorithms. Rejection Sampling Monte Carlo integration allows us to draw samples from a posterior distribution with a known parametric form. It does not, however, enable us to draw samples from a posterior distribution without a known parametric form. We may instead use rejection sampling in such cases. We can take our function f (x) and if it has bounded/finite support (“support” is the x values where f (x) is non-zero, and can be thought of the range of meaningful x values for f (x)), we can calculate its maximum and then define a bounding rectangle with it, encompassing all of the support values. This envelope function should contain all possible values of f (x) Then we can randomly generate points from within this box and check if they are under the curve (that is, less than f (x) for the point’s x value). If a point is not under the curve, we reject it. 
Thus we approximate the integral like so:

$$\frac{\text{points under curve}}{\text{points generated}} \times \text{box area} \;\to\; \int_A^B f(x)\,dx \quad \text{as } n \to \infty$$

In the case of unbounded support (i.e. infinite tails), we instead choose some majorizing or enveloping function g(x) (g(x) is typically a probability density itself and is called a proposal density) such that cg(x) ≥ f(x) for all x ∈ (−∞, ∞), where c is some constant. This functions like the bounding box from before: it completely encloses f. Ideally we choose g(x) so that it is close to the target distribution, so that most of our sampled points can be accepted.

Then, for each xi we draw (i.e. sample), we also draw a uniform random value ui. If ui < f(xi)/(cg(xi)), we accept xi; otherwise, we reject it.

The intuition here is that the probability of a given point being accepted is proportional to the function f at that point, so when there is greater density in f for that point, that point is more likely to be accepted.

In multidimensional cases, you draw candidates from every dimension simultaneously.

Metropolis-Hastings

The Metropolis-Hastings algorithm uses Markov chains with rejection sampling. The proposal density g(θt) is chosen as in rejection sampling, but it depends on θt−1, i.e. g(θt|θt−1).

First select some initial θ, θ1. Then for n iterations:

• Draw a candidate θtc ∼ g(θt|θt−1)
• Compute the Metropolis-Hastings ratio: $R = \frac{f(\theta_t^c)\,g(\theta_{t-1}|\theta_t^c)}{f(\theta_{t-1})\,g(\theta_t^c|\theta_{t-1})}$
• Draw u ∼ Uniform(0, 1)
• If u < R, accept θt = θtc; otherwise, θt = θt−1

There are a few required properties of the Markov chain for this to work properly:

• The stationary distribution of the chain must be the target density:
  – The chain must be recurrent - that is, for all θ ∈ Θ in the target density (the density we wish to approximate), the probability of eventually returning to any state θi ∈ Θ is 1. That is, it must be possible for any state in the state space to eventually be reached.
  – The chain must be non-null for all θ ∈ Θ in the target density; that is, the expected time to recurrence is finite.
  – The chain must have a stationary distribution equal to the target density.
• The chain must be ergodic, that is:
  – The chain must be irreducible - that is, any state θi can be reached from any other state θj in a finite number of transitions (i.e. the chain should not get stuck in any infinite loops).
  – The chain must be aperiodic - that is, there should not be a fixed number of transitions to get from any state θi to any state θj. For instance, it should not always take three steps to get from one place to another - that would be a period. Another way of putting this: there are no fixed cycles in the chain.

It can be proven that the stationary distribution of the Metropolis-Hastings algorithm is the target density (proof omitted).

The ergodic property (whether or not the chain "mixes" well) can be validated with some convergence diagnostics. A common method is to plot the chain's values as they are drawn and see if the values tend to concentrate around a constant; if not, you should try a different proposal density. Alternatively, you can look at an autocorrelation plot, which measures the internal correlation (from -1 to 1) over time, called "lag". We expect that the greater the lag, the less the points should be autocorrelated - that is, we expect autocorrelation to smoothly decrease to 0 with increasing lag.
If autocorrelation remains high, then the chain is not fully exploring the space. Autocorrelation can be improved by thinning, which is a technique where only every kth draw is kept and the others are discarded. Finally, you also have the option of running multiple chains, each with different starting values, and combining those samples. You should also use burn-in.

Gibbs Sampling

It is easy to sample from simple distributions. For example, for a binomial distribution, you can basically just flip a coin. For a multinomial distribution, you can basically just roll a die. If you have a multinomial, multivariate distribution, e.g. P(x1, x2, . . . , xn), things get more complicated. If the variables are independent, you can factorize the multivariate distribution as a product of univariate distributions, treating each as a univariate multinomial distribution, i.e. P(x1, x2, . . . , xn) = P(x1) × P(x2) × · · · × P(xn). Then you can just sample from each distribution individually, i.e. as a die roll.

However - what if these aren't independent, and we want to sample from the joint distribution P(x1, x2, . . . , xn)? We can't factorize it into simpler distributions like before. With Gibbs sampling we can approximate this joint distribution under the condition that we can easily sample from the conditional distribution for each variable, i.e. P(xi | x1, . . . , xi−1, xi+1, . . . , xn). (This condition is satisfied on Bayesian networks.)

We take advantage of this and iteratively sample from these conditional distributions, using the most recent value for each of the other variables (starting with random values at first). For example, sampling x1|x2, . . . , xn, then fixing this value for x1 while sampling x2|x1, x3, . . . , xn, then fixing both x1 and x2 while sampling x3|x1, x2, x4, . . . , xn, and so on. If you iterate through this a large number of times you get an approximation of samples taken from the actual joint distribution. A code sketch of this procedure follows the example below.

Another way to look at Gibbs sampling:

Say you have random variables c, r, t (cloudy, raining, thundering) and you have the following probability tables:

c  P(c)
0  0.5
1  0.5

c  r  P(r|c)
0  0  0.9
0  1  0.1
1  0  0.1
1  1  0.9

c  r  t  P(t|c,r)
0  0  0  0.9
0  0  1  0.1
0  1  0  0.5
0  1  1  0.5
1  0  0  0.6
1  0  1  0.4
1  1  0  0.1
1  1  1  0.9

We can first pick some starting sample, e.g. c = 1, r = 0, t = 1. Then we fix r = 0, t = 1 and randomly pick another c value according to the probabilities in the table (here it is equally likely that we get c = 0 or c = 1). Say we get c = 0. Now we have a new sample c = 0, r = 0, t = 1.

Now we fix c = 0, t = 1 and randomly pick another r value. Here r is dependent only on c. c = 0, so we have a 0.9 probability of picking r = 0. Say that we do. We have another sample c = 0, r = 0, t = 1, which happens to be the same as the previous sample.

Now we fix c = 0, r = 0 and pick a new t value. t is dependent on both c and r. c = 0, r = 0, so we have a 0.9 chance of picking t = 0. Say that we do. Now we have another sample c = 0, r = 0, t = 0.

Then we repeat this process until convergence (or for some specified number of iterations). Your samples will reflect the actual joint distribution of these values, since more likely samples are, well, more likely to be generated.
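Here is a rough sketch of a Gibbs sampler for these three variables, sampling each one from its conditional given the current values of the others (as in the formal description above); the conditionals are computed by normalizing the joint over the two possible values of the variable being resampled:

import numpy as np

rng = np.random.default_rng(0)

# CPDs from the tables above.
P_c = {0: 0.5, 1: 0.5}                      # P(c)
P_r = {0: {0: 0.9, 1: 0.1},                 # P(r|c) as P_r[c][r]
       1: {0: 0.1, 1: 0.9}}
P_t = {(0, 0): {0: 0.9, 1: 0.1},            # P(t|c,r) as P_t[(c,r)][t]
       (0, 1): {0: 0.5, 1: 0.5},
       (1, 0): {0: 0.6, 1: 0.4},
       (1, 1): {0: 0.1, 1: 0.9}}

def joint(c, r, t):
    return P_c[c] * P_r[c][r] * P_t[(c, r)][t]

def sample_conditional(var, state):
    # Sample `var` given the other variables by normalizing the joint
    # over var's two possible values.
    probs = []
    for v in (0, 1):
        s = dict(state, **{var: v})
        probs.append(joint(s['c'], s['r'], s['t']))
    probs = np.array(probs) / sum(probs)
    return rng.choice([0, 1], p=probs)

state = {'c': 1, 'r': 0, 't': 1}   # arbitrary starting sample
samples = []
for i in range(5000):
    for var in ('c', 'r', 't'):
        state[var] = sample_conditional(var, state)
    if i >= 500:                   # discard burn-in
        samples.append(dict(state))

# Empirical P(r=1) from the Gibbs samples; by symmetry of the tables it should be near 0.5.
print(np.mean([s['r'] for s in samples]))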
7.4 Variational Inference

MCMC can take a long time to get good answers - in theory, if you run it infinitely it will generate enough samples to get a perfectly accurate distribution, but that's not a fair criterion (many algorithms do well if they have infinite time).

With variational inference we don't need to take samples - instead we fit an approximate joint distribution Q(x; θ) to approximate the true joint posterior P(x), turning inference into an optimization problem (we try to get them as close as possible according to the KL divergence KL[Q(x; θ)||P(x)]). So we are interested in the parameters θ.

The mean-field form of variational inference assumes that Q factorizes into independent single-variable factors, i.e. $Q(x) = \prod_i Q_i(x_i|\theta_i)$.

7.5 Bayesian point estimates

Bayesian inference returns a distribution (the posterior), but we often need a single value (or a vector in multivariate cases). So we choose a value from the posterior. This value is a Bayesian point estimate. Selecting the MAP (maximum a posteriori) value is insufficient because it neglects the shape of the distribution.

Suppose P(θ|X) is the posterior distribution of θ after observing data X. The expected loss of choosing estimate θ̂ to estimate θ (the true parameter), also known as the risk of estimate θ̂, is:

l(θ̂) = Eθ[L(θ, θ̂)]

where L(θ, θ̂) is some loss function.

You can approximate the expected loss using the Law of Large Numbers, which states that as the sample size grows, the sample average approaches the expected value. That is, as N grows, the average loss over N draws θi from the posterior approaches the expected loss:

$$\frac{1}{N}\sum_{i=1}^{N} L(\theta_i, \hat{\theta}) \approx E_\theta[L(\theta, \hat{\theta})] = l(\hat{\theta})$$

You want to select the estimate θ̂ which minimizes this expected loss:

$$\underset{\hat{\theta}}{\operatorname{argmin}}\; E_\theta[L(\theta, \hat{\theta})]$$

7.6 Credible Intervals (Credible Regions)

The closest Bayesian analog to confidence intervals in frequentist statistics is the credible interval. It is much easier to interpret than the confidence interval because it is exactly what most people mistake the confidence interval to be. For instance, the 95% credible interval is the interval in which we expect to find θ 95% of the time. Mathematically this is expressed as:

P(a(y) < θ < b(y) | Y = y) = 0.95

We condition on Y because in Bayesian statistics, the data is fixed and the parameters are random.

7.7 Bayesian Regression

The Bayesian methodology can be applied to regression as well. In conventional regression the parameters are treated as fixed values that we uncover. In Bayesian regression, the parameters are treated as random variables, as they are elsewhere in Bayesian statistics. We define prior distributions for each parameter - in particular, normal priors, so that for each parameter we define a prior mean as well as a covariance matrix for all the parameters.

So we specify:

• b0 - a vector of prior means for the parameters
• B0 - a covariance matrix such that σ²B0 is the prior covariance matrix of β
• v0 > 0 - the degrees of freedom for the prior
• σ0² > 0 - the variance for the prior (which essentially functions as your strength of belief in the prior - the lower the variance, the more concentrated your prior is around the mean, and thus the stronger your belief)

So the prior for your parameters is a normal distribution parameterized by (b0, B0). Then v0 and σ0² give a prior for σ², which is an inverse gamma distribution parameterized by (v0, σ0²v0).
Then there are a few formulas: 208 CHAPTER 7. BAYESIAN STATISTICS 209 7.8. A BAYESIAN EXAMPLE b1 = (B0−1 + X ′ X)−1 (B0−1 b0 + X ′ X β̂) B1 = (B0−1 + X ′ X)−1 v1 = v0 + n v1 σ12 = v0 σ02 + S + r S = sum of squared errors of the regression r = (b0 − β̂)′ (B0 + (X ′ X)−1 )−1 (b0 − β̂) f (β | σ 2 , y , x) = Φ(b1 , σ 2 B1 ) f (σ 2 | y , x) = inv.gamma( f (β | y , x) = ∫ v1 v1 σ12 , ) 2 2 f (β | σ 2 , y , x)f (σ 2 | y , x)dσ 2 = t(b1 , σ12 B1 , degrees of freedom = v1 ) So the resulting distribution of parameters is a multivariate t distribution. 7.8 A Bayesian example Let’s say we have a coin. We are uncertain whether or not it’s a fair coin. What can we learn about the coin’s fairness from a Bayesian approach? Let’s restate the problem. We can represent the outcome of a coin flip with a random variable, X. If the coin is not fair, we expect to see heads 100% of the time. That is, if the coin is unfair, P (X = heads) = 1. Otherwise, we expect it to be around P (X = heads) = 0.5. It’s reasonable to assume that X is drawn from a binomial distribution, so we’ll use that. The binomial distribution is parameterized by n, the number of trials, and p, the probability of a “success” (in this case, a heads), on a given flip. We can restate our previous statements about the coin’s fairness in terms of this parameter p. That is, if the coin is unfair, we expect p = 1, otherwise, we expect it to be around p = 0.5. Thus p is the unknown parameter we are interested in, and with the Bayesian approach, we consider it a random variable as well; i.e. drawn from some distribution. First we must state what we believe this distribution to be prior to any evidence (i.e. decide on a prior to use). Because p is a probability, the beta distribution seems like a good choice since it is bound to [0, 1] like a probability. The beta distribution has the additional advantage of being a conjugate prior, so the posterior is analytically derived and requires no simulation. The beta distribution is parameterized by α and β (i.e. they are our hyperparameters, Beta(α, β)). Here we can choose values for α and β depending on how we choose to proceed. Let’s be conservative and use an uninformative prior, that is, a uniform/flat prior, acting as if we don’t feel strongly about the coin’s bias either way prior to flipping the coin. The beta distribution Beta(1, 1) is flat. The posterior for a beta prior will not be derived here, but it is Beta(α + k, β + (n − k)), where k is the number of successes (heads) in our evidence, and n is the total number of trials in our evidence. Now we can flip the coin a few times to gather our evidence. CHAPTER 7. BAYESIAN STATISTICS 209 7.8. A BAYESIAN EXAMPLE 210 Below are some illustrations of possible evidence with the prior and the resulting posterior. Some possible outcomes with a flat prior A few things to note here: • When the evidence has even amounts of tails and heads, the posterior centers around p = 0.5. • When the evidence has even one tail, the possibility of p = 1 drops to nothing. • When the evidence has no tails, the posterior places more weight on an unfair coin, but there is still some possibility of p = 0.5. As the number of evidence increases, however, and still no tails show up, the posterior will have even more weight pushed towards p = 1. • When the is a lot of evidence containing even amounts of tails and heads, there is greater confidence that p = 0.5 (that is, there’s smaller variance around it). 
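A small numerical sketch of the bullet points above (the particular outcomes are made up for illustration; scipy.stats is assumed): under the flat Beta(1, 1) prior the posterior is Beta(1 + heads, 1 + tails), so we can just print its mean and spread for a few outcomes.

from scipy.stats import beta

outcomes = {"5 heads, 5 tails": (5, 5),
            "8 heads, 0 tails": (8, 0),
            "50 heads, 50 tails": (50, 50)}

for label, (heads, tails) in outcomes.items():
    post = beta(1 + heads, 1 + tails)   # posterior under the flat prior
    print(f"{label}: posterior mean={post.mean():.3f}, std={post.std():.3f}")

Even counts of heads and tails center the posterior on 0.5, a streak of heads pushes the mean toward 1, and more data shrinks the posterior's standard deviation, matching the observations listed above.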
What if instead of a flat prior, we had assumed that the coin was fair to begin with? In this scenario, the α and β values function like counts for heads and tails. So to assume a fair coin we could say, 210 CHAPTER 7. BAYESIAN STATISTICS 211 7.8. A BAYESIAN EXAMPLE α = 10, β = 10. If we have a really strong belief that it is a fair coin, we could say α = 100, β = 100. The higher these values are, the stronger our belief. Some possible outcomes with a informative prior Since our prior belief is stronger than it was with a flat prior, the same amount of evidence doesn’t change the prior belief as much. For instance, now if we see a streak of heads, we are less convinced it is unfair. In either case, we could take the expected value of p’s posterior distribution as our estimate for p, and then use that as evidence for a fair or unfair coin. 7.8.1 References • POLS 506: Simple Bayesian Models. Justin Esarey. • POLS 506: Basic Monte Carlo Procedures and Sampling. Justin Esarey. • POLS 506: Metropolis-Hastings, the Gibbs Sampler, and MCMC. Justin Esarey. CHAPTER 7. BAYESIAN STATISTICS 211 7.8. A BAYESIAN EXAMPLE • • • • • • • • • • • • • • • • 212 212 Markov chain Monte Carlo (MCMC) introduction. mathematicalmonk. Markov Chain Monte Carlo Without all the Bullshit. Jeremy Kun. http://homepages.dcc.ufmg.br/~assuncao/pgm/aulas2014/mcmc-gibbs-intro.pdf Markov Chain Monte Carlo and Gibbs Sampling. B. Walsh. Computational Methods in Bayesian Analysis. Chris Fonnesbeck. Think Bayes. Version 1.0.6. Allen Downey. Computational Statistics II (code). Chris Fonnesbeck. SciPy 2015. Bayesian Statistical Analysis. Chris Fonnesbeck. SciPy 2014. Probabilistic Programming and Bayesian Methods for Hackers. Cam Davidson Pilon. Frequentism and Bayesianism V: Model Selection. Jake Vanderplas. Understanding Bayes: A Look at the Likelihood. Alex Etz. High-Level Explanation of Variational Inference. Jason Eisner. Bayesian Deep Learning. Thomas Wiecki. A Tutorial on Variational Bayesian Inference. Charles Fox, Stephen Roberts. Variational Inference. David M. Blei. Probabilistic Programming Data Science with PyMC3. Thomas Wiecki. CHAPTER 7. BAYESIAN STATISTICS 213 8 Graphs A graph consists of vertices (nodes) connected by edges (arcs). The two main categories of graphs are undirected graphs, in which edges don’t have any particular direction, and directed graphs, where edges have direction - for example, there may be an edge from node A to node B, but no edge from node B to node A. A directed graph Generally, we use the following notation for a graph G: n is the number of vertices, i.e. |V |, m is the number of edges, i.e. |E|. Within undirected graphs there are other distinctions: • a simple graph is a graph in which there are no loops and only single edges are allowed • a regular graph is one in which each vertex has the same number of neighbors (i.e. the same degree) • a complete graph is a simple graph where every pair of vertices is connected by an edge • a connected graph is a graph where there exists a path between every pair of vertices CHAPTER 8. GRAPHS 213 214 An undirected graph For a connected graph with no parallel edges (i.e. each pair of vertices has only zero or one edge between it), m is somewhere between Ω(n) and O(n2 ). Generally, a graph is said to be sparse if m is O(n) or close to it (that is, it has the lower end of number of edges). If m is closer to O(n2 ), this is generally said to be a dense graph. An adjacency matrix requires Θ(n2 ) space. 
If the graph is sparse, this is a waste of space, and an adjacency list is more appropriate - you have an array of vertices and an array of edges. Each edge points to its endpoints, and each vertex points to the edges incident on it. This requires Θ(m + n) space (because the array of vertices takes Θ(n) space and the arrays of edges, edge-to-endpoints, and vertex-to-edges each take Θ(m), for Θ(n + 3m) = Θ(m + n)), so it is better for sparse graphs.

Consider the accompanying example graph.

Example graph

A path A → B is the sequence of nodes connecting A to B, including A and B. Here the path A → B is A, C, E, B. A, C, E are the ancestors of B. C, E, B are the descendants of A.

A cycle is a directed path which ends where it starts. Here, A, D, C, B form a cycle.

Example of a cyclic graph

A loop is any path (directed or not) which ends where it starts. A graph with no cycles is said to be acyclic. A chord is any edge which connects two non-adjacent nodes in a loop.

Directed acyclic graphs (DAGs) are used often. In a DAG, parents are the nodes which point to a given node; the nodes that a given node points to are its children. The family of a node is the node and its parents. The Markov blanket of a node is its parents, children, and the parents of its children (including itself).

For an undirected graph, a node's neighbors are the nodes directly connected to it. In an undirected graph, a clique is a fully-connected subset of nodes. All members of a clique are neighbors. A maximal clique is a clique which is not a subset of another clique.

A graph demonstrating cliques

In this graph:

• {A, B, C, D} is a maximal clique
• {B, C, E} is a maximal clique
• {A, B, C} is a non-maximal clique, contained in {A, B, C, D}

An undirected graph is connected if there is a path between every pair of nodes. That is, there is no isolated subgraph. For a non-connected graph, its connected components are its subgraphs which are connected.

A singly connected graph is a connected graph (directed or not) where there is only one path between each pair of nodes. This is equivalent to a tree. A non-singly connected graph is said to be multiply connected.

A spanning tree of an undirected graph is a subgraph which is a tree and which includes every vertex of the original graph. For a weighted graph, a maximum spanning tree is a spanning tree whose sum of edge weights is at least as large as that of any other spanning tree.

Spanning tree example. The graph on the right is a spanning tree of the graph on the left.

Graphs can be represented as an adjacency matrix:

A =
0 1 1 0
1 0 1 1
1 1 0 1
0 1 1 0

where a non-zero value at Aij indicates that node i is connected to node j.

A clique matrix represents the maximal cliques in a graph. For example, the following clique matrix, in which each column marks the members of one maximal clique, describes the maximal cliques of the accompanying graph:

C =
1 0
1 1
1 1
0 1

A maximal clique

A clique matrix containing only 2-node cliques is an incidence matrix:

C =
1 1 0 0 0
1 0 1 1 0
0 1 1 0 1
0 0 0 1 1

8.1 References

• Algorithms: Design and Analysis, Part 1. Tim Roughgarden. Stanford/Coursera.
• Bayesian Reasoning and Machine Learning. David Barber.
• Probabilistic Graphical Models. Daphne Koller. Stanford University/Coursera.

9 Probabilistic Graphical Models

The key tool for probabilistic inference is the joint probability table. Each row in a joint probability table describes a combination of values for a set of random variables. That is, say you have n events which have a binary outcome (T/F). A row would describe a unique configuration of these events, e.g.
if n = 4 then one row might be 0, 0, 0, 0 and another might be 1, 0, 0, 0 and so on. Consider the simpler case of n = 2, with binary random variables X, Y : X Y P(X,Y) 0 0 0.25 1 0 0.45 0 1 0.15 1 1 0.15 Using a joint probability table you can learn a lot about how those events are related probabilistically. The problem is, however, that joint probability tables can get very big, which is another way of saying that models (since joint probability tables are a representation of probabilistic models) can get complex very quickly. Typically, we have a set of random variables x1 , . . . , xn and we want to compute their probability for certain states together; that is, the joint distribution P (x1 , . . . , xn ). Even in the simple case where each random variable is binary, you would still have a distribution over 2n states. We can use probabilistic graphical models (PGMs) to reduce this space. Probabilistic graphical models allow us to represent complex networks of interrelated and independent events efficiently and CHAPTER 9. PROBABILISTIC GRAPHICAL MODELS 219 9.1. FACTORS 220 with sparse parameters. All graphical models have some limitations in their ability to graphically express conditional (in)dependence statements but are nevertheless very useful. There are two main types of graphical models: • Bayesian models: aka Bayesian networks, sometimes called Bayes nets or belief networks. These use directed graphs and are used when there are causal relationships between the random variables. • Markov models: These use undirected graphs and are used when there are noncausal relationships between the random variables. 9.1 Factors The concept of factors is important to PGMs. A factor is a function ϕ(X1 , . . . , Xk ) which takes all possible combinations of outcomes (assignments) for these random variables X1 , . . . , Xk and gives a real value for each combination. The set of random variables {X1 , . . . , Xk } is called the scope of the factor. A joint distribution is a factor which returns a number which is the probability of a given combination of assignments. An unnormalized measure is also a factor, e.g. P (I, D, g 1 ). A conditional probability distribution (CPD) is also a factor, e.g. P (G|I, D). A common operation on factors is a factor product. Say we have the factors ϕ1 (A, B) and ϕ2 (B, C). Their factor product would yield a new factor ϕ3 (A, B, C). The result for a given combo ai , bj , ck is just ϕ1 (ai , bj ) · ϕ2 (bj , ck ). Another operation is factor marginalization. This is the same as marginalization for probability distributions but generalized for all factors. For example, ϕ(A, B, C) → ϕ(A, B). Another operation is factor reduction which is similarly is a generalization of probability distribution reduction. 9.2 Belief (Bayesian) Networks Say we are looking at five events: • • • • • 220 a dog barking (D) a raccoon being present (R) a burglar being present (B) a trash can is heard knocked over (T ) the police are called (P ) CHAPTER 9. PROBABILISTIC GRAPHICAL MODELS 221 9.2. BELIEF (BAYESIAN) NETWORKS Belief Network We can encode some assumptions about how these events are related in a belief net (also called a Bayesian net): Every node is dependent on its parent and nothing else that is not a descendant. To put it another way: given its parent, a node is independent of all its non-descendants. For instance, the event P is dependent on its parent D but not B or R or T because their causality flows through D. 
D depends on B and R because they are its parents, but not T because it is not a descendant or a parent. But D may depend on P because it is a descendant. We can then annotate the graph with probabilities: Belief Network The B and R nodes have no parents so they have singular probabilities. The others depend on the outcome of their parents. With the belief net, we only needed to specify 10 probabilities. CHAPTER 9. PROBABILISTIC GRAPHICAL MODELS 221 9.2. BELIEF (BAYESIAN) NETWORKS 222 If we had just constructed joint probability table, we would have had to specify 25 = 32 probabilities (rows). If we expand out the conditional probability of this system using the chain rule, it would look like: P (p, d, b, t, r ) = P (p|d, b, t, r )P (d|b, t, r )P (b|t, r )P (t|r )P (r ) But we can bring in our belief net’s conditional independence assumptions to simplify this: P (p, d, b, t, r ) = P (p|d)P (d|b, r )P (b)P (t|r )P (r ) Belief networks are acyclical, that is, they cannot have any loops (a node cannot have a path back to itself). In particular, they are a directed acyclic graph (DAG). Two nodes (variables) in a Bayes net are on an active trail if a change in one node affects the other. This includes cases where the two nodes have a causal relationship, an evidential relationship, or have some common cause. Formally, a belief network is a distribution of the form: P (x1 , . . . , xD ) = D ∏ P (xi |pa(xi )) i=1 where pa(xi ) are the parental variables of variable x (that is, x’s parents in the graph). When you factorize a joint probability, you have a number of options for doing so. For instance: P (x1 , x2 , x3 ) = P (xi1 |xi2 , xi3 )P (xi2 |xi3 )P (xi3 ) where (i1 , i2 , i3 ) is any permutation of (1, 2, 3). Without any conditional independence assumptions, all factorizations produce an equivalent DAG. However, once you begin dropping edges (i.e. making conditional independence assumptions), the graphs are not necessarily equivalent anymore. Some of the graphs are equivalent; they can be converted amongst each other via Bayes’ rule. Others cannot be bridged in this way, and thus are not equivalent. Note that belief networks encode conditional independences but do not necessarily encode dependences. For instance, the graph a → b appears to mean that a and b are dependent. But there may be an instance of the belief network distribution such that p(b|a) = p(b); that is, a and b are independent. So although the DAG may seem to imply dependence, there may be cases where it in fact does not. 222 CHAPTER 9. PROBABILISTIC GRAPHICAL MODELS 223 9.2. BELIEF (BAYESIAN) NETWORKS In these cases, we call this implied dependence graphical dependence. The following belief network triple represents the conditional independence of X and Y given Z, that is P (X, Y |Z) = P (X|Z)P (Y |Z). Represents conditional independence between X and Y given Z The following belief network triple also represents the conditional independence of X and Y given Z, in particular, P (X, Y |Z) ∝ P (Z|X)P (X)P (Y |Z). Also represents conditional independence between X and Y given Z The following belief network triple represents the graphical conditional dependence of X and Y , that is P (X, Y |Z) ∝ P (Z|X, Y )P (X)P (Y ). Represents graphical conditional dependence of X and Y Here Z is a collider, since its neighbors are pointing to it. 
Generally, if there is a path between X and Y which contains a collider, and this collider is not in the conditioning set, nor are any of its descendants, we cannot induce dependence between X and Y from this path. We say such a path is blocked. CHAPTER 9. PROBABILISTIC GRAPHICAL MODELS 223 9.2. BELIEF (BAYESIAN) NETWORKS 224 Similarly, if there is a non-collider along the path which is in the conditioning set, we cannot induce dependence between X and Y from this path - such a path is also said to be blocked. If all paths between X and Y are blocked, we say they are d-separated. However, if there are no colliders, or the colliders that are there are in the conditioning set or their descendants, and no non-collider conditioning variables in the path, we say this path d-connects X and Y and we say they are graphically dependent. Note that colliders are relative to a path. For example, in the accompanying figure, C is a collider for the path A − B − C − D but not for the path A − B − C − E. Collider example Consider the belief network A → B ← C. Here A and C are conditionally independent. However, if we condition them on B, i.e. P (A, C|B), then they become graphically dependent. That is, we belief the root “causes” of A and C to be independent, but given B we learn something about both the causes of A and C, which couples them, making them (graphically) dependent. Note that the term “causes” is used loosely here; belief networks really only make independence statements, not necessarily causal ones. (TODO the below is another set of notes for bayes’ nets, incorporate these two) Independence allows us to more compactly represent joint probability distributions, in that independent random variables can be represented as smaller, separate probability distributions. For example, if we have binary random variables A, B, C, D, we would have a joint probability table of 24 entries. However, if we know that A, B is independent of C, D, then we only need two joint probability tables of 22 entries each. Typically, independent is too strong an assumption to make for real-world applications, but we can often make the weaker, yet still useful assumption of conditional independence. Conditional independence is when one variable makes another variable irrelevant (because the other variable adds no additional information), i.e. P (A|B, C) = P (A|B); knowing C adds no more information when we know B. 224 CHAPTER 9. PROBABILISTIC GRAPHICAL MODELS 225 9.2. BELIEF (BAYESIAN) NETWORKS For example, if C causes B and B causes A, then knowledge of B already implies C, so knowing about C is kind of useless for learning about A if we already know B. As a more concrete example, given random variables traffic T , umbrella U, and raining R, we could reasonably assume that U is conditionally independent of T given R, because rain is the common cause of the two and there is no direct relationship between U and T ; the relationship is through R. Similarly, given fire, smoke and an alarm, we could say that fire and alarm are conditionally independent given smoke. As mentioned earlier, we can apply conditional independence to simplify joint distributions. Take the traffic/umbrella/rain example from before. 
Their joint distribution is P (T, R, U, which we can decompose using the chain rule: P (T, R, U) = P (R)P (T |R)P (U|R, T ) If we make the conditional independence assumption from before (U and T are conditionally independent given R), then we can simplify this: P (T, R, U) = P (R)P (T |R)P (U|R) That is, we simplified P (U|R, T ) to P (U|R). We can describe complex joint distributions more simply with these conditional independence assumptions, and we can do so with Bayes’ nets (i.e. graphical models), which provide additional insight into the structure of these distributions (in particular, how variables interact locally, and how these local interactions propagate to more distant indirect interactions). A Bayes’ net is a directed acyclic graph. The nodes in the graph are the variables (with domains). They may be assigned (observed) or unassigned (unobserved). The arcs in the graphs are interactions between variables (similar to constraints in CSPs). They indicate “direct influence” between variables (not that this is not necessarily the same as causation, it’s about the information that observation of one variable gives about the other, which can mean causation, but not necessarily, e.g. it could simply be a hidden common underlying cause), which is to say that they encode conditional independences. For each node, we have a conditional distribution over the variable that node represents, conditioned on its parents’ values. Bayes’ nets implicitly encode joint distributions as a product of local conditional distributions: P (x1 , x2 , . . . , xn ) = n ∏ P (xi |parents(Xi )) i=1 This simply comes from the chain rule: CHAPTER 9. PROBABILISTIC GRAPHICAL MODELS 225 9.2. BELIEF (BAYESIAN) NETWORKS 226 P (x1 , x2 , . . . , xn ) = n ∏ P (xi |x1 , . . . , xi−1 ) i=1 And then applying conditional independence assumptions. The graph must be acyclic so that we can come up with a consistent ordering when we apply the chain rule (that is, decide the order for expanding the distributions). If the graph has cycles, we can’t come up with a consistent ordering because we will have loops. Note that arcs can be “reversed” (i.e. parent and children can be swapped) and encode the same joint distribution - so joint distributions can be represented by multiple Bayes’ nets. But some Bayes’ nets are better representations than others - some will be easier to work with; in particular, if the arcs do represent causality, the network will be easier to work with. Bayes’ nets are much smaller than representing such joint distributions without conditional independence assumptions. A joint distribution over N boolean variables takes 2n space (as demonstrated earlier). A Bayes’ net, on the other hand, where the N nodes each have at most k parents, only requires size O(N ∗ 2k+1 ). The Bayes’ net also encodes additional conditional independence assumptions in its structure. For example, the Bayes’ net X → Y → Z → W encodes the joint distribution: P (X, Y, Z, W ) = P (X)P (Y |X)P (Z|Y )P (W |Z) This structure implies other conditional independence assumptions, e.g. that Z is conditionally independent of X given Y , i.e. P (Z|Y ) = P (Z|X, Y ). More generally we might ask: given two nodes, are they independent given certain evidence and the structure of the graph (i.e. assignments of intermediary nodes)? We can use the d-separation algorithm to answer this question. First, we consider three configurations of triples as base cases, which we can use to deal with more complex networks. 
That is, any Bayes’ net can be decomposed into these three triple configurations. A simple configuration of nodes in the form of X → Y → Z is called a causal chain and encodes the joint distribution P (x, y , z ) = P (x)P (y |x)P (z|y ). X is not guaranteed to be (unconditionally) independent of Z. However, is X guaranteed to be conditionally independent of Z given Y ? From the definition of conditional probability, we know that: P (z|x, y ) = 226 P (x, y , z) P (x, y ) CHAPTER 9. PROBABILISTIC GRAPHICAL MODELS 227 9.2. BELIEF (BAYESIAN) NETWORKS With the Bayes’ net, we can simplify this (the numerator comes from the joint distribution the graph encodes, as demonstrated previously, and the denominator comes from applying the product rule): P (z|x, y ) = P (x)P (y |x)P (z|y ) P (x)P (y |x) Then, canceling a few things out: P (z|x, y ) = P (z|y ) So yes, X is guaranteed to be conditionally independent of Z given Y (i.e. once Y is observed). We say that evidence along the chain “blocks” the influence. Another configuration of nodes is a common cause configuration: Common cause configuration The encoded joint distribution is P (x, y , z) = P (y )P (x|y )P (z|y ). Again, X is not guaranteed to be (unconditionally) independent of Z. Is X guaranteed to be conditionally independent of Z given Y ? Again, we start with the definition of conditional probability: P (z|x, y ) = P (x, y , z) P (x, y ) Apply the product rule to the denominator and replace the numerator with the Bayes’ net’s joint distribution: P (z|x, y ) = P (y )P (x|y )P (z|y ) P (y )P (x|y ) Yielding: P (z|x, y ) = P (z|y ) CHAPTER 9. PROBABILISTIC GRAPHICAL MODELS 227 9.2. BELIEF (BAYESIAN) NETWORKS 228 So again, yes, X is guaranteed to be conditionally independent of Z given Y . Another triple configuration is the common effect configuration (also called v-structures): Common effect configuration X and Y are (unconditionally) independent here. However, is X guaranteed to be conditionally independent of Y given Z? No - observing Z puts X and Y in competition as the explanation for Z (this is called causal competition). That is, having observed Z, we think that X or Y was the cause, but not both, so now they are dependent on each other (if one happened, the other didn’t, and vice versa). Consider the following Bayes’ net: Example Bayes’ Net Where our random variables are rain R, dripping roof D, low pressure L, traffic T , baseball game B. The relationships assumed here are: low pressure fronts cause rain, rain or a baseball game causes traffic, and rain causes your friend’s roof to drip. Given that you observe traffic, the probability that your friend’s roof is dripping goes up - since perhaps the traffic is caused by rain, which would cause the roof to drip. This relationship is encoded in the graph the path between T and D. However - if we observe that it is raining, then observation of traffic has no more effect on D intuitively, this makes sense - we already know it’s raining, so seeing traffic doesn’t tell us more about the roof dripping. In this sense, observing R “blocks” the path between T and D. One exception here is the v-structure with R, B, T . Observing that a baseball game is happening affects our belief about it raining only if we have observed T . Otherwise, they are independent. So v-structures are “reversed” in some sense. 228 CHAPTER 9. PROBABILISTIC GRAPHICAL MODELS 229 9.2. BELIEF (BAYESIAN) NETWORKS That is, we must observe T to activate the path between R and B. 
Thus we make the distinction between active triples, in which information “flows” as it did with the path between T and D and between R and B when T is observed, and inactive triples, in which this information is “blocked”. Active triples are chain and common cause configurations in which the central node is not observed and common effect configurations in which the central node is observed, or common effect configurations in which some child node of the central node is observed. An example for the last case: Triple example If Z, A, B or C are observed, then the triple is active. Inactive triples are chain and common cause configurations in which the central node is observed and common effect configurations in which the central node is not observed. So now, if we want to know if two nodes X and Y are conditionally independent given some evidence variables {Z}, we check all undirected paths from X to Y and see if there are any active paths (by checking all its constituent triples). If there are none, then they are conditionally independent, and we say that they are d-separated. Otherwise, conditional independence is not guaranteed. This is the d-separation algorithm. You can apply d-separation to a Bayes net and get a complete list of conditional independences that are necessarily true given certain evidence. This tells you the set of probability distributions that can be represented. 9.2.1 • • • • Conditional independence assumptions Sally comes home and hears the alarm (A = 1) Has she been burgled? (B = 1) Or was the alarm triggered by an earthquake? (E = 1) She hears on the radio that there was an earthquake (R = 1) We start with P (A, B, E, R) and apply the chain rule of probability: P (A, B, E, R) = P (A|B, E, R)P (R|B, E)P (E|B)P (B) CHAPTER 9. PROBABILISTIC GRAPHICAL MODELS 229 9.2. BELIEF (BAYESIAN) NETWORKS 230 Then we can make some conditional independence assumptions: • The radio report has no effect on the alarm: P (A|B, E, R) → P (A|B, E) • A burglary has no effect on the radio report: P (R|B, E) → P (R|E) • A burglary would have no effect on the earthquake: P (E|B) → P (E) Thus we have simplified the computation of the joint probability distribution: P (A, B, E, R) = P (A|B, E)P (R|E)P (E)P (B) We can also construct a belief network out of these conditional independence assumptions: A simple belief network Say we are given the following probabilities: P (B = 1) = 0.01 P (E = 1) = 0.0000001 P (A = 1|B = 1, E = 1) = 0.9999 P (A = 1|B = 0, E = 1) = 0.99 P (A = 1|B = 1, E = 0) = 0.99 P (A = 1|B = 0, E = 0) = 0.0001 P (R = 1|E = 1) = 1 P (R = 1|E = 0) = 0 First consider if Sally has not yet heard the radio; that is, she has only heard the alarm (so the only evidence she has is A = 1). Sally wants to know if she’s been burgled, so her question is P (B = 1|A = 1): 230 CHAPTER 9. PROBABILISTIC GRAPHICAL MODELS 231 9.2. BELIEF (BAYESIAN) NETWORKS P (B = 1, A = 1) (Bayes’ rule) P (A = 1) ∑ E,R P (B = 1, A = 1, E, R) = ∑ (marginal prob to joint prob) B,E,R P (B, E, A = 1, R) ∑ P (A = 1|B = 1, E)P (B = 1)P (E)P (R|E) ∑ = E,R (chain rule w/ our indep. assumps) B,E,R P (A = 1|B, E)P (B)P (E)P (R|E) P (B = 1|A = 1) = ≈ 0.99 Now consider that Sally has also heard the report, i.e. R = 1. Now her question is P (B = 1|A = 1, R = 1): P (B = 1, A = 1, R = 1) (Bayes’ rule) P (A = 1, R = 1) ∑ E P (B = 1, A = 1, R = 1, E) = ∑ (marginal prob to joint prob) B,E P (A = 1, R = 1, B, E) ∑ P (A = 1|B = 1, E)P (B = 1)P (E)P (R = 1|E) = E∑ (chain rule w/ our indep. 
assumps) B,E P (A = 1|B, E)P (B)P (E)P (R = 1|E) P (B = 1|A = 1) = ≈ 0.01 So hearing the report and learning that there was an earthquake makes the burglary much less likely. We may, however, only have soft or uncertain evidence. For instance, say Sally is only 70% sure that she heard the alarm. We denote our soft evidence of the alarm’s ringing as à = (0.7, 0.3), which is to say P (A = 1) = 0.7 and P (A = 0) = 0.3. We’re ignoring the case with the report (R = 1) for simplicity, but with this uncertain evidence we would calculate: P (B = 1|Ã) = ∑ P (B = 1|A)P (A|Ã) A = 0.7P (B = 1|A = 1) + 0.3P (B = 1|A = 0) Unreliable evidence is distinct from uncertain evidence. Say we represent Sally’s uncertainty of hearing the alarm, as described before, as P (S|A) = 0.7. Now say for some reason we feel that Sally is unreliable for other reasons (maybe she lies a lot). We would then replace the term P (S|A) with our own interpretation P (H|A). For example, if Sally tells us her alarm went off, maybe we think that means there’s a 60% chance that the alarm actually went off. This new term P (H|A) is our virtual evidence, also called likelihood evidence. CHAPTER 9. PROBABILISTIC GRAPHICAL MODELS 231 9.2. BELIEF (BAYESIAN) NETWORKS 9.2.2 232 Properties of belief networks A note on the following graphics: the top part shows the belief network, where a faded node means it has been marginalized out, and a filled node means it has been observed/conditioned on. The bottom part shows the relationship between A and B after the marginalization/conditioning. P (A, B, C) = P (C|A, B)P (A)P (B) A and B are independent and determine C. If we marginalize over C (thus “removing” it), A and B are made conditionally independent. That is, P (A, B) = P (A)P (B). If we instead condition on C, A and B become graphically dependent. Although A and B are a priori independent, knowing something about C tells us a bit about A and B. If we introduce D as a child to C, i.e. D is a descendant of a collider C, then conditioning on D also makes A and B graphically dependent. 232 CHAPTER 9. PROBABILISTIC GRAPHICAL MODELS 233 9.2. BELIEF (BAYESIAN) NETWORKS In this arrangement, C is the “cause” and A and B are independent effects: P (A, B, C) = P (A|C)P (B|C)P (C). Here, marginalizing over C makes A and B graphically dependent. In general, P (A, B) ̸= P (A)P (B) because they share the same cause. Conditioning on C makes A and B independent: P (A, B|C) = P (A|C)P (B|C). This is because if you know the “cause” C then you know how the effects A and B occur independent of each other. The same applies for this arrangement - here A “causes” C and C “causes” B. Conditioning on C blocks A’s ability to influence B. These graphs all encode the same conditional independence assumptions. For both directed and undirected graphs, two graphs are Markov equivalent if they both represent the same set of conditional independence statements. CHAPTER 9. PROBABILISTIC GRAPHICAL MODELS 233 9.2. BELIEF (BAYESIAN) NETWORKS 234 234 CHAPTER 9. PROBABILISTIC GRAPHICAL MODELS 235 9.2. 
BELIEF (BAYESIAN) NETWORKS 9.2.3 Example Consider a joint distribution over the following random variables: • • • • • G, grade: g 1 for A, g 2 for B, g 3 for C I, intelligence, binary: −i for low, +i for high D, difficulty of the course, binary: −d for easy, +d for hard S, SAT score, binary: −s for low, +s for high L, reference letter, binary: −l for not received, +l for received We can encode some conditional independence assumptions about these random variables into a belief net: An example belief network for this scenario • the grade depends on the student’s intelligence and difficulty of the course • the student’s SAT score seems dependent on only their intelligence • whether or not a student receives a recommendation letter depends on their grade Note that we could add the assumption that intelligence students are likely to take more difficult courses, if we felt strongly about it: To turn this graph into a probability distribution, we can represent each node as a CPD: Then we can apply the chain rule of Bayesian networks which just multiplies all the CPDs: P (D, I, G, S, L) = P (D)P (I)P (G|I, D)P (S|I)P (L|G) A Bayesian network (BN) is a directed acyclic graph where its nodes represent the random variables X1 , . . . , Xn . For each node Xi we have a CPD P (Xi |ParG (Xi ), where ParG (Xi ) refers to the parents of Xi in the graph G. In whole, the BN represents a joint distribution via the chain rule for BNs: CHAPTER 9. PROBABILISTIC GRAPHICAL MODELS 235 9.2. BELIEF (BAYESIAN) NETWORKS 236 An alternative belief network for this scenario The belief network annotated with nodes’ distributions 236 CHAPTER 9. PROBABILISTIC GRAPHICAL MODELS 237 9.2. BELIEF (BAYESIAN) NETWORKS P (X1 , . . . , Xn ) = ∏ P (Xi |ParG (Xi )) i We say a probability distribution P factorizes over a BN graph G if the B chain rule holds for P . There are three types of reasoning that occur with a BN: • Causal reasoning includes conditioning on an ancestor to determine a descendant’s probability, e.g. P (L = 1|I = 0). • Evidential reasoning goes the other way: given a state for a descendant, get the probability for an ancestor, e.g. P (I = 0|G = 3). • Intercausal reasoning - consider P (I = 1|G = 3, D = 1). The D node is not directly connected to the I node, yet conditioning on it does affect the probability. As the simplest example of intercausal reasoning, consider an OR gate: An OR gate as a belief network Knowing Y and X1 (or X2 ) tells you the value of X2 (or X1 ) even though X1 and X2 are not directly linked. Knowing Y alone does not tell you anything about X1 or X2 ’s values. There are a few different structures in which a variable X can influence a variable Y , i.e. change beliefs in Y when conditioned on X: • • • • • X X X X X →Y ←Y →W →Y ←W ←Y ←W →Y Which the different reasonings described above capture. The one structure which “blocks” influence is X → W ← Y . That is, where two causes have a joint effect. This is called a v-structure. A trail is a sequence of nodes that are connected to each other by single edges in the graph. A trail X1 − · · · − Xk is active (if there is no evidence) if it has no v-structures Xi−1 → Xi ← Xi+1 , where Xi is the block. When can variable X can influence a variable Y given evidence Z? CHAPTER 9. PROBABILISTIC GRAPHICAL MODELS 237 9.2. 
BELIEF (BAYESIAN) NETWORKS 238 • X→Y • X←Y X may influence Y given evidence Z under certain conditions, depending on whether or not node W is part of the evidence Z: • • • • X X X X →W ←W ←W →W →Y, ←Y, →Y, ←Y, if if if if W ∈ /Z W ∈ /Z W ∈ / inZ either W ∈ Z or one of W ’s descendants ∈ Z (intercausal reasoning) A trail X1 − · · · − Xk is active given evidence Z if, for any v-structure Xi−1 → Xi ← Xi+1 we have that Xi or one of its descendants is in Z and no other Xi (not in v-structures) is in Z. 9.2.4 Independence For events α, β, we say P satisfies the independence of α and β, notated P ⊨ α ⊥ β if: • P (α, β) = P (α)P (β) • P (α|β) = P (α) • P (β|α) = P (β) This can be generalized to random variables: X, Y, P ⊨ X ⊥ Y if: • P (X, Y ) = P (X)P (Y ) • P (X|Y ) = P (X) • P (Y |X) = P (Y ) 9.2.5 Conditional independence For (sets of) random variables X, Y, Z, P ⊨ (X ⊥ Y |Z) if: • • • • P (X, Y |Z) = P (X|Z)P (Y |Z) P (X|Y, Z) = P (X|Z) P (Y |X, Z) = P (Y |Z) P (X, Y, Z) ∝ ϕ1 (X, Z)ϕ2 (Y, Z); that is, the probability of the joint distribution P (X, Y, Z) is proportional to a product of the two factors ϕ1 (X, Z) and ϕ2 (Y, Z) For example: There are two coins, one is fair and one is biased to show heads 90% of the time. 238 CHAPTER 9. PROBABILISTIC GRAPHICAL MODELS 239 9.2. BELIEF (BAYESIAN) NETWORKS Coin toss example You pick a coin, toss it, and it comes up heads. The probability of heads is higher in the second toss. You don’t know what coin you have but heads on the first toss makes it more likely that you have the bias coin, thus a higher chance of heads on the second toss. So X1 and X2 are not independent. But if you know what coin you have, the tosses are then independent; the first toss doesn’t tell you anything about the second anymore. That is, X1 ⊥ X2 |C. But note that conditioning can also lose you independence. For example, using the previous student example, I ⊥ D, but if we condition on grade G, they are no longer independent (this is the same as the OR gate example). Student example We say that X and Y are d-separated in G given Z if there is no active trail in G between X and Y given Z. This is notated d-sepG (X, Y |Z). If P factorizes over G and d-sepG (X, Y |Z), then P satisfies X ⊥ Y |Z). Any node is d-separated from its non-descendants given its parents. So if a distribution P factorizes over G, then in P , any variable is independent of its non-descendants given its parents. We can notate the set of independencies implicit in a graph G, that is, all of the independence statements that correspond to d-separation statements in the graph G, as I(G): I(G) = {(X ⊥ Y |Z)|d-sepG (X, Y |Z)} If P satisfies I(G), then we say that G is an I-map (independency map) of P . CHAPTER 9. PROBABILISTIC GRAPHICAL MODELS 239 9.2. BELIEF (BAYESIAN) NETWORKS 240 This does not mean G must imply all independencies in P , just that those that it does imply are in fact present in P . SO if P factorizes over G, then G is an I-map for P . The converse also holds: if G is an I-map for P , then P factorizes over G. 9.2.6 Template models Within a model you may have structures which repeat throughout or you may want to reuse common structures between/across models. In these cases we may use template variables. A template variable X(U1 , . . . , Uk ) is instantiated multiple times. U1 , . . . , Uk are the arguments. A template model is a language which specifies how “ground” variables inherit dependency models from templates. 
9.2.7 Temporal models A common example of template models are temporal models, used for systems which evolve over time. When representing a distribution over continuous time, you typically want to discretize time so that it is not continuous. To do this, you pick a time granularity ∆. We also have a set of template variables. X (t) describes an instance of a template variable X at time t∆. ′ ′ X (t:t ) = {X (t) , . . . , X (t ) }wheret ≤ t ′ ′ That is, X (t:t ) denotes the set of random template variables that spans these time points. ′ We want to represent P (X (t:t ) ) for any t, t ′ . To simplify this, we can use the Markov assumption, a type of conditional independence assumption. Without this assumption, we have: P (X (0:T ) = P (X (0) ) T∏ −1 P (X (t+1) |X (0:t) ) t=0 (this is just using the chain rule for probability) Then the Markov assumption is (X (t+1) ⊥ X (0:t−1) |X (t) ). That is, any time point is independent of the past, given the present. So then we can simplify our distribution: 240 CHAPTER 9. PROBABILISTIC GRAPHICAL MODELS 241 9.2. BELIEF (BAYESIAN) NETWORKS P (X (0:T ) = P (X (0) ) T∏ −1 P (X (t+1) |X (t) ) t=0 The Markov assumption isn’t always appropriate, or it may be too strong. You can make it a better approximation by adding other variables about the state, in addition to X (t) . The second assumption we make is of time invariance. We use a template probability model P (X ′ |X) where X ′ denotes the next time point and X denotes the current time point. We assume that this model is replicated for every single time point. That is, for all t: P (X (t+1) |X (t) ) = P (X ′ |X) That is, the probability distribution is not influenced by the time t. Again, this is an approximation and is not always appropriate. Traffic, for example, has a different dynamic depending on what time of day it is. Again, you can include extra variables to capture other aspects of the state of the world to improve the approximation. Temporal model example (transition model) Temporal model example • W = weather • V = velocity • L = location CHAPTER 9. PROBABILISTIC GRAPHICAL MODELS 241 9.2. BELIEF (BAYESIAN) NETWORKS 242 • F = failure • O = observation The left column of the graph is at time slice t, and the right side is at time slice t + 1. The edges connecting the nodes at t to the nodes at t + 1, e.g. F → F ′ , is an inter-time-slice, and the edges connecting nodes at t + 1 to the observation, e.g. F ′ → O′ , are intra-time-slices. We can describe a conditional probability distribution (CPD) for our prime variables as such: P (W ′ , V ′ , L′ , F ′ , O′ |W, V, L, F ) We don’t need a CPD for the non-prime variables because they have already “happened”. We can rewrite this distribution with the independence assumptions in the graph: P (W ′ , V ′ , L′ , F ′ , O ′ |W, V, L, F ) = P (W ′ |W )P (V ′ |W, V )P (L′ |L, V )P (F ′ |F, W )P (O′ |L′ , F ′ ) Here the observation O′ is conditioned on variables in the same time slice (L′ , F ′ ) because we assume the observation is “immediate”. This is a relation known as an intra-time-slice. All the other variables are conditioned on the previous time slice, i.e. they are inter-time-slice relations. Now we start with some initial state (time slice 0, t0 ): Temporal model example initial state Then we add on the next time slice, t1 : And we can repeatedly do this to represent all subsequent time slices t2 , . . . , where each is conditioned on the previous time slice. So we have a 2-time-slice Bayesian network (2TBN). A transition model (2TBN) over X1 , . . . 
, Xn is specified as a BN fragment such that: • the nodes include X1′ , . . . , Xn′ (next time slice t + 1) and a subset of X1 , . . . , Xn (time slice t). • only the nodes X1′ , . . . , Xn′ have parents and a CPD 242 CHAPTER 9. PROBABILISTIC GRAPHICAL MODELS 243 9.2. BELIEF (BAYESIAN) NETWORKS Temporal model example at t1 The 2TBN defines a conditional distribution using the chain rule: P (X ′ |X) = n ∏ P (Xi′ |Pa(Xi′ )) i=1 9.2.8 Markov Models We can consider a Markov model as a chain-structured Bayes’ Net, so our reasoning there applies here as well. Each node is a state in the sequence and each node is identically distributed (stationary) and depends on the previous state, i.e. P (Xt |Xt−1 ) (except for the initial state P (X1 )). This is essentially just a conditional independence assumption (i.e. that P (Xt ) is conditionally independent of Xt−2 , Xt−3 , . . . , X1 given Xt−1 ). The parameters of a Markov model are the transition probabilities (or dynamics) and the initial state probabilities (i.e. the initial distribution P (X1 ). Say we want to know P (X) at time t. A Markov model algorithm for solving this is the forward algorithm, which is just an instance of variable elimination (in the order X1 , X2 , . . . ). A simplified version: ∑ P (xt ) = P (xt |xt−1 )P (xt−1 ) xt−1 Assuming P (x1 ) is known. P (Xt ) converges as t → ∞, and it converges to the same values regardless of the initial state. This converged distribution, independent of the initial state, is called the stationary distribution. The influence of the initial state fades away as t → ∞. CHAPTER 9. PROBABILISTIC GRAPHICAL MODELS 243 9.2. BELIEF (BAYESIAN) NETWORKS 244 The key insight for a stationary distribution is that P (Xt ) = P (Xt−1 ), and that this is independent of the initial distribution. Formally, the stationary distribution satisfies: P∞ (X) = P∞+1 (X) = ∑ Pt+1|t (X|x)P∞ (x) x 9.2.9 Dynamic Bayes Networks (DBNs) A dynamic Bayes’ net (DBN) is a Bayes’ net replicated through time, i.e. variables at time t can be conditioned on those from time t − 1 (the structure is reminiscent of a recurrent neural network). A dynamic Bayesian network over X1 , . . . , Xn is defined by a: • 2TBN BN→ over X1 , . . . , Xn • a Bayesian network BN(0) over X1(0) , . . . , Xn(0) (time 0, i.e. the initial state) Ground network For a trajectory over 0, . . . , T , we define a ground (unrolled network) such that: • the dependency model for X1(0) , . . . , Xn(0) is copied from BN(0) • the dependency model for X1(t) , . . . , Xn(t) for all t > 0 is copied from BN→ That is, it is just an aggregate (“unrolled”) of the previously shown network up to time slice tT . Hidden Markov Models (HMMs) Often we have a sequence of observations and we want to use these observations to learn something about the underlying process that generated them. As such we need to introduce time or space to our models. An Hidden Markov Model (HMM) is a simple dynamic Bayes’ net. In particular, it is a Hidden Markov Model Markov model in which we don’t directly observe the state. That is, there is a Markov chain where we don’t see St but rather we see some evidence/observations/emissions/outputs/effects/etc Ot . The actual observations are stochastic (e.g. an underlying state may produce one of many observations with some probability). We try to infer the state based on these observations. 244 CHAPTER 9. PROBABILISTIC GRAPHICAL MODELS 245 9.2. BELIEF (BAYESIAN) NETWORKS For example, imagine we are in a windowless room and we want to know if it’s raining. 
We can’t directly observe whether it’s raining, but we can see if people have brought umbrellas with them. It is also a 2TBN. HMMs are used to analyze or to predict time series involving noise or uncertainty. There is a sequence of states s1 → s2 → s3 → · · · → SN . This sequence is a Markov chain (each state depends only on the previous state). Each state emits a measurement/observation, e.g. s1 emits z1 (s1 → z1 ), s2 emits z2 (s2 → z2 ), and so on. We don’t deserve the states directly; we only observe these measurements (hence, the underlying Markov model is “hidden”). Together, these define a Bayes network that is at the core of HMMs. An HMM is defined by: • • • • a state variable S and an observation (sometimes called emission) variable O the initial distribution P (S0 ) the transition model P (S ′ |S) the observation model P (O|X) (the probability of seeing evidence given the hidden state, also called an emissions model) We introduce an additional conditional independence assumption - that the current observation is independent of everything else given the current state. Basic HMM You can unroll this: Basic HMM, unrolled CHAPTER 9. PROBABILISTIC GRAPHICAL MODELS 245 9.2. BELIEF (BAYESIAN) NETWORKS 246 HMMs, however, may also have internal structures, more commonly in the transition model, but sometimes in the observation model as well. TODO in the following X is switched with S, make it consistent Example Say we have the following HMM: Hidden Markov Model Example We don’t know the starting state, but we know the probabilities: 1 2 1 P (S0 ) = 2 P (R0 ) = Say on the first day we see that this person is happy and we want to know whether or not it is raining. That is: P (R1 |H1 ) We can use Bayes’ rule to compute this posterior: P (R1 |H1 ) = P (H1 |R1 )P (R1 ) P (H1 ) We can compute these values by hand: P (R1 ) = P (R1 |R0 )P (R0 ) + P (R1 |S0 )P (S0 ) P (H1 ) = P (H1 |R1 )P (R1 ) + P (H1 |S1 )P (S1 ) P (H1 |R1 ) can be pulled directly from the graph. Then you can just run the numbers. 246 CHAPTER 9. PROBABILISTIC GRAPHICAL MODELS 247 9.2. BELIEF (BAYESIAN) NETWORKS Inference base cases in an HMM The first base case: consider the start of an HMM: P (X1 ) → P (E1 |X1 ) Inferring P (X1 |e1 ), that is, P (X1 ) given we observe a piece of evidence e1 , is straightforward: P (x1 , e1 ) P (e1 ) P (e1 |x1 )P (x1 ) = P (e1 ) P (x1 |e1 ) = ∝X1 P (e1 |x1 )P (x1 ) That is, we applied the definition of conditional probability and then expanded the numerator with the product rule. For an HMM, P (E1 |X1 ) and P (X1 ) are specified, so we have the information needed to compute this. We just compute P (e1 |X1 )P (X1 ) and normalize the resulting vector. The second base case: Say we want to infer P (X2 ), and we just have the HMM: X1 → X2 That is, rather than observing evidence, time moves forward one step. For an HMM, P (X1 ) and P (X2 |X1 ) are specified. So we can compute P (X2 ) like so: P (x2 ) = ∑ P (x2 , x1 ) x1 = ∑ P (x2 |x1 )P (x1 ) x1 From these two base cases we can do all that we need with HMMs. Passage of time Assume that we have the current belief P (X|evidence to date): B(Xt ) = P (Xt |e1:t ) After one time step passes, we have: CHAPTER 9. PROBABILISTIC GRAPHICAL MODELS 247 9.2. 
BELIEF (BAYESIAN) NETWORKS 248 P (Xt+1 |e1:t ) = ∑ P (Xt+1 |xt )P (xt |e1:t ) xt Which can be written compactly as: B ′ (Xt+1 ) = ∑ P (X ′ |x)B(xt ) xt Intuitively, what is happening here is: we look at each place we could have been, xt , consider how likely it was that we were there to begin with, B(xt ), and multiply it by the probability of getting to X ′ had you been there. Observing evidence Assume that we have the current belief P (X|previous evidence): B ′ (Xt+1 ) = P (Xt+1 |e1:t ) Then: P (Xt+1 |e1:t+1 ) ∝ P (et+1 |Xt+1 )P (Xt+1 |e1:t ) See the above base case for observing evidence - this is just that, and remember, renormalize afterwards. Another way of putting this: B(Xt+1 ) ∝ P (e|X)B ′ (Xt+1 ) The Forward Algorithm Now we can consider the forward algorithm (the one presented previously was a simplification). We are given evidence at each time and want to know: Bt (X) = P (Xt |e1:t ) We can derive the following updates: 248 CHAPTER 9. PROBABILISTIC GRAPHICAL MODELS 249 9.2. BELIEF (BAYESIAN) NETWORKS P (xt |e1:t ) ∝X P (xt , e1 : t) = ∑ P (xt−1 , xt , e1:t ) xt−1 = ∑ P (xt−1 , e1:t−1 )P (xt |xt−1 )P (et |xt ) xt−1 = P (et |xt ) ∑ P (xt−1 , e1:t−1 ) xt−1 Which we can normalize at each step (if we want P (x|e) at each time step) or all together at the end. This is just variable elimination with the order X1 , X2 , . . . . This computation is proportional to the square number of states. Most Likely Explanation With Most Likely Explanation, the concern is not the state at time t, but the most likely sequence of states that led to time t, given observations. For MLE, we use an HMM and instead we want to know: argmax P (x1:t |e1:t ) x1:t We can use the Viterbi algorithm to solve this, which is essentially just the forward algorithm where ∑ the is changed to a max: mt [xt ] = max P (x1:t−1 , xt , e1:t ) x1:t−1 = P (et |xt ) max P (xt |xt−1 )mt−1 [xt−1 ] xt−1 In contrast, the forward algorithm: ft [xt ] = P (xt , e1:t ) = P (et |xt ) ∑ P (xt |xt−1 )ft−1 [xt−1 ] xt−1 9.2.10 Plate models A common template model is a plate model. Say we are repeatedly flipping a coin. The surrounding box is the plate. The idea is that these are “stacked”, one for each toss t. That is, they are indexed by t. The θ node denotes the CPD parameters. This is outside the plate, i.e. it is not indexed by t. CHAPTER 9. PROBABILISTIC GRAPHICAL MODELS 249 9.2. BELIEF (BAYESIAN) NETWORKS 250 Simple plate model Simple plate model, alternative representation 250 CHAPTER 9. PROBABILISTIC GRAPHICAL MODELS 251 9.2. BELIEF (BAYESIAN) NETWORKS Another way of visualizing this: Where o(ti ) is the outcome at time ti . This representation makes it more obvious that each of these plates is a copy of a template. Another example: A plate model for students Plates may be nested: If we were to draw this out for two courses and two students: One oddity here is that now intelligence depends on both the student s and the course c, whereas before it depends only on the student s. Maybe this is desired, but let’s say we want what we had before. That is, we want intelligence to be independent of the course c. Instead, we can use overlapping plates: Plate models allow for collective inference, i.e. they allow us to look at the aggregate of these individual instances in order to find broader patterns. More formally, a plate dependency model: For a template variable A(U1 , . . . , Uk ) we have template parents B1 (U1 ), . . . , Bm (Um ); that is, an index cannot appear in the parent which does not appear in the child. 
This is a particular limitation of plate models. We get the following template CPD: P (A|B1 , . . . , Bm ). 9.2.11 Structured CPDs We can represent CPDs in tables, e.g. CHAPTER 9. PROBABILISTIC GRAPHICAL MODELS 251 9.2. BELIEF (BAYESIAN) NETWORKS 252 A nested plate model for students Unrolled student plate model 252 CHAPTER 9. PROBABILISTIC GRAPHICAL MODELS 253 9.2. BELIEF (BAYESIAN) NETWORKS An overlapping plate model for students g1 g2 g3 i0 , d 0 i0 , d 1 i1 , d 0 i1 , d 1 But as we start to have more variables, this table can explode in size. More generally, we can just represent a CPD P (X|Y1 , . . . , Yk ), which specifies a distribution over X for each assignment Y1 , . . . , Yk using any function which specifies a factor ϕ(X, Y1 , . . . , Yk ) such that: ∑ ϕ(x, y1 , . . . , yk ) = 1 x for all y1 , . . . , yk . There are many models for representing CPDs, including: • • • • • deterministic CPDs tree-structured CPDs logistic CPDs and generalizations noisy OR/AND linear Gaussians and generalizations Context-specific independence shows up in some CPD representations. It is a type of independence where we have a particular assignment c, from some set of variables C, P ⊨ (X ⊥c Y |Z, c) Which is to say this independence holds only for particular values of c, rather than all values of c. For example, consider: CHAPTER 9. PROBABILISTIC GRAPHICAL MODELS 253 9.2. BELIEF (BAYESIAN) NETWORKS 254 • X ⊥ Y1 |y20 . When Y2 is false, X just takes on the value of Y1 , so there’s no context-specific independence here. • X ⊥ Y1 |y21 . When Y2 is true, then it doesn’t matter what value Y1 takes, since X will be true too. Thus we have context-specific independence. • Y1 ⊥ Y2 |x 0 . If we know X is false, we already know Y1 , Y2 are false, independent of each other. So we have context-specific independence here. • Y1 ⊥ Y2 |x 1 . We don’t have context-specific independence here. Tree-structured CPDs Say we have the following model: Simple model That is, whether or not a student gets a job J depends on: • A - if they applied (+a, −a) • L - if they have a letter of recommendation (+l, −l) • S - if they scored well on the SAT (+s, −s) We can represent the CPD as a tree structure. Note that the notation at the leaf nodes is the probability of not getting the job and of getting it, i.e. (P (−j), P (+j). A bit more detail: we’re assuming its possible that the student gets the job without applying, e.g. via a recruiter, in which case the SAT score and letter aren’t important. We also assume that if the student scored well on the SAT, the letter is unimportant. We have three binary random variables. If we represented this CPD as a table, it have 23 = 8 conditional probability distributions. However, in certain contexts we only need 4 distributions since we have some context-specific independences: • J ⊥c L| + a, +s • J ⊥c L, S| − a • J ⊥c L| + s, A This last one is just a compact representation of: 254 CHAPTER 9. PROBABILISTIC GRAPHICAL MODELS 255 9.2. BELIEF (BAYESIAN) NETWORKS Tree-structured CPD • J ⊥c L| + s, +a • J ⊥c L| + s, −a Consider another model: Another model Where the student chooses only one letter to submit. The tree might look like: Here the choice variable C determines the dependence of one set of circumstances on another set of circumstances. This scenario has context-specific independence but also non-context-specific independence: L1 ⊥ L2 |J, C Because, if you break it down into its individual cases: CHAPTER 9. PROBABILISTIC GRAPHICAL MODELS 255 9.2. 
Tree-structured CPD

• L1 ⊥c L2 | J, c1
• L1 ⊥c L2 | J, c2

both are true.

This scenario relates to a class of CPDs called multiplexer CPDs:

Multiplexer CPD

Y has two lines around it to indicate deterministic dependence. Here we have some variables Z1, ..., Zk and A is a copy of one of these variables. A is the multiplexer, i.e. the "selector variable", taking a value from {1, ..., k}. For a multiplexer CPD, we have:

P(Y | A, Z1, ..., Zk) = 1 if Y = Z_A, 0 otherwise

That is, the value of A just determines which Z value Y takes on.

Noisy OR CPDs

Noisy OR CPD

In a noisy OR CPD we introduce intermediary variables between the Xi and Y. Each intermediary variable Zi takes on the value 1 if its parent's value satisfies its criterion. Y becomes an OR variable which is true if any of the Zi variables are true. That is:

P(Zi = 1 | Xi = xi) = 0 if xi = 0, λi if xi = 1

where λi ∈ [0, 1]. So if xi = 0, Zi never gets turned on. If xi = 1, Zi gets turned on with probability λi. Z0 is a "leak" probability, which is the probability that Y gets turned on by itself: P(Z0 = 1) = λ0.

We can write this as a probability and consider the CPD of Y = 0 given our X variables. That is, what is the probability that all the X variables fail to turn on their corresponding Z variables?

P(Y = 0 | X1, ..., Xk) = (1 − λ0) ∏_{i : xi = 1} (1 − λi)

where (1 − λ0) is the probability that Y doesn't get turned on by the leak. Thus:

P(Y = 1 | X1, ..., Xk) = 1 − P(Y = 0 | X1, ..., Xk)

A noisy OR CPD demonstrates independence of causal influence. We are assuming that we have a bunch of causes X1, ..., Xk for a variable Y, each of which acts independently to affect the truth of Y. That is, there is no interaction between the causes. Other CPDs for independence of causal influence include noisy AND, noisy MAX, etc.

Continuous variables

Consider:

Continuous variable example (simple)

We have the temperature in a room and a sensor which measures the temperature. The sensor is not perfect, so its reading is usually around the right temperature, but not exactly. We can represent this by saying the sensor reading S is normally distributed around the true temperature T with some standard deviation, i.e.:

S ∼ N(T; σ_S²)

This model is a linear Gaussian. We can make it more complex, assuming that the outside temperature will also affect the room temperature:

Continuous variable example (more complex)

Where T′ is the temperature in a few moments and O is the outside temperature. We may say that T′ is also a linear Gaussian:

T′ ∼ N(αT + (1 − α)O; σ_T²)

The αT + (1 − α)O term is just a mixture of the current temperature and the outside temperature.

We can take it another step. Say there is a door D in the room which is either open or closed (i.e. it is a binary random variable). Now T′ is described as:

Continuous variable example (even more complex)

T′ ∼ N(α0 T + (1 − α0)O; σ_0T²) if D = 0
T′ ∼ N(α1 T + (1 − α1)O; σ_1T²) if D = 1

This is a conditional linear Gaussian model, since its parameters are conditioned on the discrete variable D.

Generally, a linear Gaussian model looks like:

Basic linear Gaussian

Y ∼ N(w0 + Σi wi Xi; σ²)

where w0 + Σi wi Xi is the mean (a linear function of the parents) and σ² does not depend on the parents.
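To make the linear Gaussian CPD concrete, here is a minimal Python sketch of sampling the temperature example above. The numerical parameter values (the α mixing weights and standard deviations) are made up purely for illustration and are not from the text:

    import random

    def sample_linear_gaussian(parent_values, weights, w0, sigma):
        # Sample Y ~ N(w0 + sum_i w_i * x_i; sigma^2)
        mean = w0 + sum(w * x for w, x in zip(weights, parent_values))
        return random.gauss(mean, sigma)

    def sample_T_next(T, O, D):
        # Conditional linear Gaussian: parameters switch on the discrete parent D (door).
        # Illustrative (made-up) parameters:
        if D == 0:    # door closed: current room temperature dominates
            alpha, sigma = 0.9, 0.5
        else:         # door open: outside temperature matters more, more noise
            alpha, sigma = 0.6, 1.0
        return sample_linear_gaussian([T, O], [alpha, 1 - alpha], 0.0, sigma)

    # e.g. current room temperature 21, outside temperature 10, door open:
    print(sample_T_next(21.0, 10.0, D=1))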
Then, conditional linear Gaussians introduce one or more discrete parents (only one, A, is depicted below), and this is just a linear gaussian whose parameters depend on the value of A: Y ∼ N(wa0 + CHAPTER 9. PROBABILISTIC GRAPHICAL MODELS ∑ wai Xi ; σa2 ) 259 9.2. BELIEF (BAYESIAN) NETWORKS 9.2.12 260 Querying Bayes’s nets Conditional probability queries PGMs can be used to answer many queries, but the most common is probably conditional probability queries: Given evidence e about some variables E, we have a query which is a subset of variables Y , and our task is to compute P (Y |E = e). Unfortunately, the problem of inference on graphical models is NP-Hard. In particular, the following are NP-Hard: • • • • • exact inference given a PGM PΦ , a variable X and a value x ∈ Val(X), compute PΦ (X = x) even just deciding if PΦ (X = x) > 0 is NP-hard approximate inference let ϵ < 0.5. Given a PGM PΦ , a variable X, a value x ∈ Val(X), and an observation e ∈ Val(E), find a number p that has |PΦ (X = x|E = e) − p| < ϵ. However, NP-Hard is the worst case result and their are algorithms that perform for most common cases. Some conditional probability inference algorithms: • • • • • • • variable elimination message passing over a graph belief propagation variational approximations random sampling instantiations Markov Chain Monte Carlo (MCMC) importance sampling MAP (maximum a posteriori) queries PGMs can also answer MAP queries: We have a set of evidence E = e, the query is all other variables Y , i.e. Y = {X1 , . . . , Xn } − E. Our task is to compute MAP(Y |E = e) = argmaxy P (Y = y |E = e). There may be more than one possible solution. This is also a NP-hard problem, but there are also many algorithms to solve these efficiently for most cases. Some MAP inference algorithms: • variable elimination 260 CHAPTER 9. PROBABILISTIC GRAPHICAL MODELS 261 • • • • • 9.2. BELIEF (BAYESIAN) NETWORKS message passing over a graph max-product belief propagation using methods for integer programming for some networks, graph-cut methods combinatorial search 9.2.13 Inference in Bayes’ nets Given a query, i.e. a joint probability distribution we are interested in getting a value for, we can infer an answer for that query from a Bayes’ net. The simplest approach is inference by enumeration in which we extract the conditional probabilities from the Bayes’ net and appropriately combine them together. But this is very inefficient, especially because variables that aren’t in the query require us to enumerate over all possible values for them. We lose most of the benefit of having this compact representation of joint distributions. An alternative approach is variable elimination, which is still NP-hard, but faster than enumeration. Variable elimination requires the notion of factors. Here are some factors: • a joint distribution: P (X, Y ), which is just all entries P (x, y ) for all x, y and sums to 1. Example: P (T, W ) T W P hot sun 0.4 hot rain 0.1 cold sun 0.2 cold rain 0.3 • a selected joint: P (x, Y ), i.e. we fix X = x, then look at all entries P (x, y ) for all y , and sums to P (x). This is a “slice” of the joint distribution. Example: P (cold, W ) CHAPTER 9. PROBABILISTIC GRAPHICAL MODELS 261 9.2. BELIEF (BAYESIAN) NETWORKS 262 T W P cold sun 0.2 cold rain 0.3 • a single conditional: P (Y |x), i.e. we fix X = x, then look at all entries P (y |x) for all y , and sums to 1. Example: P (W |cold) T W P cold sun 0.4 cold rain 0.6 • a family of conditionals: P (X, Y ), i.e. 
we have multiple conditions, all entries P (x|y ) for all x, y , and sums to |Y |. Example: P (W |T ) T W P hot sun 0.8 hot rain 0.2 cold sun 0.4 cold rain 0.6 • a specified family: P (y |X), i.e. we fix y and look at all entries P (y |x) for all x. Can sum to anything; Example: P (rain|T ) 262 CHAPTER 9. PROBABILISTIC GRAPHICAL MODELS 263 9.2. BELIEF (BAYESIAN) NETWORKS T W P hot rain 0.2 cold rain 0.6 In general, when we write P (Y1 , . . . , YN |X1 , . . . , XM ), we have a factor, i.e. a multi-dimensional array for which the values are all instantiations P (y1 , . . . , yN |x1 , . . . , xM ). Any assigned/instantiated X or Y is a dimension missing (selected) from the array, which leads to smaller factors - when we fix values, we don’t have to consider every possible instantiation of that variable anymore, so we have less possible combinations of variable values to consider. For example, if X and Y are both binary random variables, if we don’t fix either of them we have four to consider ((X = 0, Y = 0), (X = 1, Y = 0), (X = 0, Y = 1), (X = 1, Y = 1)) . If we fix, say X = 1, then we only have two to consider ((X = 1, Y = 0), (X = 1, Y = 1)). Consider a simple Bayes’ net: R→T →L Where R is whether or not it is raining, T is whether or not there is traffic, and L is whether or not we are late for class. We are given the following factors for this Bayes’ net: P (R) R P +r 0.1 -r 0.9 P (T |R) R T P +r +t 0.8 +r -t 0.2 -r +t 0.1 -r -t 0.9 P (L|T ) CHAPTER 9. PROBABILISTIC GRAPHICAL MODELS 263 9.2. BELIEF (BAYESIAN) NETWORKS 264 T L P +t +l 0.3 +t -l 0.7 -t +l 0.1 -t -l 0.9 For example, if we observe L = +l, so we can fix that value and shrink the last factor P (L|T ): P (+l|T ) T L P +t +l 0.3 -t 0.1 +l We can join factors, which gives us a new factor over the union of the variables involved. For example, we can join on R, which involves picking all factors involving R, i.e. P (R) and P (T |R), giving us P (R, T ). The join is accomplished by computing the entry-wise products, e.g. for each r, t, compute P (r, t) = P (r )P (t|r ): P (R, T ) R T P +r +t 0.08 +r -t -r +t 0.09 -r -t 0.02 0.81 After completing this join, the resulting factor P (R, T ) replaces P (R) and P (T |R), so our Bayes’ net is now: (R, T ) → L We can then join on T , which involves P (L|T ) and P (R, T ), giving us P (R, T, L): P (R, T, L) R 264 T L P +r +t +l 0.024 +r +t -l 0.056 CHAPTER 9. PROBABILISTIC GRAPHICAL MODELS 265 9.2. BELIEF (BAYESIAN) NETWORKS R T L P +r -t +l 0.002 +r -t -l 0.018 -r +t +l 0.027 -r +t -l 0.063 -r -t +l 0.081 -r -t -l 0.729 Now we have this joint distribution, and we can use the marginalization operation (also called elimination) on this factor - that is, we can sum out a variable to shrink the factor. We can only do this if the variable appears in only one factor. For example, say we still had our factor P (R, T ) and we wanted to get P (T ). We can do so by summing out R: P (T ) T P +t 0.17 -t 0.83 So we can take our full joint distribution P (R, T, L) and get P (T, L) by elimination (in particular, by summing out R): P (T, L) T L P +t +l 0.051 +t -l 0.119 -t +l 0.083 -t -l 0.747 Then we can further sum out T to get P (L): P (L) L P +l 0.134 -l CHAPTER 9. PROBABILISTIC GRAPHICAL MODELS 0.866 265 9.2. BELIEF (BAYESIAN) NETWORKS 266 This approach is equivalent to inference by enumeration (building up the full joint distribution, then taking it apart to get to the desired quantity). However, we can use these operations (join and elimination) to find “shortcuts” to the desired quantity (i.e. 
marginalize early without needing to build the entire joint distribution first). This method is variable elimination. For example, we can compute P (L) in a shorter route: • • • • join on R, as before, to get P (R, T ) then eliminate (sum out) R from P (R, T ) to get P (T ) then join on T , i.e. with P (T ) and P (L|T ), giving us P (T, L) the eliminate T , giving us P (L) In contrast, the enumeration method required: • • • • join on R to get P (R, T ) join on T to get P (R, T, L) eliminate R to get P (T ) eliminate T to get P (L) The advantage of variable elimination is that we never build a factor of more than two variables (i.e. the full joint distribution P (R, T, L)), thus saving time and space. The largest factor typically has the greatest influence over the computation complexity. In this case, we had no evidence (i.e. no fixed values) to work with. If we had evidence, we would first shrink the factors involving the observed variable, and the evidence would be retained in the final factor (since we can’t sum it out once it’s observed). For example, say we observed R = +r . We would take our initial factors and shrink those involving R: P (+r ) R P +r 0.1 P (T | + r ) R T P +r +t 0.8 +r -t 0.2 And we would eventually end up with: 266 CHAPTER 9. PROBABILISTIC GRAPHICAL MODELS 267 9.2. BELIEF (BAYESIAN) NETWORKS P (+r, L) R L P +r +l 0.026 +r -l 0.074 And then we could get P (L| + r ) by normalizing P (+r, L): P (L| + r ) L P +l 0.26 -l 0.74 More concretely, the general variable elimination algorithm is such: • start with a query P (Q|E1 = e1 , . . . , Ek = ek ), where Q are your query variables • start with initial factors (i.e. local conditional probability tables instantiated by the evidence E1 , . . . , Ek , i.e. shrink factors involving the evidence) • while there are still hidden variables (i.e. those in the net that are not Q or any of the evidence E1 , . . . , Ek ) • pick a hidden variable H • join all factors mentioning H • eliminate (sum out) H • then join all remaining factors and normalize. The resulting distribution will be P (Q|e1 , . . . , ek ). The order in which you eliminate variables affects computational complexity in that some orderings generate larger factors than others. Again, the factor size is what influences complexity, so you want to use orderings that produce small factors. For example, if a variable is mentioned in many factors, you generally want to avoid computing that until later on (usually last). This is because a variable mentioned in many factors means joining over many factors, which will probably produce a very large factor. We can encode this in the algorithm by telling it to choose the next hidden variable that would produce the smallest factor (since factor sizes are relatively easy to compute without needing to actually produce the factor, just look at the number and sizes of tables that would have to be joined). Unfortunately there isn’t always an ordering with small factors, so variable elimination is great in many situations, but not all. CHAPTER 9. PROBABILISTIC GRAPHICAL MODELS 267 9.2. BELIEF (BAYESIAN) NETWORKS 268 Sampling Another method for Bayes’ net inference is sampling. This is an approximate inference method, but it can be much faster. Here, “sampling” essentially means “repeated simulation”. The basic idea: • draw N samples from a sampling distribution S • compute an approximate posterior probability • with enough samples, this converges to the true probability P Sampling from a given distribution: 1. 
Get sample u from a uniform distribution over [0, 1] 2. Convert this sample u into an outcome for the given distribution by having each outcome associated with a sub-interval of [0, 1) with sub-interval size equal to the probability of the outcome For example, if we have the following distribution: C P(C) red 0.6 green 0.1 blue 0.3 Then we can map u to C in this way: red if0 ≤ u < 0.6 c = green if0.6 ≤ u < 0.7 blue if0.7 ≤ u < 1 There are many different sampling strategies for Bayes’ nets: • • • • prior sampling rejection sampling likelihood weighting Gibbs sampling In practice, you typically want to use either likelihood weighting or Gibbs sampling. 268 CHAPTER 9. PROBABILISTIC GRAPHICAL MODELS 269 9.2. BELIEF (BAYESIAN) NETWORKS Prior sampling We have a Bayes’ net, and we want to sample the full joint distribution it encodes, but we don’t want to have to build the full joint distribution. Imagine we have the following Bayes’ net: P (C) → P (R|C) P (C) → P (S|C) P (R|C) → P (W |S, R) P (S|C) → P (W |S, R) Where C, R, S, W are binary variables (i.e. C can be +c or −c). We start from P (C) and sample a value c from that distribution. Then we sample r from P (R|C) and s from P (S|C) conditioned on the value c we sampled from P (C). Then we sample from P (W |S, R) conditioned on the sampled r, s values. Basically, we walk through the graph, sampling from the distribution at each node, and we choose a path through the graph such that we can condition on previously-sampled variables. This generates one final sample across the different variables. If we want more samples, we have to repeat this process. Prior sampling (SP S ) generates samples with probability: SP S (x1 , . . . , xn ) = n ∏ P (xi |Parents(Xi )) = P (x1 , dots, xn ) i=1 That is, it generates samples from the actual joint distribution the Bayes’ net encodes, which is to say that this sampling procedure is consistent. This is worth mentioning because this isn’t always the case; some sampling strategies sample from a different distribution and compensate in other ways. Then we can use these samples to estimate P (W ) or other quantities we may be interested in, but we need many samples to get good estimates. Rejection sampling Prior sampling can be overkill, since we typically keep samples which are irrelevant to the problem at hand. We can instead use the same approach but discard irrelevant samples. For instance, if we want to compute P (W ), we only care about values that W takes on, so we don’t need to keep the corresponding values for C, S, R. Similarly, maybe we are interested in P (C| + s) so we should only be keeping samples where S = +s. This method is called rejection sampling because we are rejecting samples that are irrelevant to our problem. This method is also consistent. CHAPTER 9. PROBABILISTIC GRAPHICAL MODELS 269 9.3. MARKOV NETWORKS 270 Likelihood Weighting A problem with rejection sampling is that if the evidence is unlikely, we have to reject a lot of samples. For example, if we wanted to estimate P (C| + s) and S = +s is generally very rare, then many of our samples will be rejected. We could instead fix the evidence variables, i.e. when it comes to sample S, just say S = +s. But then our sample distribution is not consistent. We can fix this by weighting each sample by the probability of the evidence (e.g. S = +s) given its parents (e.g. P (+s|Parents)). 
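As a minimal sketch of these sampling ideas in Python: the categorical sampler below implements the uniform-draw mapping from the numbered steps above (and works for the red/green/blue example), while the tiny two-node network C → S and its probability tables are invented purely to illustrate likelihood weighting; they are not from the text:

    import random

    def sample_categorical(dist):
        # Map a uniform draw u in [0, 1) to an outcome via cumulative sub-intervals.
        u, cumulative = random.random(), 0.0
        for outcome, p in dist.items():
            cumulative += p
            if u < cumulative:
                return outcome
        return outcome  # guard against floating-point rounding

    # e.g. sample_categorical({"red": 0.6, "green": 0.1, "blue": 0.3})

    # Illustrative two-node net C -> S with made-up CPTs:
    P_C = {"+c": 0.5, "-c": 0.5}
    P_S_given_C = {"+c": {"+s": 0.1, "-s": 0.9},
                   "-c": {"+s": 0.5, "-s": 0.5}}

    def likelihood_weighting(evidence_s, n=10000):
        # Estimate P(C | S = evidence_s): fix the evidence, sample the rest,
        # and weight each sample by P(evidence | its parents).
        weights = {"+c": 0.0, "-c": 0.0}
        for _ in range(n):
            c = sample_categorical(P_C)          # sample non-evidence variable as usual
            w = P_S_given_C[c][evidence_s]       # weight by P(S = evidence | C = c)
            weights[c] += w
        total = sum(weights.values())
        return {c: w / total for c, w in weights.items()}

    print(likelihood_weighting("+s"))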
Gibbs sampling With likelihood weighting, we consider the evidence only for variables sampled after we fixed the evidence (that is, that come after the evidence node in our walk through the Bayes’ net). Anything we sampled before did not take the evidence into account. It’s possible that what we sample before we get to our evidence is very inconsistent with the evidence, i.e. makes it very unlikely and gives us a very low weight for our sample. With Gibbs sampling, we fix our evidence and then instantiate of all our other variables, x1 , . . . , xn . This instantiation is arbitrary but it must be consistent with the evidence. Then, we sample a new value for one variable at a time, conditioned on the rest, though we keep the evidence fixed. We repeat this many times. If we repeat this infinitely many times, the resulting sample comes from the correct distribution, and it is conditioned on both the upstream (pre-evidence) and downstream (post-evidence) variables. Gibbs sampling is essentially a Markov model (hence it is a Markov Chain Monte Carlo method) in which the stationary distribution is the conditional distribution we are interested in. 9.3 Markov Networks Markov networks are also called Markov random fields. The simplest subclass is pairwise Markov networks. Say we have the following scenario: Simple pairwise Markov network 270 CHAPTER 9. PROBABILISTIC GRAPHICAL MODELS 271 9.3. MARKOV NETWORKS An idea is floating around and when, for example, Alice & Bob are hanging out, they may share the idea - they influence each other. We don’t use a directed graph because the influence flows in both directions. But how do you parametrize an undirected graph? We no longer have a notion of a conditional that is, one variable conditioning another. Well, we can just use factors: Simple pairwise Markov network ϕ1 [A, B] −a, −b 30 −a, +b 5 +a, −b 1 +a, +b 10 These factors are sometimes called affinity functions or compatibility functions or soft constraints. What do these numbers mean? They indicate the “local happiness” of the variables A and B to take a particular joint assignment. Here A and B are “happiest” when −a, −b. We can define factors for the other edges as well: ϕ2 [B, C] −b, −c 100 −b, +c 1 +b, −c 1 +b, +c 100 ϕ3 [C, D] −c, −d CHAPTER 9. PROBABILISTIC GRAPHICAL MODELS 1 271 9.3. MARKOV NETWORKS 272 ϕ3 [C, D] −c, +d 100 +c, −d 100 +c, +d 1 ϕ4 [D, A] −d, −a 100 −d, +a 1 +d, −a 1 +d, +a 100 Then we have: P̃ (A, B, C, D) = ϕ1 (A, B)ϕ2 (B, C)ϕ3 (C, D)ϕ4 (A, D) This isn’t a probability distribution because its numbers aren’t in [0, 1] (hence the tilde over P , which indicates an unnormalized measure). We can normalize it to get a probability distribution: P (A, B, C, D) = 1 P̃ (A, B, C, D) Z Z is known as a partition function. There unfortunately is no natural mapping from the pairwise factors and the marginal probabilities from the distribution they generate. For instance, say we are given the marginal probabilities of PΦ (A, B) (the Φ indicates the probability was computed using a set of factors Φ = {ϕ1 , . . . , ϕn }): 272 A B PΦ (A, B) −a −b 0.13 −a +b 0.69 +a −b 0.14 +a +b 0.04 CHAPTER 9. PROBABILISTIC GRAPHICAL MODELS 273 9.3. MARKOV NETWORKS ϕ1 [A, B] −a, −b 30 −a, +b 5 +a, −b 1 +a, +b 10 The most likely joint assignment is −a, +b, which doesn’t seem to correspond to the factor. This is a result of the other factors in the network. This is unlike Bayesian networks where the nodes were just conditional probabilities. 
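A minimal Python sketch (with 0/1 standing in for the −/+ assignments) of going from the four pairwise factors above to the normalized joint distribution and the marginal PΦ(A, B); it recovers the table given earlier (≈ 0.13, 0.69, 0.14, 0.04):

    from itertools import product

    # Factor values as given above, e.g. phi1[(0, 1)] is phi1(-a, +b).
    phi1 = {(0, 0): 30, (0, 1): 5, (1, 0): 1, (1, 1): 10}      # phi1(A, B)
    phi2 = {(0, 0): 100, (0, 1): 1, (1, 0): 1, (1, 1): 100}    # phi2(B, C)
    phi3 = {(0, 0): 1, (0, 1): 100, (1, 0): 100, (1, 1): 1}    # phi3(C, D)
    phi4 = {(0, 0): 100, (0, 1): 1, (1, 0): 1, (1, 1): 100}    # phi4(D, A)

    # Unnormalized measure P~(A, B, C, D) = phi1(A,B) phi2(B,C) phi3(C,D) phi4(D,A)
    p_tilde = {
        (a, b, c, d): phi1[(a, b)] * phi2[(b, c)] * phi3[(c, d)] * phi4[(d, a)]
        for a, b, c, d in product([0, 1], repeat=4)
    }

    Z = sum(p_tilde.values())                      # partition function
    P = {x: v / Z for x, v in p_tilde.items()}     # normalized joint distribution

    # Marginal P(A, B): sum out C and D.
    marg_AB = {}
    for (a, b, c, d), p in P.items():
        marg_AB[(a, b)] = marg_AB.get((a, b), 0.0) + p
    print({k: round(v, 2) for k, v in marg_AB.items()})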
Formally, a pairwise Markov network is an undirected graph whose nodes are X1 , . . . , Xn and each edge Xi − Xj is associated with a factor (aka potential) ϕij (Xi − Xj ). Pairwise Markov networks cannot represent all of the probability distributions we may be interested in. A pairwise Markov network with n random variables, each with d values, has O(n2 d 2 ) parameters. On the other hand, if we consider a probability distribution over n random variables, each with d values, it has O(d n ) parameters, which is far greater than O(n2 d 2 ). Thus we generalize beyond pairwise Markov networks. 9.3.1 Gibbs distribution A Gibbs distribution is parameterized by a set of general factors Φ = {ϕ1 (D1 ), . . . , ϕk (Dk )} which can have a scope of ≥ 2 variables (whereas pairwise Markov networks were limited to two variable scopes). As a result, this can express any probability distribution because we can just define a factor over all the random variables. We also have: P̃Φ (X1 , . . . , Xn ) = k ∏ ϕi (Di ) i=1 ∑ ZΦ = P̃Φ (X1 , . . . , Xn ) X1 ,...,Xn Where ZΦ is the partition function, i.e. the normalizing constant. Thus we have: PΦ (X1 , . . . , Xn ) = 1 P̃Φ (X1 , . . . , Xn ) ZΦ We can generate an induced Markov network HΦ from a set of factors Φ. For each factor in the set, we connect any variables which are in the same scope. For example, ϕ1 (A, B, C), ϕ2 (B, C, D) leads to: CHAPTER 9. PROBABILISTIC GRAPHICAL MODELS 273 9.3. MARKOV NETWORKS 274 Example induced Markov network So multiple set of factors can induce the same graph. We can go from a set of factors to a graph, but we can’t go the other way. We say a probability distribution P factorizes over a Markov network H if there exists a set of factors Φ such that P = PΦ and h is the induced graph for Φ. We have active trails in Markov networks as well: a trail X1 − · · · − Xn is active given the set of observed variables Z if no Xi is in Z. 9.3.2 Conditional Random Fields A commonly-used variant of Markov networks is conditional random fields (CRFs). This kind of model is used to deal with task-specific prediction, where we have a set of input/observed variables X and a set of target variables Y that we are trying to predict. Using the graphical models we have seen so far is not the best because we don’t want to model P (X, Y ) - we are already given X. Instead, we just want to model P (Y |X). That way we don’t have to worry about how features of X are correlated or independent, and we don’t have to model their distributions. In this scenario, we can use a conditional random field representation: Φ = {ϕ1 (D1 ), . . . , ϕk (Dk )} P̃Φ (X, Y ) = k ∏ ϕi (Di ) i=1 This looks just like a Gibbs distribution. The difference is in the partition function: ZΦ (X) = ∑ P̃Φ (X, Y ) Y So a CRF is parameterized the same as a Gibbs distribution, but it is normalized differently. The end result is: PΦ (Y |X) = 274 1 P̃Φ (X, Y ) ZΦ (X) CHAPTER 9. PROBABILISTIC GRAPHICAL MODELS 275 9.3. MARKOV NETWORKS Which is a family of conditional distributions, one for each possible value of X. In a Markov network, we have the concept of separation, which is like d-separation in Bayesian networks but we drop the “d” because they are not directed. X and Y are separated in H given observed evidence Z if there is no active trail in H (that is, no node along the trail is in Z). 
For example: Markov network separation example We can separate A and E in a few ways: • A and E are separated given B and D • A and E are separated given D • A and E are separated given B and C Like with Bayesian networks, we have a theorem: if P factorizes over H and sepH (X, Y |Z), then P satisfies (X ⊥ Y |Z). We can say the independences induced by the graph H, I(H), is: I(H) = {(X ⊥ Y |Z)|sepH (X, Y |Z)} If P satisfies I(H), we say that H is an I-map (independency map) of P (this is similar to I-maps in the context of Bayesian networks). We can also say that if P factorizes over H, then H is an I-map of P . The converse is also true: for a positive distribution P , if H is an I-map for P , then P factorizes over H. If a graph G is an I-map of P , it does not necessarily need to encode all independences of P , just those that it does encode are in fact in P . How well can we capture a distribution P ’s independences in a graphical model? We can denote all independences that hold in P as: CHAPTER 9. PROBABILISTIC GRAPHICAL MODELS 275 9.3. MARKOV NETWORKS 276 I(P ) = {(X ⊥ Y |Z)|P ⊨ (X ⊥ Y |Z)} We know that if P factorizes over G, then G is an I-map for P : I(G) ⊆ I(P ) The converse doesn’t hold; P may have some independences not in G. We want graphs which encode more independences because they are sparser (less parameters) and more informative. So for sparsity, we want a minimal I-map; that is, an I-map without redundant edges. But it is still not sufficient for capturing I(P ). Ideally, we want a perfect map, which is an I-map such that I(G) = I(P ). Unfortunately, not ever distribution has a perfect map, although sometimes a distribution may have a perfect map as a Markov network and not as a Bayesian network, and vice versa. It is possible that a perfect map for a distribution is not unique; that is, there may be other graphs which model the same set of independence assumptions and thus are also perfect maps. When graphs model the same independence assumptions, we say they are I-equivalent. Most graphs have many I-equivalent variants. 9.3.3 Log-linear models Log-linear models allow us to incorporate local structure into undirected models. In the original representation of unnormalized density, we had: P̃ = ∏ ϕi (Di ) i We turn this into a linear form: P̃ = exp(− ∑ wj fj (Dj )) j Hence the name “log-linear”, because the log is a linear function. Each feature fj has a scope Dj . Different features can have the same scope. We can further write it in the form: P̃ = ∏ exp(−wj fj (Dj )) j 276 CHAPTER 9. PROBABILISTIC GRAPHICAL MODELS 277 9.4. REFERENCES which effectively turns the exp(−wj fj (Dj )) term into a factor with one parameter wj . For example, say we have binary variables X1 and X2 : [ a00 a01 ϕ(X1 , X2 ) = a10 a11 ] We must define the following features using indicator functions (1 if true, else 0): 00 f12 = ⊮{X1 = 0, X2 = 0} 01 f12 = ⊮{X1 = 0, X2 = 1} 10 f12 = ⊮{X1 = 1, X2 = 0} 11 f12 = ⊮{X1 = 1, X2 = 1} So we have the log-linear model: ϕ(X1 , X2 ) = exp(− ∑ wkl fijkl (X1 , X2 )) kl So we can represent any factor as a log-linear model by including the appropriate features. For example, say you want to develop a language model for labeling entities in text. You have target labels Y = {PERSON, LOCATION, . . . } and input words X. You could, for instance, have the following features: f (Yi , Xi ) = ⊮{Yi = PERSON, Xi is capitalized} f (Yi , Xi ) = ⊮{Yi = LOCATION, Xi appears in an atlas} and so on. 9.4 References • • • • Bayesian Reasoning and Machine Learning. David Barber. 
Probabilistic Graphical Models. Daphne Koller. Stanford University/Coursera. MIT 6.034 (Fall 2010): Artificial Intelligence. Patrick H. Winston. MIT. CS188: Artificial Intelligence. Dan Klein, Pieter Abbeel. University of California, Berkeley (edX). • Artificial Intelligence Planning. Dr. Gerhard Wickler, Prof. Austin Tate. The University of Edinburgh (Coursera). 2015. • Intro to Artificial Intelligence. CS271. Peter Norvig, Sebastian Thrun. Udacity. CHAPTER 9. PROBABILISTIC GRAPHICAL MODELS 277 9.4. REFERENCES 278 278 CHAPTER 9. PROBABILISTIC GRAPHICAL MODELS 279 10 Optimization Optimization is the task of finding the arguments to a function which yield its minimum or maximum value. An optimal argument is denoted with an asterisk, e.g. x . In the context of machine learning, we are typically dealing with minimization. If necessary, a maximization problem can be reframed as minimization: to maximize a function f (x), you can instead minimize −f (x). The function we want to optimize is called the objective function. When the particular optimization is minimization, the function may also be referred to as a cost, loss, or error function. Optimization problems can be thought of a topology where you are looking for the global peak (if you are maximizing) or the globally lowest point (if you are minimizing). For simplicity, minimizing will be the assumed goal here, as you are often trying to minimize some error function. Consider a very naive approach: a greedy random algorithm which starts at some position in this topology, then randomly tries moving to a new position and checks if it is better. If it is, it sets that as the current solution. It continues until it has some reason to stop, usually because it has found a minimum. This is a local minimum; that is, it is a minimum relative to its immediately surrounding points, but it is not necessarily the global minimum, which is the minimum of the entire function. Local vs global minima This algorithm is greedy in that it will always prefer a better scoring position, even if it is only marginally better. Thus it can be easy to get CHAPTER 10. OPTIMIZATION 279 280 stuck in local optima - since any step away from it seems worse, even if the global optimum is right around the corner, so to speak. Optimization may be accomplished numerically or analytically. The analytic approach involves computing derivatives and then identifying critical points (e.g. the second derivative test). These provide the exact optima. However, this analytic approach is infeasible for many functions. In such cases we resort to numerical optimization, which involves, in a sense, guessing your way to the optima. The methods covered here are all numerical optimization methods. Within optimization there are certain special cases: 10.0.1 Convex optimization As far as optimization goes, convex optimization is easier to deal with - convex functions have only global minima (any local minimum is a global minimum) without any saddle points. 10.0.2 Constrained optimization In optimization we are typically looking to find the optimum across all points. However, we may only be interested in finding the optimum across a subset S of these points - in this case, we have a constrained optimization problem. The points that are in S are called feasible points. One method for constrained optimization is the Karush-Kuhn-Tucker (KKT) method. It uses the generalized Lagrangian (sometimes called the generalized Lagrange function). 
We must describe S with equations and inequalities; in particular, with m functions gi, called equality constraints, and n functions hj, called inequality constraints, such that S = {x | ∀i, gi(x) = 0 and ∀j, hj(x) ≤ 0}. For each constraint we also have the variables λi, αj, called KKT multipliers. Then we can define the generalized Lagrangian as:

L(x, λ, α) = f(x) + Σi λi gi(x) + Σj αj hj(x)

This reframes the constrained minimization problem as an unconstrained optimization of the generalized Lagrangian. That is, so long as at least one feasible point exists and f(x) cannot be ∞, then

min_x max_λ max_{α, α≥0} L(x, λ, α)

has the same objective function value and set of optimal points as min_{x∈S} f(x). So long as the constraints are satisfied,

max_λ max_{α, α≥0} L(x, λ, α) = f(x)

and if a constraint is violated, then

max_λ max_{α, α≥0} L(x, λ, α) = ∞

10.1 Gradient vs non-gradient methods

Broadly, we can categorize optimization methods into those that use gradients and those that do not.

Non-gradient methods include:

• hill climbing
• simplex/amoeba/Nelder-Mead
• genetic algorithms

Gradient methods include:

• gradient descent
• conjugate gradient
• quasi-Newton

Gradient methods tend to be more efficient, but they are not always possible to use (you don't always have the gradient).

10.2 Gradient Descent

Gradient descent (GD) is perhaps the most common minimizing optimization algorithm in machine learning (for maximizing, its equivalent is gradient ascent).

Say we have a function C(v) which we want to minimize. For simplicity, we will use v ∈ R². An example C(v), the sphere function, is visualized in the accompanying figure. In this example, the global minimum is visually obvious, but most of the time it is not (especially when dealing with far more dimensions). But we can apply the model of a ball rolling down a hill and extend it to any arbitrary n dimensions. The ball will "roll" down to a minimum, though not necessarily the global minimum.

The position the ball is at is a potential solution; here it is some values for v1 and v2. We want to move the ball such that ΔC, the change in C(v) from the ball's previous position to the new position, is negative (i.e. the cost function's output is smaller, since we're minimizing). More formally, ΔC is defined as:

ΔC ≈ (∂C/∂v1)Δv1 + (∂C/∂v2)Δv2

We define the gradient of C, denoted ∇C, to be the vector of partial derivatives (transposed to be a column vector):

∇C ≡ (∂C/∂v1, ∂C/∂v2)^T

So we can rewrite ΔC as:

ΔC ≈ ∇C · Δv

We can choose Δv to make ΔC negative:

Δv = −η∇C

where η is a small, positive parameter (the learning rate), which controls the step size. Finally we have:

ΔC ≈ −η∇C · ∇C = −η∥∇C∥²

We can use this to compute a value for Δv, which is really the change in position for our "ball" to a new position v′:

v → v′ = v − η∇C

And repeat until we hit a global (or local) minimum. This process is known in particular as batch gradient descent because each step is computed over the entire batch of data.

10.2.1 Stochastic gradient descent (SGD)

With batch gradient descent, the cost function is evaluated on all the training inputs for each step. This can be quite slow.
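To make the update rule v → v′ = v − η∇C concrete, here is a minimal batch gradient descent sketch in Python (an illustrative toy, not from the original notes: it assumes numpy, uses the sphere function as the cost, and the choices of η and step count are arbitrary):

import numpy as np

def C(v):
    # sphere function: C(v) = v1^2 + v2^2, minimum at the origin
    return np.sum(v ** 2)

def grad_C(v):
    # analytic gradient of the sphere function
    return 2 * v

def gradient_descent(v, eta=0.1, steps=100):
    for _ in range(steps):
        v = v - eta * grad_C(v)  # v -> v' = v - eta * grad C
    return v

v0 = np.array([3.0, -4.0])
print(gradient_descent(v0))  # approaches [0, 0]

Each step here evaluates the gradient of the full cost, which is exactly the expense the stochastic variant avoids.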
With stochastic gradient descent (SGD), you randomly shuffle your examples and look at only one example for each iteration of gradient descent (sometimes this is called online gradient descent to contrast with minibatch gradient descent, described below). Ultimately it is less direct than batch gradient descent but gets you close to the global minimum - the main advantage is that you’re not iterating over your entire training set for each step, so though its path is more wandering, it ends up taking less time on large datasets. The reason we randomly shuffle examples is to avoid “forgetting”. For instance, say you have time series data where there are somewhat different patterns later in the data than earlier on. If that training data is presented in sequence, the algorithm will “forget” the patterns earlier on in favor of those it encounters later on (since the parameter updates learned from the later-on data will effectively erase the updates from the earlier-on data). In fact, stochastic gradient descent can help with finding the global minimum because instead of computing over a single error surface, you are working with many different error surfaces varying with the example you are current looking at. So it is possible that in one of these surfaces a local minima does not exist or is less pronounced than in others, which make it easier to surpass. There’s another form of stochastic gradient descent called minibatch gradient descent. Here b random examples are used for each iteration, where b is your minibatch size. It is usually in the range of 2-100; a typical choice might be 10 (minibatch gradient descent where b = 1 is just regular SGD). Note that a minibatch size can be too large, resulting in greater time for convergence. But generally it is faster than SGD and has the benefit of aiding in local minima avoidance. Minibatches also need to be properly representative of the overall dataset (e.g. be balanced for classes). When the stochastic variant is used, a 1 b term is sometimes included: η v → v ′ = v − ∇C. b 10.2.2 Epochs vs iterations An important point of clarification: an “epoch” and a training “iteration” are not necessarily the same thing. One training iteration is one step of your optimization algorithm (i.e. one update pass). In the case of something like minibatch gradient descent, one training iteration will only look at one batch of examples. An epoch, on the other hand, consists of enough training iterations to look at all your training examples. CHAPTER 10. OPTIMIZATION 283 10.3. SIMULATED ANNEALING 284 So if you have a total of 1000 training examples and a batch size of 100, one epoch will consist of 10 training iterations. 10.2.3 Learning rates The learning rate η is typically held constant. It can be slowly decreased it over time if you want θ to converge on the global minimum in stochastic gradient descent (otherwise, it just gets close). So for instance, you can divide it by the iteration number plus some constant, but this can be overkill. 10.2.4 Conditioning Conditioning describes how much the output of a function varies with small changes in input. In particular, if we have a function f (x) = A−1 x, A ∈ Rn×n , where A has an eigenvalue decomposition, we can compute its condition number as follows: max | i,j λi | λj Which is the ratio of the magnitude of the largest and smallest eigenvalue; when this is large, we say we have a poorly conditioned matrix since it is overly sensitive to small changes in input. 
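As a small illustration of the condition number (an illustrative sketch, assuming numpy and a matrix with a real eigenvalue decomposition):

import numpy as np

def condition_number(A):
    # ratio of the largest to smallest eigenvalue magnitude
    eigvals = np.linalg.eigvals(A)
    mags = np.abs(eigvals)
    return mags.max() / mags.min()

A = np.array([[1.0, 0.0],
              [0.0, 100.0]])
print(condition_number(A))  # 100.0 -- much more poorly conditioned than the identity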
Poor conditioning has the practical implication of slow convergence. In the context of gradient descent, if the Hessian is poorly conditioned, then gradient descent does not perform as well.

This can be alleviated with Newton's method, where a Taylor series expansion is used to approximate f(x) near some point x0, going up to only the second-order derivatives:

f(x) ≈ f(x0) + (x − x0)^T ∇x f(x0) + (1/2)(x − x0)^T H(f)(x0)(x − x0)

Solving for the critical point gives:

x = x0 − H(f)(x0)^(−1) ∇x f(x0)

(As a reminder, H(f) is the Hessian of f.)

Methods which also use second-order derivatives (i.e. the Hessian) are known as second-order optimization algorithms; those that use only the gradient are called first-order optimization algorithms.

10.3 Simulated Annealing

Simulated annealing is similar to the greedy random approach, but it has some randomness which can "shake" it out of local optima.

Annealing is a process in metalworking where the metal starts at a very high temperature and gradually cools down. Simulated annealing uses a similar process to manage its randomness.

A simulated annealing algorithm starts with a high "temperature" (or "energy") which "cools" down (becomes less extreme) as progress is made.

Like the greedy random approach, the algorithm tries a random move. If the move is better, it is accepted as the new position. If the move is worse, then there is a chance it still may be accepted; the probability of this is based on the current temperature, the current error, and the previous error:

P(e, e′, T) = exp(−(e′ − e)/T)

Each random move, whether accepted or not, is considered an iteration. After each iteration, the temperature is decreased according to a cooling schedule. An example cooling schedule is:

T(k) = Tinit (Tfinal/Tinit)^(k/kmax)

where

• Tinit = the starting temperature
• Tfinal = the minimum/ending temperature
• k = the current iteration
• kmax = the maximum number of iterations

For this particular schedule, you probably don't want to set Tfinal to 0, since then the temperature would rapidly decrease to 0. Set it to something close to 0 instead.

The algorithm terminates when the temperature is at its minimum.

10.4 Nelder-Mead (aka Simplex or Amoeba optimization)

For a problem of n dimensions, create a shape of n + 1 vertices. This shape is a simplex.

One vertex of the simplex is initialized with your best educated guess of the solution vector. That guess could be the output of some other optimization approach, even a previous Nelder-Mead run. If you have nothing to start with, a random vector can be used.

The other n vertices are created by moving in one of the n dimensions by some set amount.

Then at each step of the algorithm, you want to (illustrations are for n = 2, thus 3 vertices):

• Find the worst, second worst, and best scoring vertices
• Reflect the worst vertex to some point p′ through the best side
• If p′ is better, expand by setting the worst vertex to a new point p′′, a bit further than p′ but in the same direction
• If p′ is worse, then contract by setting the worst vertex to a new point p′′, in the same direction as p′ but before crossing the best side

The algorithm terminates when one of the following occurs:

• The maximum number of iterations is reached
• The score is "good enough"
• The vertices have become close enough together

Then the best vertex is considered the solution.
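You rarely need to implement Nelder-Mead by hand; SciPy, for instance, exposes it via scipy.optimize.minimize. A minimal usage sketch (assuming SciPy is installed, and reusing the sphere function as the objective):

from scipy.optimize import minimize

def sphere(v):
    return sum(x ** 2 for x in v)

# the initial simplex is built around this starting guess
result = minimize(sphere, x0=[3.0, -4.0], method='Nelder-Mead')
print(result.x)  # close to [0, 0]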
This optimization method is very sensitive to how it is initialized; whether or not a good solution is found depends a great deal on its starting points. 10.5 Particle Swarm Optimization Particle swarm optimization is similar to Nelder-Mead, but instead of three points, many more points are used. These points are called “particles”. 286 CHAPTER 10. OPTIMIZATION 287 10.6. EVOLUTIONARY ALGORITHMS Each particle has a position (a potential solution) and a velocity which indicates where the particle moves to in the next iteration. Particles also keep track of their current error to the training examples and its best position so far. Globally, we also track the best position overall and the lowest error overall. The velocity for each particle is computed according to: • it’s inertia (i.e. the current direction it is moving in) • it’s historic best position (i.e. the best position it’s found so far) • the global best position The influence of these components are: • inertia weight • cognitive weight (for historic best position) • social weight (for global best position) These weights are parameters that must be tuned, but this method is quite robust to them (that is, they are not sensitive to these changes so you don’t have to worry too much about getting them just right). More particles are better, of course, but more intensive. You can specify the number of epochs to run. You can also incorporate a death-birth cycle in which low-performing particles (those that seem to be stuck, for instance) get destroyed and a new randomly-placed particle is initialized in its place. 10.6 Evolutionary Algorithms Evolutionary algorithms are a type of algorithm which uses concepts from evolution - e.g. individuals, populations, fitness, reproduction, mutation - to search a solution space. 10.6.1 Genetic Algorithms Genetic algorithms are the most common class of evolutionary algorithms. • You have a population of “chromosomes” (e.g. possible solutions or parameters, which are also called “object variables”). These chromosomes are interchangeably referred to as “individuals” • There may be some mutation in the chromosomes (e.g. with binary chromosomes, sometimes 0s become 1s and vice versa or with continuous values, changes happen according to some step size) • Parents have children, in which their chromosomes crossover - the front part of one chromosome combines with the back part of another. This is also called recombination. CHAPTER 10. OPTIMIZATION 287 10.6. EVOLUTIONARY ALGORITHMS 288 • The genotype (the chromosome composition) is expressed as some phenotype (i.e. some genetically-determined properties) in some individuals • Then each of these individuals has some fitness value resulting from their phenotypes • These fitnesses are turned into some probability of survival (this selection pressure is what pushes the system towards an optimal individual) • Then the individuals are selected randomly based on their survival probabilities • These individuals form the new chromosome population for the next generation Each of these steps requires some decisions by the implementer. For instance, how do you translate a fitness score into a survival probability? Well, the simplest way is: fi Pi = ∑ i fi Where fi is the fitness of some individual i . However, depending on how you calculate fitness, this may not be appropriate. You could alternatively use a ranking method, in which you just look at the relative fitness rankings and not their actual values. 
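Before continuing with the ranking method, here is a small sketch of fitness-proportionate ("roulette-wheel") selection based on the Pi = fi / Σi fi rule above (illustrative only; the helper name and toy population are made up):

import random

def select(population, fitnesses):
    # survival probability proportional to fitness: P_i = f_i / sum(f)
    total = sum(fitnesses)
    probs = [f / total for f in fitnesses]
    # random.choices samples with replacement according to the given weights
    return random.choices(population, weights=probs, k=len(population))

pop = ['a', 'b', 'c', 'd']
fit = [10.0, 5.0, 3.0, 1.0]
print(select(pop, fit))

Returning to the ranking method: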
So the most fit individual is most likely to survive, the second fit is a bit less likely, and so on. You pick a probability constant PC , and the survival of the top-ranked individual is PC , that of the second is (1 − PC )PC , that of the third is, (1 − PC )2 PC , and so on. So Pn−1 = (1 − PC )n−2 PC and Pn = (1 − PC )n−1 . If you get stuck on local maxima you can try increasing the step size. When your populations start to get close to the desired value, you can decrease the step size so the changes are less sporadic (i.e. use simulated annealing). When selecting a new population, you can incorporate a diversity rank in addition to a fitness rank. This diversity ranking tries to maximize the diversity of the new population. You select one individual for the new population, and then as you select your next individual, you try and find one which is distinct from the already selected individuals. The general algorithm is as follows: 1. 2. 3. 4. randomly initialize a population of µ individuals compute fitness scores for each individual randomly choose µ2 pairs of parents, weighted by fitness (see above for an example), to reproduce with probability Pc (a hyperparameter, e.g. 0.8), perform crossover on the parents to form two children, which replaces the old population (you may also choose the keep some of the old population, rather than having two children per pair of parents, as per above - there is no universal genetic algorithm; you typically need to adjust it for a particular task) 5. randomly apply mutation to some of the population with probability Pm (a hyperparameter, e.g. 0.01) 288 CHAPTER 10. OPTIMIZATION 289 10.6. EVOLUTIONARY ALGORITHMS 6. repeat The specifics of how crossover and mutation work depend on your particular problem. 10.6.2 Evolution Strategies With evolution strategies, each individual includes not only object variables but also “strategy” parameters which, which are variances and covariances (optional) of the object variables. These strategy parameters control mutation. From each population of size µ, λ offspring are generated (e.g. λ = 7µ). All of the object variables, as with genetic algorithms, are derived from the same parents, though each strategy parameter may be derived from a different pair of parents, selected at random (without any selection pressure). However, the best approach is to copy an object variable from one of the parents and set each strategy parameter to be the mean of its parents’ corresponding strategy parameter. Then mutation mutates both the strategy parameters and the object variables, starting with the strategy parameters. The mutation of the strategy parameters is called self-adaptation. The object variables are mutated according to the probability distribution specified by the (mutated) strategy parameters. There are two approaches to selection for evolutionary strategy: • (µ, λ) selection just involves taking the best µ individuals from the λ offspring. • (µ + λ) selection involves selecting the best µ individuals from the union of the λ offspring and the µ parents. (µ, λ) selection is recommended because (µ + λ) selection can interfere with self-adaptation. 10.6.3 Evolutionary Programming Evolutionary programming does not include recombination; changes to individuals rely solely on mutation. Mutations are based on a Gaussian distribution, where the standard deviation is the square root of a linear transform (parametrized according to the user) of the parent’s fitness score. Each of the µ parents yields one offspring. 
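A minimal sketch of that mutation step (illustrative only: it assumes numpy and reduces the user-parametrized linear transform of the parent's fitness to a single scaling constant beta):

import numpy as np

def ep_mutate(parent, parent_fitness, beta=0.1):
    # standard deviation is the square root of a (here, trivial) linear
    # transform of the parent's fitness score
    sigma = np.sqrt(beta * parent_fitness)
    # each parent yields one offspring via a Gaussian perturbation
    return parent + np.random.normal(0.0, sigma, size=parent.shape)

parent = np.array([1.0, 2.0])
print(ep_mutate(parent, parent_fitness=4.0))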
Note that in meta-evolutionary programming, the variances are also part of the individual, i.e. subject to mutation (this is self-adaptation). The next generation is selected from the union of the parents and the offspring via a process called q-tournament selection. Each candidate is paired with q (a hyperparameter) randomly selected opponents and receives a score which is the number of these q opponents that have a worse fitness score than the candidate. The top-scoring µ candidates are kept as the new generation. Increasing q causes the selection pressure to be both higher and more deterministic. CHAPTER 10. OPTIMIZATION 289 10.7. DERIVATIVE-FREE OPTIMIZATION 290 10.7 Derivative-Free Optimization Note that Nelder-Mead, Particle Swarm, and genetic algorithm optimization methods are sometimes known as “derivative-free” because they do not involve computing derivatives in order to optimize. 10.8 Hessian optimization Also known as the “Hessian technique”. Given a function f (X), where X = [x1 , x2 , . . . , xn ], we can approximate f near a point X using Taylor’s theorem: f (X + ∆X) = f (X) + ∑ ∂f j ∂Xj ∆Xj + 1∑ ∂ 2f ∆Xj ∆Xk + . . . 2 jk ∂Xj ∂Xk 1 = f (X) + ∇f · ∆X + ∆X T H∆X + . . . 2 Where H is the Hessian matrix (the jkth entry is ∂2f ∂Xj ∂Xk ). We can approximate f by dropping the higher-order terms (the ellipsis, . . . , terms): 1 f (X + ∆X) ≈ f (X) + ∇f · ∆X + ∆X T H∆X 2 Assuming the Hessian matrix is positive definite, we can show using calculus that the right-hand expression can be minimized to: ∆X = −H −1 ∇f If f is a cost function C and X are the parameters θ to the cost function, we can minimize the cost by updating θ according to the following algorithm (where η is the learning rate): • • • • Initialize θ. Update θ to θ′ = θ − ηH −1 ∇C, computing H and ∇C at θ. Update θ′ to θ′′ = θ − ηH ′−1 ∇′ C, computing H ′ and ∇′ C at θ′ . And so on. The second derivatives in the Hessian tells us how the gradient is changing, which provides some advantages (such as convergence speed) over traditional gradient descent. The Hessian matrix has n2 elements, where n is the number of parameters, so it can be extremely large. In practice, computing the Hessian can be quite difficult. 290 CHAPTER 10. OPTIMIZATION 291 10.9. ADVANCED OPTIMIZATION ALGORITHMS 10.9 Advanced optimization algorithms There are other advanced optimization algorithms, such as: • Conjugate gradient • BFGS • L-BFGS These shouldn’t be implemented on your own since they require an advanced understanding of numerical computing, even just to understand what they’re doing. They are more complex, but (in the context of machine learning) there’s no need to manually pick a learning rate α and they are often faster than gradient descent. So you can take advantage of them via some library which has them implemented (though some implementations are better than others). 10.10 • • • • • • • References Swarm Intelligence Optimization using Python. James McCaffrey. PyData 2015. Machine Learning. 2014. Andrew Ng. Stanford University/Coursera. Neural Networks and Deep Learning, Michael A Nielsen. Determination Press, 2015. Deep Learning. Yoshua Bengio, Ian Goodfellow, Aaron Courville. Genetic and Evolutionary Algorithms. Gareth Jones. Advanced Topics: Reinforcement Learning, Lecture 7. David Silver. An Interactive Tutorial on Numerical Optimization. Ben Frederickson. CHAPTER 10. OPTIMIZATION 291 10.10. REFERENCES 292 292 CHAPTER 10. 
OPTIMIZATION 293 11 Algorithms We measure the performance of algorithms in terms of time complexity (how many steps are taken, as a function of input size) and space complexity (how much memory is used, as a function of input size). There may be trade-offs for better time and space complexity - for instance, there may be situations where speed is more important than correctness (a correct algorithm is one that always terminates in the correct answer), in which case we may opt for a faster algorithm that provides only an approximate solution or a local optimum. There may be trade-offs between time and space complexity as well - for instance, we might have the choice between a fast algorithm that uses a lot of memory, or a slower one that has lower memory requirements. It’s important to be able to evaluate and compare algorithms in order to determine which is most appropriate for a particular problem. 11.1 Algorithm design paradigms Algorithms generally fall into a few categories of “design paradigms”: • decrease-and-conquer: recursively reduce the problem into a smaller and smaller problem until it can no longer be reduced, then solve that problem. • divide-and-conquer: break up the problem into subproblems which can be recursively solved, then combine the results of the subproblems in some way to form the solution for the original problem. • greedy: take the solution that looks best at the moment; i.e. always go for the local optimum, in the hope that it may the global optimum, even though it may not be. • dynamic programming: the divide-and-conquer paradigm can lead to redundant computations if the same subproblem appears multiple times; dynamic programming involves memoizing CHAPTER 11. ALGORITHMS 293 11.2. ALGORITHMIC ANALYSIS 294 (“remembering”, i.e. storing in memory) results of computations so that when an identical subproblem is encountered, the solution can just be retrieved from memory. • brute force: try everything until you get a solution 11.2 Algorithmic Analysis The number of operations for an algorithm is typically a function of its inputs, typically denoted n. There are a few kinds of algorithmic analysis: • worst-case analysis: the upper bound running time that is true for any arbitrary input of length n • average-case analysis: assuming that all input are equally likely, the average running time • benchmarks: runtime on an agreed-upon set of “typical” inputs Average-case analysis and benchmarks requires some domain knowledge about what inputs to expect. When you want to do a more “general purpose” analysis, worst-case analysis is preferable. When comparing the performance of algorithms: • we measure the algorithm in terms of the number of steps (operations) required, which provides a consistent measurement across machines (otherwise, some machines are more powerful and naturally perform faster) • algorithms may perform differently depending on characteristics of input data, e.g. if it is already partially sorted and so on. So we look at the worst case scenario to compensate for this. • performance can change depending on the size of the input as well (for example, an algorithm A which appears slower on a smaller dataset than B may in fact be faster than B on larger datasets), so we look at algorithmic performance as a function of input size, and look at the asymptotic performance as the problem size increases. We focus on asymptotic analysis; that is, we focus on performance for large input sizes. 
Note, however, that algorithms which are inefficient for large n may be better for small n when compared to algorithms that perform well for large n. For example, insertion sort has an upper-bound runtime of n²/2, which, for small n (e.g. n < 90), is better than merge sort. This is because constant factors are more meaningful with small inputs. Anyways, with small n, it often doesn't really matter what algorithm you use; since the input is so small, there are unlikely to be significant performance differences, so analysis of small input sizes is not very valuable (or interesting).

Thus we define a "fast" algorithm as one in which the worst-case running time grows slowly with input size.

11.2.1 Asymptotic Analysis

With asymptotic analysis, we suppress constant factors and lower-order terms, since they don't matter much for large inputs, and because the constant factors can vary quite a bit depending on the architecture, compiler, programmer, etc.

For example, if we have an algorithm with the upper-bound runtime of 6n log₂ n + 6n, we would rewrite it as just n log n (note that log typically implies log₂). Then we say the running time is O(n log n), said "big-oh of n log n"; the O implies that we have dropped the constant factors and lower-order terms.

Generally, we categorize big-oh performance by order of growth, each of which defines a set of algorithms that grow equivalently:

| Order of growth | Name                     |
|-----------------|--------------------------|
| O(1)            | constant                 |
| O(log_b n)      | logarithmic (for any b)  |
| O(n)            | linear                   |
| O(n log_b n)    | n log n                  |
| O(n²)           | quadratic                |
| O(n³)           | cubic                    |
| O(c^n)          | exponential (for any c)  |

The order of growth is determined by the leading term, that is, the term with the highest exponent. Note that for log, the base doesn't matter because changing it is equivalent to multiplying by a constant, which we ignore anyways.

11.2.2 Loop examples

Consider the following algorithm for finding an element in an array:

def func(i, arr):
    for el in arr:
        if el == i:
            return True
    return False

This has a running time of O(n) since, in the worst case, it checks every item.

Now consider the following:

def func2(i, arr):
    return func(i, arr), func(i, arr)

This still has a running time of O(n); although it has twice the number of operations (i.e. ∼2n operations total), we drop the constant factor 2.

Now consider the following algorithm for checking if two arrays have a common element:

def func3(arr1, arr2):
    for el1 in arr1:
        for el2 in arr2:
            if el1 == el2:
                return True
    return False

This has a runtime of O(n²), which is called a quadratic time algorithm.

The following algorithm for checking duplicates in an array also has a runtime of O(n²), again due to dropping constant factors (note that the inner loop starts after index i so that an element is not compared against itself):

def func4(arr):
    for i, el1 in enumerate(arr):
        for el2 in arr[i+1:]:
            if el1 == el2:
                return True
    return False

11.2.3 Big-Oh formal definition

Say we have a function T(n), n ≥ 0, which is usually the worst-case running time of an algorithm.

We say that T(n) = O(f(n)) if and only if there exist constants c, n₀ > 0 such that T(n) ≤ c·f(n) for all n ≥ n₀. That is, we can multiply f(n) by some constant c such that there is some value n₀, after which T(n) is always below c·f(n).

For example: we demonstrated that 6n log₂ n + 6n is the worst-case running time for merge sort. For merge sort, this is T(n). We described merge sort's running time in big-oh notation with O(n log n).
This is appropriate because there exists some constant c we can multiply n log n by such that, after some input size n0 , cf (n) is always larger than T (n). In this sense, n0 defines a sufficiently large input. As a simple example, we can prove that 2n+10 = O(2n ). So the inequality is: 2n+10 ≤ c2n 296 CHAPTER 11. ALGORITHMS 297 11.3. THE DIVIDE-AND-CONQUER PARADIGM We can re-write this: 210 2n ≤ c2n Then it’s clear that if we set c = 210 , this inequality holds, and it happens to hold for all n, so we can just set n0 = 1. Thus 2n+10 = O(2n ) is in fact true. 11.2.4 Big-Omega notation T (n) = Ω(f (n)) if and only if there exist constants c, n0 > 0 such that T (n) ≥ cf (n) for all n ≥ n0 . That is, we can multiply f (n) by some constant c such that there is some value n0 , after which T (n) is always above cf (n). 11.2.5 Big-Theta notation T (n) = Θ(f (n)) if and only if T (n) = O(f (n)) and T (n) = Ω(f (n)). That is, T (n) eventually stays sandwiched between c1 f (n) and c2 f (n) after some value n0 . 11.2.6 Little-Oh notation Stricter than big-oh, in that this must be true for all positive constants. T (n) = o(f (n)) if and only if for all constants c > 0, there exists a constant n0 such that T (n) ≤ cf (n) for all n ≥ n0 . 11.3 The divide-and-conquer paradigm This consists of: • Divide the problem into smaller subproblems. You do not have to literally divide the problem in the algorithm’s implementation; this may just be a conceptual step. • Compute the subproblems using recursion. • Combine the subproblem solutions into the problem for the original problem. 11.3.1 The Master Method/Theorem The Master Method provides an easy way of computing the upper-bound runtime for a divide-andconquer algorithm, so long as it satisfies the assumptions: Assume that subproblems have equal size and the recurrence has the format: CHAPTER 11. ALGORITHMS 297 11.3. THE DIVIDE-AND-CONQUER PARADIGM 298 • Base case: T (n) ≤ c, where c is a constant, for all sufficiently small n. • For all larger n: T (n) ≤ aT ( bn ) + O(nd ), where a is the number of recursive calls, a ≥ 1, each subproblem has input size nb , b > 1, and outside each recursive call, we do some additional O(nd ) of work (e.g. a combine step), parameterized by d, d ≥ 0. O(nd logn ) T (n) = O(nd ) O(n logb a ) if a = bd if a < bd if a > bd Note that the logarithm base does not matter in the first case, since it just changes the leading constant (which doesn’t matter in Big-Oh notation), whereas in the last case the base does matter, because it has a more significant effect. To use the master method, we just need to determine a, b, d. For example: With merge sort, there are two recursive calls (thus a = 2), and the input size to each recursive call is half the original input (thus b = 2), and the combine step only involves the merge operation (d = 1). Thus, working out the master method inequality, we get 2 = 21 , i.e. a = bd , thus: T (n) = O(n logn ) Proof of the Master Method (Carrying over the previously-stated assumptions) For simplicity, we’ll also assume that n is a power of b, but this proof holds in the general case as well. In the recursion tree, at each level j = 0, 1, 2, . . . , logb n, there are aj subproblems, each of size n/bj . At a level j, the total work, not including the work in recursive calls, is: ≤ aj c( n d ) bj Note that the a and b terms are dependent on the level j, but the c, n, d terms are not. We can rearrange the expression to separate those terms: ≤ cnd ( 298 a j ) bd CHAPTER 11. ALGORITHMS 299 11.4. 
DATA STRUCTURES To get the total work, we can sum over all the levels: ≤ cn d log bn ∑ j=0 ( a j ) bd We can think of a as the rate of subproblem proliferation (i.e. how the number of subproblems grow with level depth) and bd as the rate of work shrinkage per subproblem. There are three possible scenarios, corresponding to the master method’s three cases: • If a < bd , then the amount of work decreases with the recursion level. • If a > bd , then the amount of work increases with the recursion level. • If a = bd , then the amount of work stays the same with the recursion level. ∑ logb n a j If a = bd , then the summation term in the total work expression, j=0 ( bd ) , simply becomes d logb n + 1, thus the total work upper bound in that case is just cn (logb n + 1), which is just O(nd log n). A geometric sum for r ̸= 1: 1 + r + r2 + r3 + · · · + rk Can be expressed in the following closed form: r k+1 − 1 r −1 1 If r < 1 is constant, then this is ≤ 1−r (i.e. it is some constant independent of k). If r > 1 is 1 1 k constant, then this is ≤ r (1 + r −1 ), where the last term (1 + r −1 ) is a constant independent of k. We can bring this back to the master method by setting r = a bd . If a < bd , then the summation term in the total work expression becomes a constant (as demonstrated with the geometric sum); thus in Big-Oh, that summation term drops, and we are left with O(nd ). If a > bd , then the summation term in the total work becomes a constant times r k , where k = logb n, i.e. the summation term becomes a constant times ( bad )logb n . So we get O(nd ( bad )logb n ), which simplifies to O(alogb n ), which ends up being the number of leavens in the recursion tree. This is equivalent to O(nlogb a ). 11.4 Data Structures Data structures are particular ways of organizing data that support certain operations. Structures have strengths and weaknesses in what kinds of operations they can perform, and how well they can perform them. CHAPTER 11. ALGORITHMS 299 11.4. DATA STRUCTURES 300 For example: lists, stacks, queues, heaps, search trees, hashtables, bloom filters, etc. Different data structures are appropriate for different operations (and thus different kinds of problems). 11.4.1 Heaps A heap (sometimes called a priority queue) is a container for objects that have keys; these keys are comparable (e.g. we can say that one key is bigger than another). Supported operations: • insert: add a new object to the heap, runtime O(log n) • extract-min: remove the object with the minimum key value (ties broken arbitrarily), runtime O(log n) Alternatively, there are max-heaps which return the maximum key value (this can be emulated by a heap by negating key values such that the max becomes the min, etc). Sometimes there are additional operations supported: • heapify: initialize a heap in linear time (i.e. O(n) time, faster than inserting them one-by-one) • delete: delete an arbitrary element from the middle of the heap in O(log n) time For example, you can have a heap where events are your objects and their keys are a scheduled time to occur. Thus when you extract-min you always get the next event scheduled to occur. 11.4.2 Balanced Binary Search Tree A balanced binary search tree can be thought of as a dynamic sorted array (i.e. a sorted array which supports insert and delete operations). First, consider sorted arrays. 
They support the following operations: • • • • search: binary search, runtime O(log n) select: select an element by index, runtime O(1) min and max: return first and last element of the array (respectively), runtime O(1) predecessor and successor: return next smallest and next largest element of the array (respectively), runtime O(1) • rank: the number of elements less than or equal to a given value, runtime O(log n) (search for the given value and return the position) • output: output elements in sorted order, runtime O(n) (since they are already sorted) With sorted arrays, insertion and deletions have O(n) runtime, which is too slow. If you want more logarithmic-time insertions and deletions, we can use a balanced binary search tree. This supports the same operations as sorted arrays (though some are slower) in addition to faster insertions and deletions: 300 CHAPTER 11. ALGORITHMS 301 • • • • • • • • 11.4. DATA STRUCTURES search: runtime O(log n) select: runtime O(log n) (slower than sorted arrays) min and max: runtime O(log n) (slower than sorted arrays) predecessor and successor: runtime O(log n) (slower than sorted arrays) rank: runtime O(log n) output: runtime O(n) insert: runtime O(log n) delete: runtime O(log n) To understand how balanced binary search trees, first consider binary search trees. Binary Search Tree A binary search tree (BST) is a data structure for efficient searching. The keys that are stored are the nodes of the tree. Each node has three pointers: one to its parent, one to its left child, and one to its right child. These pointers can be null (e.g. the root node has no parent). The search tree property asserts that for every node, all the keys stored in its left subtree should be less than its key, and all the keys in its right subtree should be greater than its key. You can also have a convention for handling equal keys (e.g. just put it on the left or the right subtree). This search tree property is what makes it very easy to search for particular values. Note that there are many possible binary search trees for a given set of keys. The same set of keys could be arranged as a very deep and narrow BST, or as a very shallow and wide one. The worst case is a depth of about n, which is more of a chain than a tree; the best case is a depth of about log2 n, which is perfectly balanced. This search tree property also makes insertion simple. You search for the key to be inserted, which will fail since the key is not in the tree yet, and you get a null pointer - you just assign that pointer to point to the new key. In the case of duplicates, you insert it according to whatever convention you decided on (as mentioned previously). This insert method maintains the search tree property. Search and insert performance are dependent on the depth of the tree, so at worst the runtime is O(height). The min and max operations are simple: go down the leftmost branch for the min key and go down the rightmost branch for the max key. The predecessor operation is a bit more complicated. First you search for the key in question. Then, if the key’s node has a left subtree, just take the max of that subtree. However, if the key does not have a left subtree, move up the tree through its parent and ancestors until you find a node with a key less than the key in question. (The successor operation is accomplished in a similar way.) The deletion operation is tricky. First we must search for the key we want to delete. Then there are three possibilities: CHAPTER 11. ALGORITHMS 301 11.4. 
DATA STRUCTURES 302 • the node has no children, so we can just delete it and be done • the node has one child; we can just delete the node and replace it with its child • the node has two children; we first compute the predecessor of the node, then swap it with the node, then delete the node For the select and rank operations, we can augment our search tree by including additional information at each node: the size of its subtree, including itself (i.e. number of descendants + 1). Augmenting the data structure in this way does add some overhead, e.g. we have to maintain the data/keep it updated whenever we modify the tree. For select, we want to find the i th value of the data structure. Starting at a node x, say a is the size of its left subtree. If a = i − 1, return x. If a ≥ i , recursively select the i th value of the left subtree. If a < i − 1, recursively select the (i − a − 1)th value of the right subtree. Balanced Binary Search Trees (Red-Black Trees) Balanced binary search trees have the “best” depth of about log2 n. There are different kinds of balanced binary search trees (which are all quite similar), here we will talk about red-black trees (other kinds are AVL trees, splaytrees, B trees, etc). Red-Black trees maintain some invariants/constraints which are what guarantee that the tree is balanced: 1. 2. 3. 4. each node stores an additional bit indicating if it is a “red” or “black” node the root is always black never allow two reds in a row (e.g. all of a red node’s children are black) every path from the root node to a null pointer (e.g. an unsuccessful search) must go through the same number of black nodes Consider the following: for a binary search tree, if every root-null path has ≥ k nodes, then the tree includes at the top a perfectly balanced search tree of depth k − 1. Thus there must be at least 2k − 1 nodes in the tree, i.e. n ≥ 2k − 1. We can restate this as k ≤ log2 (n + 1) In a red-black tree, there is a root-null path with at most log2 (n + 1) black nodes (e.g. it can have a root-null path composed of only black nodes). The fourth constraint on red-black trees means that every root-null path has ≤ log2 (n + 1) black nodes. The third constraint means that we can never have more red nodes than black nodes (because the red nodes can never come one after the other) in a path. So at most a root-null path will have ≤ 2log2 (n + 1), which gives us a balanced tree. 11.4.3 Hash Tables Hash tables (also called dictionaries) allow us to maintain a (possibly evolving) set of stuff. The core operations include: 302 CHAPTER 11. ALGORITHMS 303 11.4. DATA STRUCTURES • insert using a key • delete using a key • lookup using a key When implemented properly, and on non-pathological data, these operations all run in O(1) time. Hash tables do not maintain any ordering. Basically, hash tables use some hash function to produce a hash for an object (some number); this hash is the “address” of the object in the hash table. More specifically, we have a hash function which gives us a value in some range [0, n]; we have an array of length n, so the hash function tells us and what index to place some object. There is a chance of collisions in which two different objects produce the same hash. There are two main solutions for resolving collisions: • (separate) chaining: if there is a collision, store the objects together at that index as a list • open addressing: here, a hash function specifies a sequence (called a probe sequence) instead of a single value. 
Try the first value, if its occupied, try the next, and so on. • one strategy, linear probing, just has you try the hash value + 1 and keep incrementing by one until an empty bucket is found • another is double hashing, in which you have two hash functions, you look at the first hash, if occupied, offset by the second hash until you find an empty bucket Each is more appropriate in different situations. The performance of a hash table depends a lot on the particular hash function. Ideally, we want a hash function that: • has good performance (i.e. low collisions) • should be easy to store • should be fast to evaluate (constant time) Designing hash functions is as much an art as it is a science. They are quite difficult to design. A hash table has a load factor (sometimes just called load), denoted α = num. objects in hash table num. buckets in hash table . For hash table operations to run constant time, it is necessary that α = O(1). Ideally, it is less than 1, especially with open addressing. So for good performance, we need to control load. For example, if α passes some threshold, e.g. 0.75, then we may want to expand the hash table to lower the load. Every hash function has a “pathological” data set which it performs poorly on. There is no hash function which is guaranteed to spread out every data set evenly (i.e. have low collisions on any arbitrary data set). You can often reverse engineer this pathological data set by analyzing the hash function. CHAPTER 11. ALGORITHMS 303 11.5. P VS NP 304 However, for some hash functions, it is “infeasible” to figure out its pathological data set (as is the case with cryptographic hash functions). One approach is to define a family of hash functions, rather than just one, and randomly choose a hash function to use at runtime. This has the property that, on average, you do well across all datasets. 11.4.4 Bloom Filters Bloom filters are a variant on hash tables - they are more space efficient, but they allow for some errors (that is, there’s a chance of a false positive). In some contexts, this is tolerable. They are more space efficient because they do not actually store the objects themselves. They are more commonly used to keep track of what objects have been seen so far. They typically do not support deletions (there are variants that do incorporate deletions but they are more complicated). There is also a small chance that it will say that it’s seen an object that it hasn’t (i.e. false positives). Like a hash table, a bloom filter consists of an array A, but each entry in the array is just one bit. n Say we have a set of objects S and the total number of bits n - a bloom filter will use only |S| bits per object in S. We also have k hash functions h1 , . . . , hk (usually k is small). The insert operation is defined: hash_funcs = [...] for i in range(k): A[h[i](input)] = 1 That is, we just set the values of those bits to 1. Thus lookup is just to check that all the corresponding bits for an object are 1. We can’t have any false negatives because those bits will not have been set and bits are never reset back to zero. False positives are possible, however, because some other objects may have in aggregate set the bits corresponding to another object. 11.5 P vs NP Consider the following problem: 7 * 13 = ? 304 CHAPTER 11. ALGORITHMS 305 11.5. P VS NP This is solved very quickly by a computer (it gives 91). Now consider the following factoring problem : ? * ? 
= 91 This a bit more difficult for a computer to solve, though it will yield the correct answers (7 and 13). If we consider extremely large numbers, a computer can still very quickly compute their product. But, given a product, it will take a computer a very, very long time to compute their factors. In fact, modern cryptography is based on the fact that computers are not good at finding factors for a number (in particular, prime factors). This is because computers basically have to use brute force search to identify a factor; with very large numbers, this search space is enormous (it grows exponentially). However, once we find a possible solution, it is easy to check that we are correct (e.g. just multiply the factors and compare the product). There are many problems which are characterized in this way - they require brute force search to identify the exact answer (there are often faster ways of getting approximate answers), but once an answer is found, it can be easily checked if it is correct. There are other problems, such as multiplication, where we can easily “jump” directly to the correct exact answer. For the problems that require search, it is not known whether or not there is also a method that can “jump” to the correct answer. Consider the “needle in the haystack” analogy. We could go through each piece of hay until we find the needle (brute force search). Or we could use a magnet to pull it out immediately. The question is open: does this magnet exist for problems like factorization? Problems which we can quickly solve, like multiplication, are in a family of problems called “P”, which stands for “polynomial time” (referring to the relationship between the number of inputs and how computation time increases). Problems which can be quickly verified to be correct are in a family of problems called “NP”, which stands for “nondeterministic polynomial time”. P is a subset of NP, since their answers are quickly verified, but NP also includes the aforementioned search problems. Thus, a major question is whether or not P = NP, i.e. we have the “magnet” for P problems, but is there also one for the rest of the NP problems? Or is searching the only option? 11.5.1 NP-hard Problems which are NP-hard are at least as hard as NP problems, so this includes problems which may not even be in NP. CHAPTER 11. ALGORITHMS 305 11.6. REFERENCES 11.5.2 306 NP-completeness There are some NP problems which all NP problems can be reduced to. Such an NP problem is called NP-complete. For example, any NP problem can be reduced to a clique problem (e.g. finding a clique of some arbitrary size in a graph); thus the clique problem is NP-complete. Any other NP problem can be reduced to a clique problem, and if we can find a way of solving the clique problem quickly, we can also all of those related problems quickly as well. To show that a problem is NP complete, you must: • Show that it is in NP: that is, show that there is a polynomial time algorithm that can verify whether or not an answer is correct • Show that it is NP-hard: that is, reduce some known NP-complete problem to your problem in polynomial time. 11.6 References • • • • 306 Algorithms: Design and Analysis, Part 1. Tim Roughgarden. Stanford/Coursera. Think Complexity. Version 1.2.3. Allen B. Downey. 2012. Beyond Computation: The P vs NP Problem. Michael Sipser, MIT. Tuesday, October 3, 2006. Algorithmic Puzzles. Anany Levitin, Maria Levitin. 2011. CHAPTER 11. 
ALGORITHMS 307 Part II Machine Learning 307 309 12 Overview 12.1 Representation vs Learning: • Representation: whether or not a function can be simulated by the model; i.e. is the model capable of representing a given function? • Learning: whether or not their exists an algorithm with which the weights can be adjusted to represent a particular function 12.2 Types of learning • supervised learning - the learning algorithm is provided with pre-labeled training examples to learn from. • unsupervised learning - the learning algorithm is provided with unlabeled examples. Generally, unsupervised learning is used to uncover some structure of or pattern in the data. • semi-supervised learning - the learning algorithm is provided with a mixture of labeled and unlabeled data. • active learning - similar to semi-supervised learning, but the algorithm can “ask” for extra labeled data based on what it needs to improve on. • reinforcement learning - actions are taken and rewarded or penalized in some way and the goal is maximizing lifetime/long-term reward (or minimizing lifetime/long-term penalty). 12.3 References • Neural Computing: Theory and Practice (1989). Philip D. Wasserman CHAPTER 12. OVERVIEW 309 12.3. REFERENCES 310 310 CHAPTER 12. OVERVIEW 311 13 Supervised Learning In supervised learning, the learning algorithm is provided some pre-labeled examples (a training set) to learn from. In regression problems, you try to predict some continuous valued output (i.e. a real number). In classification problems, you try to predict some discrete valued output (e.g. categories). Typical notation: • • • • • m = number training of examples x’s = input variables or features y ’s = output variables or the “target” variable (x (i) , y (i) ) = the i th training example h = the hypothesis, that is, the function that the learning algorithm learns, taking x’s as input and outputting y ’s The typical process is: • • • • Feed training set data into the learning algorithm The learning algorithm learns the hypothesis h Input new data into h Get output from h The hypothesis can thought of as the model that you try to learn for a particular task. You then use this model on new inputs, e.g. to make predictions - generalization is how the model performs on new examples; this is most important in machine learning. CHAPTER 13. SUPERVISED LEARNING 311 13.1. BASIC CONCEPTS 312 13.1 Basic concepts • Capacity: the flexibility of a model - that is, the variety of functions it can fit. – Representational capacity - the functions which the model can learn – Effective capacity - in practice, a learning algorithm is not likely to find the best function out of the possible functions it can learn, though it can learn one that performs exceptionally well - those functions that the learning algorithm is capable of finding defines the model’s effective capacity. • Hypothesis space: the set of functions the model is limited to learning. For instance, linear regression can be limited to linear functions as its hypothesis space, or it can be expanded to learn polynomials as well, e.g. by introducing an x 2 term. • Hyperparameter: a parameter of a model that is not learned (that is, you specify it yourself) • Underfitting: when the model could achieve better generalization with more training or capacity. Characterized by a high training error. 
• Overfitting: when the model could achieve better generalization with more training or capacity; in particular, the model is too tuned to the idiosyncrasies of the training data (for instance, it may fit to sampling error, which we don’t want). Too much capacity can lead to overfitting in that the model may be able to learn functions too specific to the data. Characterized by a large gap between the training error and the test error. • Model selection: the process of choosing the best hyperparameters on a validation set If the true function is in your hypothesis space H, we say it is realizable in H. In machine learning, there are generally two kinds of problems: regression and classification problems. Machine learning algorithms are typically designed for one or the other. 13.1.1 Regression Regression involves fitting a model to data. The goal is to understand the relationship between one set of variables - the dependent or response or target or outcome or explained variables (e.g. y ) - and another set - the independent or explanatory or predictor or regressor variables (e.g. X or x). In cases of just one dependent and one explanatory variable, we have simple regression. In scenarios with more than one explanatory variable, we have multiple regression. In scenarios with more than one dependent variable, we have multivariate regression. With linear regression we expect that the dependent and explanatory variables have a linear relationship; that is, can be expressed as a linear combination of random variables, i.e.: y = β0 + β1 x1 + · · · + βn xn + ε For some dependent variable y and explanatory variables x1 , . . . , xn , where ε is the residual due to random variation or other noisy factors. 312 CHAPTER 13. SUPERVISED LEARNING 313 13.2. OPTIMIZATION Of course, we do not know the true values for these β parameters (also called regression coefficients) so they end up being point estimates as well. We can estimate them as follows. When given data, one technique we can use is ordinary least squares, sometimes just called least squares regression, which looks for parameters β0 , . . . , βn such that the sum of the squared residuals ∑ (i.e. the SSE, i.e. e12 + · · · + en2 = ni=1 (yi − ŷi )2 ) is minimized (this minimization requirement is called the least squares criterion). The resulting line is called the least squares line. Note that linear model does not mean the model is necessarily a straight line. It can be polynomial as well - but you can think of the polynomial terms as additional explanatory variables; looking at it this way, the line (or curve, but for consistency, they are all called “lines”) still follows the form above. And of course, in higher-dimensions (that is, for multiple regression) we are not dealing with lines but planes, hyperplanes, and so on. But again, for the sake of simplicity, they are all just referred to as “lines”. For example, the line: y = β0 + β1 x1 + β2 x12 + ε Can be re-written: x2 = x12 y = β0 + β1 x1 + β2 x2 + ε When we use a regression model to predict a dependent variable, e.g. y , we denote it as a estimate by putting a hat over it, e.g. ŷ . 13.1.2 Classification Classification problems are where your target variables are discrete, so they represent categories or classes. For binary classification, there are only two classes (that is, y ∈ {0, 1}). We call the 0 class the negative class, and the 1 class the positive class. Otherwise, the classification problem is called a multiclass classification problem - there are more than two classes. 
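To make the regression setup of 13.1.1 concrete before moving on to optimization, here is a small ordinary least squares sketch (illustrative, assuming numpy; np.linalg.lstsq finds the β values that minimize the sum of squared residuals, and the toy data is made up):

import numpy as np

# toy data: y = 2 + 3x plus noise
x = np.linspace(0, 10, 50)
y = 2 + 3 * x + np.random.normal(0, 1, size=x.shape)

# design matrix with a column of ones for the intercept beta_0
X = np.column_stack([np.ones_like(x), x])

# least squares estimate of [beta_0, beta_1]
beta, residuals, rank, sv = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # roughly [2, 3]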
13.2 Optimization Much of machine learning can be framed as optimization problems - there is some kind of objective (also called loss or cost) function which we want to optimize (e.g. minimize classification error on the training set). Typically you are trying to find some parameters for your model, θ, which minimizes this objective or loss function. CHAPTER 13. SUPERVISED LEARNING 313 13.2. OPTIMIZATION 314 Generally this framework for machine learning is called empirical risk minimization and can be formulated: argmin θ 1∑ l(f (x (i) ; θ), y (i) ) + λΩ(θ) n i Where: • f (x (i) ; θ) is your model, which outputs some predicted value for the input x (i) and θ are the parameters for the model • y (i) is the training label (i.e. the ground-truth) for the input x (i) • l is the loss function • Ω(θ) is a regularlizer to penalize certain values of θ and λ is the regularization parameter (see below on regularlization) Some optimization terminology: • Critical points: x ∈ Rn |∇x f (x) = 0 • Curvature in direction v : v T ∇2x f (x)v • Types of critical points: – local minima: v T ∇2x f (x)v > 0, ∀v , that is ∇2x f (x) is positive definite – local maxima: v T ∇2x f (x)v < 0, ∀v , that is ∇2x f (x) is negative definite – saddle point: curvature is positive in some directions and negative in others 13.2.1 Cost functions So we have our training set {(x (1) , y (1) ), (x (2) , y (2) ), . . . , (x (m) , y (m) )} where y ∈ {0, 1} and with the hypothesis function from before. Here is the cost function for linear regression: J(θ) = m 1 ∑ 1 (hθ (x (i) ) − y (i) )2 m i=1 2 Note that the 12 is introduced for convenience, so that the square exponent cancels out when we differentiate. Introducing an extra constant doesn’t affect the result. Note that now the 1 2m has been split into 1 m and 12 . We can extract 12 (hθ (x (i) ) − y (i) )2 and call it Cost(hθ (x), y ). The cost function for logistic regression is different than that used for linear regression because the hypothesis function of logistic regression causes J(θ) to be non-convex, that is, look something like the following with many local optima, making it hard to converge on the global minimum. 314 CHAPTER 13. SUPERVISED LEARNING 315 13.2. OPTIMIZATION A saddle point A non-convex function CHAPTER 13. SUPERVISED LEARNING 315 13.2. OPTIMIZATION 316 So we want to find a way to define Cost(hθ (x), y ) such that it gives us a convex J(θ). We will use: −log(hθ (x)) Cost(hθ (x), y ) = −log(1 − hθ (x)) if y = 1 if y = 0 Some properties of this function is that if y = hθ (x) then Cost = 0, and as hθ (x) → 0, Cost → ∞. We can rewrite cost in a form more conducive to gradient descent: Cost(hθ (x), y ) = −y log(hθ (x)) − (1 − y )log(1 − hθ (x)) So our entire cost function is: J(θ) = − m 1 ∑ [ y (i) log(hθ (x (i) )) + (1 − y (i) )log(1 − hθ (x (i) ))] m i=1 You could use other cost functions for logistic regression, but this one is derived from the principle of maximum likelihood estimation and has the nice property of being convex, so this is the one that basically everyone uses for logistic regression. Then we can calculate mi nθ J(θ) with gradient descent by repeating and simultaneously updating: θj := θj − α m ∑ (hθ (x (i) ) − y (i) ) · xj(i) i=1 This looks exactly the same as the linear regression gradient descent algorithm, but it is different because hθ (x) is now the nonlinear hθ (x) = 1+e1θT x . Still, the previous methods for gradient descent (feature scaling and learning rate adjustment) apply here. 
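As a quick sketch of this logistic cost and the corresponding gradient update (illustrative only, assuming numpy; hθ is the sigmoid hypothesis, and the toy data, names, and learning rate are made up):

import numpy as np

def h(theta, X):
    # sigmoid hypothesis: 1 / (1 + exp(-theta^T x)) for each row x of X
    return 1.0 / (1.0 + np.exp(-X.dot(theta)))

def cost(theta, X, y):
    # J(theta) = -(1/m) * sum[y*log(h) + (1-y)*log(1-h)]
    p = h(theta, X)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def gradient_step(theta, X, y, alpha=0.1):
    # theta_j := theta_j - alpha * sum_i (h(x_i) - y_i) * x_ij
    return theta - alpha * X.T.dot(h(theta, X) - y)

X = np.array([[1.0, 0.5], [1.0, 1.5], [1.0, -1.0]])  # first column is x0 = 1
y = np.array([1.0, 1.0, 0.0])
theta = np.zeros(2)
for _ in range(100):
    theta = gradient_step(theta, X, y)
print(theta, cost(theta, X, y))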
13.2.2 Gradient Descent Gradient descent is an optimization algorithm for finding parameter values which minimize a cost function. Gradient descent perhaps the most common optimization algorithm in machine learning. So we have some cost function J(θ0 , θ1 , . . . , θn ) and we want to minimize it. The general approach is: • Start with some θ0 , θ1 , . . . , θn . • Changing θ0 , θ1 , . . . , θn in some increment/step to reduce J(θ0 , θ1 , . . . , θn ) as much as possible. • Repeat the previous step until convergence on a minimum (hopefully) 316 CHAPTER 13. SUPERVISED LEARNING 317 13.2. OPTIMIZATION Gradient descent algorithm Repeat the following until convergence: (Note that := is the assignment operator.) θj := θj − α ∂ J(θ0 , θ1 , . . . , θn ) ∂θj For each j in n. Every θj is updated simultaneously. So technically, you’d calculate this value for each j in n and only after they are all updated would you actually update each θj . For example, if the right-hand side of that equation was a function func(j, t0, t1), you would implement it like so (example is n = 2): temp_0 = func(0, theta_0, theta_1) temp_1 = func(1, theta_0, theta_1) theta_0 = temp_0 theta_1 = temp_1 α is the learning rate and tells how large a step/increment to change the parameters by. Learning rates which are too small cause the gradient descent to go slowly. Learning rates which are too large can cause the gradient descent to overshoot the minimum, and in those cases it can fail to converge or even diverge. The partial derivative on the right is just the rate of change from the current value. 13.2.3 Normal Equation The normal equation is an approach which allows for the direct determination of an optimal θ without the need for an iterative approach like gradient descent. With calculus, you find the optimum of a function by calculating where its derivatives equal 0 (the intuition is that derivatives are rates of change, when the rate of change is zero, the function is “turning around” and is at a peak or valley). So we can take the same cost function we’ve been using for linear regression and take the partial derivatives of the cost function J with respect to every parameter of θ and then set each of these partial derivatives to 0: m 1 ∑ J(θ0 , θ1 , . . . , θm ) = (hθ (x (i) ) − y (i) )2 2m i=1 And for each j ∂ J(θ) = · · · = 0 ∂θj CHAPTER 13. SUPERVISED LEARNING 317 13.2. OPTIMIZATION 318 Then solve for θ0 , θ1 , . . . , θm . The fast way to do this is to construct a matrix out of your features, including a column for x0 = 1 (so it ends up being an m × (n + 1) dimensional matrix) and then construct a vector out of your target variables y (which is an m-dimensional vector): If you have m examples, (x (1) , y (1) ), . . . , (x (m) , y (m) ), and n features and then include x0 = 1, you have the following feature vectors: x (i) x0(i) (i) x 1 (i) n+1 = x2 ∈ R . .. xn(i) From which we can construct X, known as the design matrix: (x (1) )T (2) T (x ) X= .. . (x (m) T ) That is, the design matrix is composed of the transposes of the feature vectors for all the training examples. Thus a column in the design matrix corresponds to a feature, and each row corresponds to an example. Typically, all examples have the same length, but this may not necessarily be the case. You may have, for instance, images of different dimensions you wish to classify. This kind of data is heterogeneous. And then the vector y is the just all of the labels from your training data: y (1) (2) y y = ... 
y (m) Then you can calculate the θ vector which minimizes your cost function like so: θ = (X T X)−1 X T y With this method, feature scaling isn’t necessary. Note that it’s possible that X T X is not invertible (that is, it is singular, also called degenerate), but this is usually due to redundant features (e.g. having a feature in feet and in meters; they communicate 318 CHAPTER 13. SUPERVISED LEARNING 319 13.3. PREPROCESSING the same information) or having too many features (e.g. m ≤ n), in which case you should delete some features or use regularization. Programs which calculate the inverse of a matrix often have a method which allows it to calculate the optimal θ vector even if X T X is not invertible. 13.2.4 Deciding between Gradient Descent and the Normal Equation • Gradient Descent – requires that you choose α – needs many iterations – works well when n is large • Normal Equation – don’t need to choose α – don’t need to iterate – slow if n is very large (computing (X T X)−1 has a complexity of O(n3 )), but is usually ok up until around n = 10000 Also note that for some learning algorithms, the normal equation is not applicable, whereas gradient descent still works. 13.2.5 Advanced optimization algorithms There are other advanced optimization algorithms, such as: • Conjugate gradient • BFGS • L-BFGS These shouldn’t be implemented on your own since they require an advanced understanding of numerical computing, even just to understand what they’re doing. They are more complex, but (in the context of machine learning) there’s no need to manually pick a learning rate α and they are often faster than gradient descent. So you can take advantage of them via some library which has them implemented (though some implementations are better than others). 13.3 Preprocessing Prior to applying a machine learning algorithm, data almost always must be preprocessed, i.e. prepared in a way that helps the algorithm perform well (or in a way necessary for the algorithm to work at all). CHAPTER 13. SUPERVISED LEARNING 319 13.3. PREPROCESSING 320 Preprocessing 13.3.1 Feature selection Good features: • lead to data compression • retain relevant information • are based on expert domain knowledge Common mistakes: • trying to automate feature selection • not paying attention to data-specific quirks • throwing away information unnecessarily The model where you include all available features is called the full model. But sometimes including all features can hurt prediction accuracy. There are a few feature selection strategies that can be used. One class of selection strategies is called stepwise selection because they iteratively remove or add one feature at a time, measuring the goodness of fit for each. The two approaches here are the backward-elimination strategy which begins with the full model and removes one feature at a time, and the forward-selection strategy which is the reverse of backward-elimination, starting with one feature and adding the rest one at a time. These two strategies don’t necessarily lead to the same model. 13.3.2 Feature engineering Your data may have features explicitly present, e.g. a column in a database. But you can also design or engineer new features by combining these explicit features or through observing patterns on your own in the data that haven’t yet been explicitly encoded. We’re doing a form of this in polynomial regression above by encoding the polynomials as new features. 320 CHAPTER 13. SUPERVISED LEARNING 321 13.3. 
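As an illustration of engineering polynomial features and then fitting them with the normal equation from earlier in this chapter, here is a small numpy sketch (the synthetic data and coefficient values are mine; pinv is used because, as noted above, X^T X may be singular):

import numpy as np

# Synthetic data: y is quadratic in x, plus noise.
x = np.linspace(0, 2, 30)
y = 1.0 + 2.0 * x - 1.5 * x**2 + np.random.normal(0, 0.05, size=x.size)

# Feature engineering: treat x^2 as an extra explanatory variable,
# and include a column of 1s for x0 = 1.
X = np.column_stack([np.ones_like(x), x, x**2])

# Normal equation: theta = (X^T X)^{-1} X^T y.
# The pseudo-inverse still works when X^T X is singular (degenerate).
theta = np.linalg.pinv(X.T.dot(X)).dot(X.T).dot(y)
print(theta)   # should be close to [1.0, 2.0, -1.5]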
PREPROCESSING Representation A very important choice in machine learning is how you represent the data. What are its salient features, and in what form is it best presented? Each field in the data (e.g. column in the table) is a feature and a great deal of time is spent getting this representation right. The best machine learning algorithms can’t do much if the data isn’t represented in a way suited to the task at hand. Sometimes it’s not clear how to represent data. For instance, in identifying an image of a car, you may want to use a wheel as a feature. But how do you define a wheel in terms of pixel values? Representation learning is a kind of machine learning in which representations themselves can be learned. An example representation learning algorithm is the autoencoder. It’s a combination of an encoder function that converts input data into a different representation and a decoder function which converts the new representation back into its original format. Successful representations separate the factors of variations (that is, the contributors to variability) in the observed data. These may not be explicit in the data, “they may exist either as unobserved objects or forces in the physical world that affect the observable quantities, or they are constructs in the human mind that provide useful simplifying explanations or inferred causes of the observed data.” (Deep Learning). Deep Learning Deep learning builds upon representation learning. It involves having the program learn some hierarchy of concepts, such that simpler concepts are used to construct more complicated ones. This hierarchy of concepts forms a deep (many-layered) graph, hence “deep learning”. With deep learning we can have simpler representations aggregate into more complex abstractions. A basic example of a deep learning model is the multilayer perceptron (MLP), which is essentially a function composed of simpler functions (layers); each function (i.e. layer) can be thought of as taking the input and outputting a new representation of it. For example, if we trained a MLP for image recognition, the first layer may end up learning representations of edges, the next may see corners and contours, the next may identify higher level features like faces, etc. 13.3.3 Scaling (normalization) If you design your features such that they are on a similar scale, gradient descent can converge more quickly. For example, say you are developing a model for predicting the price of a house. Your first feature may be the area, ranging from 0-2000 sqft, and your second feature may be the number of bedrooms, ranging from 1-5. These two ranges are very disparate, causing the contours of the cost function to be such that the gradient descent algorithm jumps around a lot trying to find an optimum. CHAPTER 13. SUPERVISED LEARNING 321 13.3. PREPROCESSING 322 If you scale these features such that they share the same (or at least a similar) range, you avoid this problem. More formally, with feature scaling you want to get every feature into approximately a −1 ≤ xi ≤ 1 range (it doesn’t necessarily have to be between -1 and 1, just so long as there is a consistent range across your features). With feature scaling, you could also apply mean normalization, where you replace xi with xi − µi (that is, replace the value of the i th feature with its value minus the mean value for that feature) such that the mean of that feature is shifted to be about zero (note that you wouldn’t apply this to x0 = 1). 
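A minimal sketch of the scaling and mean normalization just described, using made-up house-price style features (the column meanings and ranges are mine, for illustration only):

import numpy as np

# Made-up feature matrix: column 0 is area in sqft (roughly 0-2000+),
# column 1 is number of bedrooms (1-5). Very different ranges.
X = np.array([[2104.0, 3], [1600.0, 3], [2400.0, 4], [1416.0, 2], [852.0, 1]])

# Mean normalization plus scaling by each feature's range:
# x_i := (x_i - mu_i) / range_i, so every feature has mean about zero
# and roughly similar scale. (An x0 = 1 column, if present, is left alone.)
mu = X.mean(axis=0)
rng = X.max(axis=0) - X.min(axis=0)
X_scaled = (X - mu) / rng
print(X_scaled.mean(axis=0))                      # ~0 for each feature
print(X_scaled.min(axis=0), X_scaled.max(axis=0)) # similar, narrow ranges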
13.3.4 Mean subtraction Mean subtraction centers the data around the origin (i.e. it “zero-centers” it), simply by subtracting each feature’s mean from itself. 13.3.5 Dimensionality Reduction Sometimes some of your features may be redundant. You can combine these features in such a way that you project your higher dimension representation into a lower dimension representation while minimizing information loss. With the reduction in dimensionality, your algorithms will run faster. The most common technique for dimensionality reduction is principal component analysis (PCA), although other techniques, such as non-negative matrix factorization (NMF) can be used. Principal Component Analysis (PCA) Say you have some data. This data has two dimensions, but you could more or less capture it in one dimension: Reducing data dimensionality with PCA Most of the variability of the data happens along that axis. This is basically what PCA does. 322 CHAPTER 13. SUPERVISED LEARNING 323 13.3. PREPROCESSING PCA is the most commonly used algorithm for dimensionality reduction. PCA tries to identify a lower-dimensional surface to project the data onto such that the square projection error is minimized. PCA example PCA might project the data points onto the green line on the left. The projection error are the blue lines. Compare to the line on the right - PCA would not project the data onto that line since the projection error is much larger for that line. This example is going from 2D to 1D, but you can use PCA to project from any n-dimension to a lower k-dimension. Using PCA, we find some k vectors and project our data onto the linear subspace spanned by this set of k vectors. Note that this is different than linear regression, though the example might look otherwise. In PCA, the projection error is orthogonal to the line in question. In linear regression, it is vertical to the line. Linear regression also favors the target variable y whereas PCA makes no such distinction. Prior to PCA you should perform mean normalization (i.e. ensure every feature has zero mean) on your features and scale them. First you compute the covariance matrix, which is denoted Σ (same as summation, unfortunately): Σ= n 1 ∑ (x (i) )(x (i) )T m i=1 Then, you compute the eigenvectors of the matrix Σ using singular value decomposition: [U, S, V ] = svd(Σ) The resulting U matrix will be an n × n orthogonal matrix which provides the projected vectors you’re looking for, so take the first k column vectors of U. This n × k matrix can be called Ureduce , which CHAPTER 13. SUPERVISED LEARNING 323 13.3. PREPROCESSING 324 you then transpose to get these vectors as rows, resulting in a k × n matrix which you then multiply by your feature matrix. So how do you choose k, the number of principal components? One way to choose k is so that most of the variance is retained. If the average squared projection error (which is what PCA tries to minimize) is: m 1 ∑ (i) ||x (i) − xapprox ||2 m i=1 And the total variation in the data is given by: m 1 ∑ ||x (i) ||2 m i=1 Then you would choose the smallest value of k such that: 1 m ∑m (i) (i) − xapprox ||2 i=1 ||x 1 ∑m (i) 2 i=1 ||x || m ≤ 0.01 That is, so that 99% of variance is retained. This procedure for selecting k is made much simpler if you use the S matrix from the svd(Σ) function. The S matrix’s only non-zero values are along its diagonal, S11 , S22 , . . . , Snn . 
Using this you can instead just calculate: ∑k Sii ≤ 0.01 i=1 Sii 1 − ∑i=1 n Or, to put it another way: ∑k Sii ≥ 0.99 i=1 Sii i=1 ∑n In practice, you can reduce the dimensionality quite drastically, such as by 5 or 10 times, such as from 10,000 features to 1,000, and retain variance. But you should not use PCA prematurely - first try an approach without it, then later you can see if it helps. The process of using principal component analysis (PCA) to reduce dimensionality of data is called factor analysis. In factor analysis, the retained principal components are called common factors and their correlations with the input variables are called factor loadings. 324 CHAPTER 13. SUPERVISED LEARNING 325 13.4. LINEAR REGRESSION PCA becomes more reliable the more data you have. The number of examples must be larger than the number of variables in the input matrix. The assumptions of linear correlation must hold as well (i.e. that the variables must be linearly related). PCA Whitening You can go a step further with the resulting U matrix (with only the k chosen components) with PCA whitening, which can improve the training process. PCA whitening is used to decorrelate features and equalize the variance of the features. Thus the first step is to decorrelate the original data X, which is accomplished by rotating it: Xrotated = U · X Then the data is normalized to have a variance of 1 for all of its components. To do so we just divide each component by the square root of its eigenvalue. An epsilon value is included to prevent division by zero: Xrotated Xwhitened = √ (S + ϵ) 13.3.6 Bagging (“Bootstrap aggregating”) Basic idea: Generate more data from your existing data by resampling Bagging (stands for Bootstrap Aggregation) is the way decrease the variance of your prediction by generating additional data for training from your original dataset using combinations with repetitions to produce multisets of the same cardinality/size as your original data. By increasing the size of your training set you can’t improve the model predictive force, but just decrease the variance, narrowly tuning the prediction to expected outcome. - http://stats.stackexchange.com/a/19053/55910 13.4 Linear Regression 13.4.1 Univariate (simple) Linear Regression Univariate linear regression or simple linear regression (SLR) is linear regression with a single variable. In univariate linear regression, we have one input variable x. The hypothesis takes the form: CHAPTER 13. SUPERVISED LEARNING 325 13.4. LINEAR REGRESSION 326 hθ (x) = θ0 + θ1 x Where the θi s are the parameters that the learning algorithm learns. This should look familiar: it’s just a line. 13.4.2 How are the parameters determined? The general idea is that you want to choose your parameters so that hθ (x) is close to y for your training examples (x, y ). This can be written: m ∑ (hθ (x (i) ) − y (i) )2 i=1 To the math easier, you multiply everything by 1 2m (this won’t affect the resulting parameters): m 1 ∑ (hθ (x (i) ) − y (i) )2 2m i=1 This is the cost function (or objective function). In this case, we call it J, which looks like: J(θ0 , θ1 ) = m 1 ∑ (hθ (x (i) ) − y (i) )2 2m i=1 Here it is the squared error function - it is probably the most commonly used cost function for regression problems. The squared error loss function is not the only loss function available. There are a variety you can use, and you can even come up with your own if needed. Perhaps, for instance, you want to weigh positive errors more than negative errors. 
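Stepping back to the PCA procedure described above for a moment, here is a rough numpy sketch of the whole pipeline - covariance matrix, SVD, and choosing k by retained variance; the 99% threshold matches the text, while the synthetic data and the function name are mine:

import numpy as np

def pca(X, variance_retained=0.99):
    # Assumes X is already mean-normalized and scaled (one example per row).
    m = X.shape[0]
    Sigma = (1.0 / m) * X.T.dot(X)        # covariance matrix
    U, S, V = np.linalg.svd(Sigma)        # S holds the diagonal values S_11, ..., S_nn
    # Smallest k such that the retained-variance ratio reaches the threshold.
    ratios = np.cumsum(S) / np.sum(S)
    k = int(np.searchsorted(ratios, variance_retained) + 1)
    U_reduce = U[:, :k]                   # n x k
    Z = X.dot(U_reduce)                   # projected data, m x k
    return Z, U_reduce, k

X = np.random.randn(200, 10)
X = X - X.mean(axis=0)
Z, U_reduce, k = pca(X)
print(k, Z.shape)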
We want to find (θ0 , θ1 ) to minimize J(θ0 , θ1 ). Gradient Descent for Univariate Linear Regression For univariate linear regression, the derivatives are: m ∂ 1 ∑ J(θ0 , θ1 ) = (hθ (x (i) ) − y (i) ) ∂θ0 m i=1 m ∂ 1 ∑ J(θ0 , θ1 ) = (hθ (x (i) ) − y (i) ) · x (i) ∂θ1 m i=1 so overall, the algorithm involves repeatedly updating: 326 CHAPTER 13. SUPERVISED LEARNING 327 13.4. LINEAR REGRESSION An example cost function with two parameters The same cost function, visualized as a contour plot CHAPTER 13. SUPERVISED LEARNING 327 13.4. LINEAR REGRESSION 328 θ0 := θ0 − α m 1 ∑ (hθ (x (i) ) − y (i) ) m i=1 θ1 := θ1 − α m 1 ∑ (hθ (x (i) ) − y (i) ) · x (i) m i=1 Remember that the θ parameters are updated simultaneously. Note that because we are summing over all training examples for each step, this particular type of gradient descent is known as batch gradient descent. There are other approaches which only sum over a subset of the training examples for each step. Univariate linear regression’s cost function is always convex (“bowl-shaped”), which has only one optimum, so gradient descent int his case will always find the global optimum. A convex function 13.4.3 Multivariate linear regression Multivariate linear regression is simply linear regression with multiple variables. This technique is for using multiple features with linear regression. Say we have: • n = number of features • x (i) = the input features of the i th training example • xj(i) = the value of feature j in the i th training example Instead of the simple linear regression model we can use a generalized linear model (GLM). That is, the hypothesis h will take the form of: hθ (x) = θ0 + θ1 x1 + θ2 x2 + · · · + θn xn 328 CHAPTER 13. SUPERVISED LEARNING 329 13.4. LINEAR REGRESSION For convenience of notation, you can define x0 = 1 and notate your features and parameters as zero-indexed n + 1-dimensional vectors: x0 θ0 x1 θ1 x = x2 , θ = θ2 . . .. .. xn θn And the hypothesis can be re-written as: hθ x = θ T x Sometimes in multiple regression you may have predictor variables which are correlated with one another; we say that these predictors are collinear. Gradient descent with Multivariate Linear Regression The previous gradient descent algorithm for univariate linear regression is just generalized (this is still repeated and simultaneously updated): θj := θj − α 13.4.4 m 1 ∑ (hθ (x (i) ) − y (i) ) · xj(i) m i=1 Example implementation of linear regression with gradient descent ””” - X = feature vectors - y = labels/target variable - theta = parameters - hyp = hypothesis (actually, the vector computed from the hypothesis function) ””” import numpy as np def cost_function(X, y, theta): ””” This isn’t used, but shown for clarity ””” m = y.size hyp = np.dot(X, theta) CHAPTER 13. SUPERVISED LEARNING 329 13.4. LINEAR REGRESSION 330 sq_err = sum(pow(hyp - y, 2)) return (0.5/m) * sq_err def gradient_descent(X, y, theta, alpha=0.01, iterations=10000): m = y.size for i in range(iterations): hyp = np.dot(X, theta) for i, p in enumerate(theta): temp = X[:,i] err = (hyp - y) * temp cost_function_derivative = (1.0/m) * err.sum() theta[i] = theta[i] - alpha * cost_function_derivative return theta if __name__ == ’__main__’: def true_function(X): # Create random parameters for X’s dimensions, plus one for x0. 
true_theta = np.random.rand(X.shape[1] + 1) return true_theta[0] + np.dot(true_theta[1:], X.T), true_theta # Create some random data n_samples = 20 n_dimensions = 5 X = np.random.rand(n_samples, n_dimensions) y, true_theta = true_function(X) # Add a column of 1s for x0 ones = np.ones((n_samples, 1)) X = np.hstack([ones, X]) # Initialize parameters theta = np.zeros((n_dimensions+1)) # Split data X_train, y_train = X[:-1], y[:-1] X_test, y_test = X[-1:], y[-1:] # Estimate parameters theta = gradient_descent(X_train, y_train, theta, alpha=0.01, iterations=10000) # Predict print(’true’, y_test) print(’pred’, np.dot(X_test, theta)) 330 CHAPTER 13. SUPERVISED LEARNING 331 13.5. LOGISTIC REGRESSION print(’true theta’, true_theta) print(’pred theta’, theta) 13.4.5 Outliers Outliers can pose a problem for fitting a regression line. Outliers that fall horizontally away from the rest of the data points can influence the line more, so they are called points with high leverage. Any such point that actually does influence the line’s slope is called an influential point. You can examine this effect by removing the point and then fitting the line again and seeing how it changes. Outliers should only be removed with good reason - they can still be useful and informative and a good model will be able to capture them in some way. 13.4.6 Polynomial Regression Your data may not fit a straight line and might be better described by a polynomial function, e.g. θ0 + θ1 x + θ2 x 2 or θ0 + θ1 x + θ2 x 2 + θ3 x 3 . A trick to this is that you can write this in the form of plain old multivariate linear regression. You would, for example, just treat x as a feature x1 , x 2 as another feature x2 , x 3 as another feature x3 , and so on: θ0 + θ1 x1 + θ2 x2 + θ3 x3 + · · · + θn xn Note that in situations like this, feature scaling is very important because these features’ ranges differ by a lot due to the exponents. 13.5 Logistic Regression Logistic regression is a common approach to classification. The name “regression” may be a bit confusing - it is a classification algorithm, though it returns a continuous value. In particular, it returns the probability of the positive class; if that probability is ≥ 0.5, then the positive label is returned. Logistic regression outputs a value between zero and one (that is, 0 ≤ hθ (x) ≤ 1). Say we have our hypothesis function hθ (x) = θT x With logistic regression, we apply an additional function g : hθ (x) = g(θT x) CHAPTER 13. SUPERVISED LEARNING 331 13.5. LOGISTIC REGRESSION 332 where g(z) = 1 1 + e −z This function is known as the sigmoid function, also known as the logistic function, with the form: The sigmoid or logistic function So in logistic regression, the hypothesis ends up being: hθ (x) = 1 1 + e θT x The output of the hypothesis is interpreted as the probability of the given input belonging to the positive class. that is: hθ (x) = P (y = 1|x; θ) Which is read: “the probability that y = 1 given x as parameterized by θ”. Since we are classifying input, we want to output a label, not a continuous value. So we might say y = 1 if hθ (x) ≥ 0.5 and y = 0 if hθ (x) < 0.5. The line that forms this divide is an example of 332 CHAPTER 13. SUPERVISED LEARNING 333 13.5. LOGISTIC REGRESSION a decision boundary. Note that decision boundaries can be non-linear as well (e.g. they could be a circle or something). more on Logistic Regression Logistic regression is also a GLM - you’re fitting a line which models the probability of being in the positive class. 
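As a quick sketch of the thresholding described just above - the hypothesis outputs a probability and the 0.5 cutoff turns it into a label - here is a tiny numpy example (the parameter values and inputs are made up, not fitted):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up parameters; x0 = 1 is included as the first feature.
theta = np.array([-3.0, 1.0])
X = np.array([[1.0, 1.0], [1.0, 2.5], [1.0, 4.0]])

probs = sigmoid(X.dot(theta))        # h_theta(x) = P(y = 1 | x; theta)
labels = (probs >= 0.5).astype(int)  # theta^T x = 0 is the decision boundary
print(probs)    # probabilities below and above 0.5
print(labels)   # predicted classes 0 or 1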
We can use the Bernoulli distribution since it models events with two possible outcomes and is parameterized by only the probability of the positive outcome, p. Thus our line would look something like: pi = β0 + β1 x1 + · · · + βn xn + ϵ But to represent a probability, y values must be bound to [0, 1]. Currently, our model can be linear or polynomial and thus can output any continuous value. So we have to apply a transformation to constrain y ; we do so by applying a logit transformation: logi t(p) = log( The p 1−p p )=x 1−p term constraints the output to be positive. The log operation constrains the values to [0, 1]. The inverse of the logit transformation is: p= 1 1 + exp(−x) So the model is now: logi t(p) = β0 + β1 x1 + · · · + βn xn ϵ So the likelihood here is: L(y |p) = n ∏ piyi (1 − pi )1−yi i=1 And the log likelihood then is: l(y |p) = n ∑ yi log(pi ) + (1 − yi ) log(1 − pi ) i=1 CHAPTER 13. SUPERVISED LEARNING 333 13.5. LOGISTIC REGRESSION 334 even more on Logistic Regression Linear regression is good for explaining continuous dependent variables. But for discrete variables, linear regression gives ambiguous results - what does a fractional result mean? It can’t be interpreted as a probability because linear regression models are not bound to [0, 1] as probability functions must be. When dealing with boolean/binary dependent variables you can use logistic regression. When dealing with non-binary discrete dependent variables, you can use Poisson regression (which is a GLM that uses the log link function). So we expect the logistic regression function to output a probability. In linear regression, the model can output any value, not bound to [0, 1]. So for logistic regression we apply a transformation, most commonly the logit transformation, so that our resulting values can be interpreted as probability: transformation(p) = β0 + β1 x1 + · · · + βn xn p logit(p) = loge ( ) 1−p So if we solve the original regression equation for p, we end up with: p= e β0 +β1 x1 +···+βn xn 1 + e β0 +β1 x1 +···+βn xn Logistic regression does not have a closed form solution - that is, it can’t be solved in a finite number of operations, so we must estimate its parameters using other methods, more specifically, we use iterative methods. Generally the goal is to find the maximum likelihood estimate (MLE), which is the set of parameters that maximizes the likelihood of the data. So we might start with random guesses for the parameters, the compute the likelihood of our data (that is, we can compute the probability of each data point; the likelihood of the data is the product of these individual probabilities) based on these parameters. We iterate until we find the parameters which maximize this likelihood. 13.5.1 One-vs-All The technique of one-vs-all (or one-vs-rest) involves dividing your training set into multiple binary classification problems, rather than as a single multiclass classification problem. For example, say you have three classes 1,2,3. Your first binary classifier will distinguish between class 1 and classes 2 and 3„ your second binary classifier will distinguish between class 2 and classes 1 and 3, and your final binary classifier will distinguish between class 3 and classes 1 and 2. Then to make the prediction, you pick the class i which maximizes maxi hθ(i) (x). 334 CHAPTER 13. SUPERVISED LEARNING 335 13.6. SOFTMAX REGRESSION 13.6 Softmax regression Softmax regression generalizes logistic regression to beyond binary classification (i.e. 
multinomial classification; that is, there are more than just two possible classes). Logistic regression is the reduced form of softmax regression where k = 2 (thus logistic regression is sometimes caled a “binary Softmax classifier”). As is with logistic regression, softmax regression outputs probabilities for each class. As a generalization of logistic regression, softmax regression can also be expressed as a generalized linear model. It generally uses a cross-entropy loss function. 13.6.1 Hierarchical Softmax In the case of many, many classes, the hierarchical variant of softmax may be preferred. In hierarchical softmax, the labels are structured as a hierarchy (a tree). A Softmax classifier is trained for each node of the tree, to distinguish the left and right branches. 13.7 Generalized linear models (GLMs) There is a class of machine learning models known as generalized linear models (GLMs) because they are expressed as a linear combination of parameters, i.e. ŷ = θ0 + θ1 x1 + · · · + θn xn We can use linear models for non-regression situations, as we do with logistic regression - that is, when the output variable is not an unbounded continuous value directly computed from the inputs (that is, the output variable is not a linear function of the inputs), such as with binary or other kinds of classification. In such cases, the linear models we used are called generalized linear models. Like any linear function, we get some value from our inputs, but we then also apply a link function which transforms the resulting value into something we can use. Another way of putting it is that these link functions allow us to generalize linear models to other situations. Linear regression also assumes homoscedasticity; that is, that the variance of the error is uniform along the line. GLMs do not need to make this assumption; the link function transforms the data to satisfy this assumption. For example, say you want to predict whether or not someone will buy something - this is a binary classification and we want either a 0 or a 1. We might come up with some linear function based on income and number of items purchased in the last month, but this won’t give us a 0/no or a 1/yes, it will give us some continuous value. So then we apply some link function of our choosing which turns the resulting value to give us the probability of a 1/yes. Linear regression is also a GLM, where the link function is the identity function. Logistic regression uses the logit link function. CHAPTER 13. SUPERVISED LEARNING 335 13.7. GENERALIZED LINEAR MODELS (GLMS) 336 Logistic regression is a type of models called generalized linear models (GLM), which involves two steps: 1. Model the response variable with a probability distribution. 2. Model the distribution’s parameters using the predictor variables and a special form of multiple regression. This probability distribution is taken from the exponential family of probability distributions, which includes the normal, Bernoulli, beta, gamma, Dirichlet, and Poisson distributions (among others). A distribution is in the exponential family if it can be written in the form: P (y |n) = b(y ) exp(η T T (y ) − a(η)) η is known as the natural parameter or the canonical parameter of the distribution, T (y ) is the sufficient statistics, which is often just T (y ) = y . a(η) is the log partition function. We can set T, a, b to define a family of distributions; this family is parameterized by η, with different values giving different distributions within the family. 
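Returning briefly to softmax regression, here is a rough sketch of the softmax function and the cross-entropy loss it is usually paired with (the class scores are made up; subtracting the max score is a standard numerical-stability trick, not something from the text):

import numpy as np

def softmax(scores):
    # Subtracting the max score avoids overflow and does not change
    # the resulting probabilities.
    z = scores - np.max(scores)
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

def cross_entropy(probs, true_class):
    # L_i = -log(probability assigned to the correct class)
    return -np.log(probs[true_class])

scores = np.array([2.0, 1.0, -1.0])   # made-up class scores
probs = softmax(scores)
print(probs, probs.sum())              # probabilities over 3 classes, sums to 1
print(cross_entropy(probs, true_class=0))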
For instance, the Bernoulli distribution is in the exponential family, where p η = log( 1 − p)) ( T (y ) = y a(η) = − log(1 − p) b(y ) = 1 Same goes for the Gaussian distribution, where η=µ T (y ) = y µ2 a(η) = 2 −y 2 1 exp( b(y ) = √ ) 2 (2π) Note that with linear models, you should avoid extrapolation, that is, estimating values which are outside the original data’s range. For example, if you have data in some range [x1 , xn ], you have no guarantee that your model behaves correctly at x < x1 and x > xn . 13.7.1 Linear Mixed Models (Mixed Models/Hierarchical Linear Models) In a linear model there may be mixed effects, which includes fixed and random effects. Fixed effects are variables in your model where their coefficients are fixed (non-random). Random effects are variables in your model where their coefficients are random. 336 CHAPTER 13. SUPERVISED LEARNING 337 13.7. GENERALIZED LINEAR MODELS (GLMS) For example, say you want to create a model for crop yields given a farm and amount of rainfall. We have data from several years and the same farms are represented multiple times throughout. We could consider that some farms may be better at producing greater crop yields given the same amount of rainfall as another farm. So we expect that samples from different farms will have different variances - e.g. if we look at just farm A’s crop yields, that sample would have different variance than if we just looked at farm B’s crop yields. In this regard, we might expect that models for farm A and farm B will be somewhat different. The naive approach would be to just ignore differences between farms and consider only rainfall as a fixed effect (i.e. with a fixed/constant coefficient). This is sometimes called “pooling” because we’ve lumped everything (in our case, all the farms) together. We could create individual models for each farm (“no pooling”) but perhaps for some farms we only have one or two samples. For those farms, we’d be building very dubious models since their sample sizes are so small. The information from the other farms are still useful for giving us more data to work with in these cases, so no pooling isn’t necessarily a good approach either. We can use a mixed model (“partial pooling”) to capture this and make it so that the rainfall coefficient random, varying by farm. more…from another source We may run into situations like the following: A situation where an HLM might be better Where our data seems to encompass multiple models (the red, green, blue, and black ones going up from left to right), but if we try to model them all simultaneously, we get a complete incorrectly CHAPTER 13. SUPERVISED LEARNING 337 13.7. GENERALIZED LINEAR MODELS (GLMS) 338 model (the dark grey line going down from left to right). Each of the true lines (red, green, blue, black) may come from distinct units, i.e. each could represent a different part of the day or a different US state, etc. When there are different effects for each unit, we say that there is unit heterogeneity in the data. In the example above, each line has a different intercept. But the slopes could be different, or both the intercepts and slopes could be different: Varying slopes and intercepts In this case, we use a random-effects model because some of the coefficients are random. 
For instance, in the first example above, the intercepts varied, in which case the intercept coefficient would be replaced with a random variable αi drawn from the normal distribution: y = αi + β i x + ϵ Or in the case of the slopes varying, we’d say that β i is a random variable drawn from the normal distribution. In each case, α is the mean intercept and β is the mean slope. When both slope and intercept vary, we draw them together from a multivariate normal distribution since they may have some relation, i.e. [ ] [ ] αi α ∼ Φ( , Σ) βi β Now consider when there are multiple levels of these effects that we want to model. For instance, perhaps there are differences across US states but also differences across US regions. In this case, we will have a hierarchy of effects. Let’s say only the intercept is affected - if we wanted to model the effects of US regions and US states on separate levels, then the αi will be drawn from a distribution according to the US region, αi ∼ Φ(µregion , σα2 ), and then the regional mean which parameterizes αi ’s distribution is drawn from a distribution of regional means, µregion ∼ Φ(µ, σr2 ). 338 CHAPTER 13. SUPERVISED LEARNING 339 13.8. SUPPORT VECTOR MACHINES 13.8 Support Vector Machines SVMs can be powerful for learning non-linear functions and are widely-used. With SVMs, the optimization objective is: minθ m ∑ [y (i) cost1 (θT x (i) ) + (1 − y (i) )cost0 (θT x (i) )] + i=1 n λ∑ θ2 2 j=1 j Where the term at the end is the regularization term. Note that this is quite similar to the objective function for logistic regression; we have just removed the m1 term (removing it does not make a difference to our result because it is a constant) and substituted the log hypothesis terms for two new cost functions. If we break up the logistic regression objective function into terms (that is, the first sum and the regularization term), we might write it as A + λB. The SVM objective is often instead notated by convention as CA + B. You can think of C as λ1 . That is, where increasing λ brings your parameters closer to zero, the regularization parameter C has the opposite effect - as it grows, so do your parameters, and vice versa. With that representation in mind, we can rewrite the objective by replacing the λ with C on the first term: minθ C m ∑ [y (i) cost1 (θT x (i) ) + (1 − y (i) )cost0 (θT x (i) )] + i=1 n 1∑ θ2 2 j=1 j The SVM hypothesis is: 1 hθ (x) = if θT x ≥ 0 0 otherwise SVMs are sometimes called large margin classifiers. Take the following data: On the left, a few different lines separating the data are drawn. The optimal one found by SVM is the one in orange. It is the optimal one because it has the largest margins, illustrated by the red lines on the right (technically, the margin is orthogonal from the decision boundary to those red lines). When C is very large, SVM tries to maximize these margins. However, outliers can throw SVM off if your regularization parameter C is too large, so in those cases, you may want to try a smaller value for C. 13.8.1 Kernels Kernels are the main technique for adapting SVMs to do complex non-linear classification. CHAPTER 13. SUPERVISED LEARNING 339 13.8. SUPPORT VECTOR MACHINES 340 SVM and margins 340 CHAPTER 13. SUPERVISED LEARNING 341 13.8. SUPPORT VECTOR MACHINES A note on notation. Say your hypothesis looks something like: θ0 + θ1 x1 + θ2 x2 + θ3 x1 x2 + θ4 x12 + . . . We can instead notate each non-parameter term as a feature f , like so: θ0 + θ1 f1 + θ2 f2 + θ3 f3 + θ4 f4 + . . . 
For SVMs, how do we choose these features? What we can do is compute features based on x’s proximity to landmarks l (1) , l (2) , l (3) , . . . . For each landmark, we get a feature: fi = similarity(x, l (i) ) = exp(− ||x − l (i) ||2 ) 2σ 2 Here, the similarity(x, l (i) ) function is the kernel, sometimes just notated k(x, l (i) ). We have a choice in what kernel function we use; here we are using Gaussian kernels. In the Gaussian kernel we have a parameter σ. If x is close to l (i) , then we expect fi ≈ 1. Conversely, if x is far from l (i) , then we expect fi ≈ 0. With this approach, classification becomes based on distances to the landmarks - points that are far away from certain landmarks will be classified 0, points that are close to certain landmarks will be classified 1. And thus we can get some complex decision boundaries like so: An example SVM decision boundary So how do you choose the landmarks? You can take each training example and place a landmark there. So if you have m training examples, you will have m landmarks. CHAPTER 13. SUPERVISED LEARNING 341 13.8. SUPPORT VECTOR MACHINES 342 So given (x (1) , y (1) ), (x (2) , y (2) ), . . . , (x (m) , y (m) ), choose l (1) = x (1) , l (2) = x (2) , . . . , l (m) = x (m) . Then given a training example (x (i) , y (i) ), we can compute a feature vector f , where f0 = 1, like so: f0 = 1 = sim(x (i) , l (1) ) = sim(x (i) , l (2) ) .. . (i) f 1 (i) f2 f = (i) (i) (i) f i = si m(x , l ) .. . fm(i) = si m(x (i) , l (m) ) Then instead of x we use our feature vector f . So our objective function becomes: mi nθ C m ∑ (i) T [y cost1 (θ f (i) ) + (1 − y )cost0 (θ f (i) T i=1 (i) n 1∑ )] + θj2 2 j=1 Note that here n = m because we have a feature for each of our m training examples. Of course, using a landmark for each of your training examples makes SVM difficult on large datasets. There are some implementation tricks to make it more efficient, though. When choosing the regularization parameter C, note that: • A large C means lower bias, high variance • A small C means higher bias, low variance For the Gaussian kernel, we also have to choose the parameter σ 2 . • A large σ 2 means that features fi vary more smoothly. Higher bias, lower variance. • A small σ 2 means that features fi vary less smoothly. Lower bias, higher variance. When using SVM, you also need to choose a kernel, which could be the Gaussian kernel, or it could be no kernel (i.e. a linear kernel), or it could be one of many others. The Gaussian and linear kernels are by far the most commonly used. You may want to use a linear kernel if n is very large, but you don’t have many training examples (m is small). Something more complicated may overfit if you only have a little data. The Gaussian kernel is appropriate if n is small and/or m is large. Note that you should perform feature scaling before using the Gaussian kernel. Not all similarity functions make valid kernels - they must satisfy a condition called Mercer’s Theorem which allows the optimizations that most SVM implementations provide and also so they don’t diverge. Other off-the-shelf kernels include: 342 CHAPTER 13. SUPERVISED LEARNING 343 13.8. SUPPORT VECTOR MACHINES • Polynomial kernel: k(x, l) = (x T l)2 , or k(x, l) = (X T l)3 , or k(x, l + 1)3 , etc (there are many variations), the general form is (x T l + constant)degree . It usually performs worse than the Gaussian kernel. • More esoteric ones: String kernel, chi-square kernel, histogram intersection kernel, … But these are seldom, if ever, used. 
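A small numpy sketch of the Gaussian-kernel features described above, with the landmarks taken to be the training examples themselves (the data and σ value are made up):

import numpy as np

def gaussian_kernel(x, l, sigma=1.0):
    # similarity(x, l) = exp(-||x - l||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - l) ** 2) / (2 * sigma ** 2))

# Made-up training examples; each one is also used as a landmark.
X = np.array([[1.0, 2.0], [2.0, 0.5], [4.0, 4.0]])
landmarks = X.copy()

# For a training example x, build the feature vector f with f0 = 1.
x = X[0]
f = np.array([1.0] + [gaussian_kernel(x, l) for l in landmarks])
print(f)   # f1 is ~1 (x is its own landmark); distant landmarks give values near 0

Classification then works on f instead of x, so points close to the landmarks of one class end up on that class's side of the decision boundary.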
Some SVM packages have a built-in multi-class classification functionality. Otherwise, you can use the one-vs-all method. That is, train K SVMs, one to distinguish y = i from the rest, for i = 1, 2, . . . , K, then get θ(1) , θ(2) , . . . , θ(K) , and pick classs i with the largest (θ(i) )T x. If n is large relative to m, e.g. n = 10000, m ∈ [10, 1000], then it may be better to use logistic regression, or SVM without a kernel (linear kernel). If n is small (1-1000) and m is intermediate (10-50000), then you can try SVM with the Gaussian kernel. If n is small (1-1000) but m is large (50000+), then you can create more features and then use logistic regression or SVM without a kernel, since otherwise SVMs struggle at large training sizes. SVM without a kernel works out to be similar to logistic regression for the most part. Neural networks are likely to work well for most of these situations, but may be slower to train. The SVM’s optimization problem turns out to be convex, so good SVM packages will find global minimum or something close to it (so no need to worry about local optima). Other rules of thumb: • Use linear kernel when number of features is larger than number of observations. • Use gaussian kernel when number of observations is larger than number of features. • If number of observations is larger than 50,000 speed could be an issue when using gaussian kernel; hence, one might want to use linear kernel. Source Also: Usually, the decision is whether to use linear or an RBF (aka Gaussian) kernel. There are two main factors to consider: Solving the optimisation problem for a linear kernel is much faster, see e.g. LIBLINEAR. Typically, the best possible predictive performance is better for a nonlinear kernel (or at least as good as the linear one). It’s been shown that the linear kernel is a degenerate version of RBF, hence the linear kernel is never more accurate than a properly tuned RBF kernel. Quoting the abstract from the paper I linked: The analysis also indicates that if complete model selection using the Gaussian kernel has been conducted, there is no need to consider linear SVM. CHAPTER 13. SUPERVISED LEARNING 343 13.8. SUPPORT VECTOR MACHINES 344 A basic rule of thumb is briefly covered in NTU’s practical guide to support vector classification (Appendix C). If the number of features is large, one may not need to map data to a higher dimensional space. That is, the nonlinear mapping does not improve the performance. Using the linear kernel is good enough, and one only searches for the parameter C. Your conclusion is more or less right but you have the argument backwards. In practice, the linear kernel tends to perform very well when the number of features is large (e.g. there is no need to map to an even higher dimensional feature space). A typical example of this is document classification, with thousands of dimensions in input space. In those cases, nonlinear kernels are not necessarily significantly more accurate than the linear one. This basically means nonlinear kernels lose their appeal: they require way more resources to train with little to no gain in predictive performance, so why bother. TL;DR Always try linear first since it is way faster to train (AND test). If the accuracy suffices, pat yourself on the back for a job well done and move on to the next problem. If not, try a nonlinear kernel. Source 13.8.2 more on support vector machines Support vector machines is another way of coming up with decision boundaries to divide a space. 
Here the decision boundary is positioned so that its margins are as wide as possible. We can consider some vector w ⃗ which is perpendicular to the decision boundary and has an unknown length. Then we can consider an unknown vector u⃗ that we want to classify. ⃗ · u⃗, and see if it is greater than or equal to some constant c. We can compute their dot product, w To make things easier to work with mathematically, we set b = −c and rewrite this as: w ⃗ · u⃗ + b ≥ 0 This is our decision rule: if this inequality is true, we have a positive example. Now we will define a few things about this system: w ⃗ ·⃗ x+ + b ≥ 1 w ⃗ ·⃗ x− + b ≤ −1 Where ⃗ x+ is a positive training example and ⃗ x− is a negative training example. So we will insist that these inequalities hold. 344 CHAPTER 13. SUPERVISED LEARNING 345 13.8. SUPPORT VECTOR MACHINES Support Vector Machines CHAPTER 13. SUPERVISED LEARNING 345 13.8. SUPPORT VECTOR MACHINES 346 For mathematical convenience, we will define another variable yi like so: yi = +1 yi = y = −1 i if positive example if negative example So we can rewrite our constraints as: yi ( w ⃗ ·⃗ x+ + b) ≥ 1 yi ( w ⃗ ·⃗ x− + b) ≥ 1 Which ends up just collapsing into: yi ( w ⃗ ·⃗ x + b) ≥ 1 Or: yi ( w ⃗ ·⃗ x + b) − 1 ≥ 0 We then add an additional constraint for an xi in the gutter (that is, within the margin of the decision boundary): yi ( w ⃗ ·⃗ x + b) − 1 = 0 So how do you compute the total width of the margins? You can take a negative example ⃗ x− and a positive example ⃗ x+ and compute their difference ⃗ x+ − ⃗ x− . This resulting vector is not orthogonal to the decision boundary, so we can project it onto the unit vector ŵ (the unit vector of the w ⃗ , which is orthogonal to the decision boundary): width = (⃗ x+ − ⃗ x− ) · w ⃗ ||w ⃗ || Using our previous constraints we get ⃗ x+ = 1 − b and −⃗ x− = 1 + b, so the end result is: width = 2 ||w ⃗ || We want to maximize the margins, that is, we want to maximize the width, and we can divide by 12 because we still have a meaningful maximum, and that in turn can be interpreted as the minimum ⃗ , which we can rewrite in a more mathematically convenient form (and still have of the length of w the same meaningful minimum): 346 CHAPTER 13. SUPERVISED LEARNING 347 13.8. SUPPORT VECTOR MACHINES Support Vector Machines CHAPTER 13. SUPERVISED LEARNING 347 13.8. SUPPORT VECTOR MACHINES max( 348 2 1 1 ) → max( ) → min(||w ⃗ ||) → mi n( ||w ⃗ ||2 ) ||w ⃗ || ||w ⃗ || 2 Let’s turn this into something we can maximize, incorporating our constraints. We have to use Lagrange multipliers which provide us with this new function we can maximize without needing to think about our constraints anymore: L= ∑ 1 ||w ⃗ ||2 − αi [yi (w ⃗ ·⃗ xi + b) − 1] 2 i (Note that the Lagrangian is an objective function which includes equality constraints). Where L is the function we want to maximize, and the sum is the sum of the constraints, each with a multiplier αi . So then to get the maximum, we just compute the partial derivatives and look for zeros: ∑ ∑ ∂L =w ⃗− α i yi ⃗ xi = 0 → w ⃗= α i yi ⃗ xi ∂w ⃗ i i ∑ ∑ ∂L =− α i yi = 0 → α i yi = 0 ∂b i i Let’s take these partial derivatives and re-use them in the original Lagrangian: L= ∑ ∑ ∑ ∑ ∑ 1 ∑ ( α i yi ⃗ xi ) · ( αj yj ⃗ xj ) − αi yi ⃗ xi · ( αj yj ⃗ xj ) − αi yi b + αi 2 i j i j Which simplifies to: L= ∑ αi − 1 ∑∑ α i α j yi yj ⃗ xi · ⃗ xj 2 i j We see that this depends on ⃗ xi · ⃗ xj . Similarly, we can rewrite our decision rule, substituting for w ⃗. 
w ⃗= ∑ α i yi ⃗ xi i ∑ w ⃗ · u⃗ + b ≥ 0 αi yi ⃗ xi · u⃗ + b ≥ 0 i And similarly we see that this depends on ⃗ xi · u⃗. The nice thing here is that this works in a convex space (proof not shown) which means that it cannot get stuck on a local maximum. 348 CHAPTER 13. SUPERVISED LEARNING 349 13.8. SUPPORT VECTOR MACHINES Sometimes you may have some training data ⃗ x which is not linearly separable. What you need is a transformation, ϕ(⃗ x ) to take the data from its current space to a space where it is linearly separable. Since the maximization and the decision rule depend only on the dot products of vectors, we can just substitute the transformation, so that: • we want to maximize ϕ(⃗ xi ) · ϕ(⃗ xj ) • for the decision rule, we have ϕ(⃗ xi ) · ϕ(⃗ u) Since these are just dot products between the transformed vectors, we really only need a function which gives us that dot product: K(⃗ xi , ⃗ xj ) = ϕ(⃗ xi ) · ϕ(⃗ xj ) This function K is called the kernel function. So if you have the kernel function, you don’t even need to know the specific transformation - you just need the kernel function. Some popular kernels: • linear kernel: K(⃗ u, ⃗ v ) = (⃗ u·⃗ v + 1)n • radial basis kernel: e − ||⃗ xi −⃗ xj || σ More on kernels: The Kernel Trick Many machine learning algorithms can be written in the form: wTx + b = b + m ∑ αi x T x (i) i=1 Where α is a vector of coefficients. We can substitute x with the output of a feature function ϕ(x) and the dot product x T x (i) with a function k(x, x (i) ) = ϕ(x)T ϕ(x (i ). This function k is called a kernel. Thus we are left with: f (x) = b + m ∑ αi k(x, x (i) ) i=1 ϕ maps x to a linear space, so this final function f is linear; as such, x can be non-linear. Machine learning algorithms which use this trick are called kernel methods or kernel machines. CHAPTER 13. SUPERVISED LEARNING 349 13.9. DECISION TREES 350 13.9 Decision Trees Basic algorithm: 1. Start with data all in one group 2. Find some criteria which best splits the outcomes 3. Divide the data into two groups (which become the leaves) on that split (which becomes a node) 4. Within each split, repeat 5. Repeat until the groups are too small or are sufficiently “pure” (homogeneous) Classification trees are non-linear models: • They use interactions b/w variables • Data transformations may be less important (monotone transformations probably won’t affect how data is split) • Trees can be used for regression problems (continuous outcome) 13.9.1 Measures of impurity p̂mk = 1 Nm ∑ ⊮(yi = k) xi ∈Leafm That is, within the m leaf you have Nm objects to consider and you count the number of a particular class k in that set of objects and divide it by Nm to get the probability p̂mk . • misclassification error: 1 − p̂mk(m) ; k(m) = most; common; k – 0 = perfect purity – 0.5 = no purity • Gini index: ∑ k̸=k ′ p̂mk × p̂mk ′ = ∑K k=1 p̂mk (1 − p̂mk ) = 1 − ∑K k=1 2 pmk – 0 = perfect purity – 0.5 = no purity • Deviance/information gain: − ∑K k=1 p̂mk log2 p̂mk – 0 = perfect purity – 1 = no purity 350 CHAPTER 13. SUPERVISED LEARNING 351 13.9.2 13.10. ENSEMBLE MODELS Random forests Random forests are the ensemble model version of decision trees. Basic idea: 1. Bootstrap samples (i.e. resample) 2. At each split in the tree, bootstrap the variables (i.e. only a subset of the variables is considered at each split) 3. Grow multiple trees 4. Each tree votes on a classification This can be very accurate but slow, prone to overfitting (cross-validation helps though), and not easy to interpret. 
However, they generally perform very well. 13.9.3 Classification loss functions Hinge Loss (aka Max-Margin Loss) The hinge loss function takes the form ℓ(y ) = max(0, 1 − t · y ) and is typically used for SVMs (sometimes squared hinge loss is used, which is just the previous equation squared). Cross-entropy loss L(y , ŷ ) = − 1 ∑∑ yn,i log ŷn,i N n∈N i∈C where • N = number of samples • C = number of classes Typically used for Softmax classifiers. ( e fyi Li = − log ∑ fj j e 13.10 13.10.1 ) Ensemble models Boosting Basically, taking many models and combining their outputs as a weighted average. Basic idea: CHAPTER 13. SUPERVISED LEARNING 351 13.10. ENSEMBLE MODELS 352 1. Take lots of (possibly) weak predictors h1 , . . . , hk , e.g. a bunch of different trees or regression models or different cutoffs. 2. Weight them and combine them by creating a classifier which combine the predictors: f (x) = ∑ sign( Tt=1 αt ht (x)) • • • • • Goal is to minimize error on training set Iteratively select a classifier h at each step Calculate weights based on errors Increase the weight of missed classifications and select the next classifier The sign of the result tells you the class Adaboost is a popular boosting algorithm. One class of boosting is gradient boosting. Boosting typically does very well. more on boosting Here we focus on binary classification. Say we have a classifier h which produces +1 or −1. We have some error rate, which ranges from 0 to 1. A weak classifier is one where the error is just less than 0.5 (that is, it works slightly better than chance). A stronger classifier has an error rate closer to 0. Let’s say we have several weak classifiers, h1 , . . . , hn . We can combine them into a bigger classifier, H(x), where x is some input, which is the sum of the individual weak classifiers, and take the sign of the result. In this sense, the weak classifiers vote on the classification: ∑ H(x) = sign( hi (x)) i How do we generate these weak classifiers? • We can create one by taking the data, training classifiers on it, and selecting with the smallest error rate (this will be classifier h1 .) • We can create another by taking the data and giving it some exaggeration of h1 ’s errors (e.g. pay more attention to the samples that h1 has trouble one). Training a new classifier on this gives us h2 . • We can create another by taking the data and giving it some exaggeration to the samples where the results of h1 ̸= h2 . Training a new classifier on this gives us h3 . This process can be recursive. That is, h1 could be made up of three individual classifiers as well, and so could h2 and h3 . 352 CHAPTER 13. SUPERVISED LEARNING 353 13.11. OVERFITTING For our classifiers we could use decision tree stumps, which is just a single test to divide the data into groups (i.e. just a part of a fuller decision tree). Note that boosting doesn’t have to use decision tree (stumps), it can be used with any classifier. We can assign a weight to each training example, wi , where to start, all weights are uniform. These weights can be adjusted to exaggerate certain examples. For convenience, we keep it so that all ∑ weights sum to 1, wi = 1, thus enforcing a distribution. We can compute the error ϵ of a given classifier as the sum of the weights of the examples it got wrong. For our aggregate classifier, we may want to weight the classifiers with the weights α1 , . . . , αn . 
∑ H(x) = sign( αi hi (x)) i The general algorithm is: • We can set the starting weights wit for our training examples to be of examples and t = 1, representing the time (or the iteration). • Then we pick a classifier ht which minimizes the error rate. • Then we can pick αt . • And we can calculate w t+1 . • Then repeat. wt 1 N where N is the number Now suppose wit+1 = Zi e −α h (x)y (x) , where y (x) gives you the right classification (the right sign) for a given Training example. So if ht (x) correctly classifies a sample, then it and y (x) will be the same sign, so it will be a positive exponent. Otherwise, if ht (x) gives the incorrect sign, it will be a negative exponent. Z is some normalizing value so that we get a distribution. t t t We want to minimize the error bound for H(x) if αt = 12 ln 1−ϵ ϵt . 13.10.2 Stacking Stacking is similar to boosting, except that you also learn the weights for the weighted average by wrapping the ensemble of models into another model. 13.11 Overfitting Overfitting is a problem where your hypothesis describes the training data too well, to the point where it cannot generalize to new examples. It is a high variance problem. In contrast, underfitting is a high bias problem. To clarify, if your model has no bias, it means that it makes no errors on your training data (i.e. it does not underfit). If your model has no variance, it means your model generalizes well on your test data (i.e. it does not overfit). It is possible to have bias and variance problems simultaneously. CHAPTER 13. SUPERVISED LEARNING 353 13.12. REGULARIZATION 354 Another way to think of this is that: • variance = how much does the model vary if the training data changes? I.e. what space of possible models does this cover? High variance implies that the model is too sensitive to the particular training examples it looked at, and thus will not adapt well to other examples. • bias = is the average model close to the “true” solution/model? High bias means that the model is systematically incorrect. Bias and Variance There is a bias-variance trade-off, in which improvement of one is typically at the detriment of the other. You can think of generalization error as the sum of bias and variance. You want to keep both low, if possible. Overfitting can happen if your hypothesis is too complex, which can happen if you have too many features. So you will want to through a feature selection phase and pick out features which seem to provide the most value. Alternatively, you can use the technique of regularization, in which you keep all your features, but reduce the magnitudes/values of parameters θj . This is a good option all of your features are informative and you don’t want to get rid of any. 13.12 Regularization Regularization can be defined as any method which aims to reduce the generalization error of a model though it may not have the same effect on the training error. Since good generalization error is the main goal of machine learning, regularization is essential to success. Perhaps the most common form of regularization aims to favor smaller parameters. The intuition is that, if you have small values for your parameters θ0 , θ1 , . . . , θn , then you have a “simpler” hypothesis which is less prone to overfitting. In practice, there may be many combination of parameters which fit your data well. However, some may overfit/not generalize well. We want to introduce some means of valuing these simpler hypotheses over more complex ones (i.e. with larger parameters). 
We can do so with regularization. 354 CHAPTER 13. SUPERVISED LEARNING 355 13.12. REGULARIZATION So generally regularization is about shrinking your parameters to make them smaller; for this reason it is sometimes called weight decay. For linear regression, you accomplish this by modifying the cost ∑ function to include the term λ ni=1 θj2 at the end: J(θ) = m n ∑ 1 ∑ (hθ (x (i) ) − y (i) )2 + λ θj2 2m i=1 i=1 Note that we are not shrinking θ0 . In practice, it does not make much of a difference if you include it or not; standard practice is to leave it out. λ here is called the regularization parameter. It tunes the balance between fitting the data and keeping the parameters small (i.e. each half of the cost function). If you make λ too large for your problem, you may make your parameters too close to 0 for them to be meaningful - large values of lambda can lead to underfitting problems (since the parameters get close to 0). ∑ The additional λ ni=1 θj2 term is called the regularization loss, and the rest of the loss function is called the data loss. 13.12.1 Ridge regression A regularization method used in linear regression; the L2 norm of the parameters is constrained so that it less than or equal to some specified value (that is, this is L2 regularization): N ∑ p ∑ i=1 j=1 (yi − β0 − β̂ = argmin( β 2 xij βj ) + λ p ∑ βj2 ) j=1 Where: • λ ≥ 0 is a hyperparameter controlling the amount of shrinkage. • N is the number of data points • p is the number of dimensions 13.12.2 LASSO LASSO (Least Absolute Shrinkage and Selection Operator) is a regularization method which constrains the L1 norm of the parameters such that it is less than or equal to some specified value: p p N ∑ ∑ 1∑ 2 β̂ = argmin( (yi − β0 − xij βj ) + λ |βj |) 2 i=1 β j=1 j=1 (These two regularization methods are sometimes called shrinkage methods) CHAPTER 13. SUPERVISED LEARNING 355 13.12. REGULARIZATION 13.12.3 356 Regularized Linear Regression We can update gradient descent to work with our regularization term: m 1 ∑ (hθ (x (i) ) − y (i) ) · x0(i) θ0 := θ0 − α m i=1 θj := θj − α m 1 ∑ λ (hθ (x (i) ) − y (i) ) · xj(i) + θj m i=1 m j = (1, 2, 3, . . . , n) The θj part can be re-written as: θj := θj (1 − α m λ 1 ∑ )−α (hθ (x (i) ) − y (i) ) · xj(i) m m i=1 If we are using the normal equation, we can update it to a regularized form as well: θ = (X T X + λM)−1 X T y Where M is an n + 1 × n + 1 matrix, where the diagonal is all ones, except for the element at (0, 0) which is 0, and every other element is also 0. 13.12.4 Regularized Logistic Regression We can also update the logistic regression cost function with the regularization term: m n λ ∑ 1 ∑ (i) (i) (i) (i) θj2 J(θ) = − [ y log(hθ (x )) + (1 − y )log(1 − hθ (x ))] + m i=1 2m i=1 Then we can update gradient descent with the new derivative of this cost function for the parameters θj where j ̸= 0 θ0 := θ0 − α θj := θj − α m 1 ∑ (hθ (x (i) ) − y (i) ) · x0(i) m i=1 m 1 ∑ λ (hθ (x (i) ) − y (i) ) · xj(i) + θj m i=1 m j = (1, 2, 3, . . . , n) It looks the same as the one for linear regression, but again, the actual hypothesis function hθ is different. 356 CHAPTER 13. SUPERVISED LEARNING 357 13.13 13.13. PROBABILISTIC MODELING Probabilistic modeling Fundamentally, machine learning is all about data: • Stochastic, chaotic, and/or complex generative processes • Noisily observed • Partially observed So there is a lot of uncertainty - we can use probability theory to express this uncertainty in the form of probabilistic models. 
Generally, learning probabilistic models is known as probablistic machine learning; here we are primarily concerned with non-Bayesian machine learning. We have some data x1 , x2 , . . . , xn and some latent variables y1 , y2 , . . . , yn we want to uncover, which correspond to each of our data points. We have a parameter θ. A probabilistic model is just a parameterized joint distribution over all the variables: P (x1 , . . . , xn , y1 , . . . , yn |θ) We usually interpret such models as a generative model - how was our observed data generated by the world? So the problem of inference is about learning about our latent variables given the observed data, which we can get via the posterior distribution: P (y1 , . . . , yn |x1 , . . . , xn , θ) = P (x1 , . . . , xn , y1 , . . . , yn |θ) P (x1 , . . . , xn |θ) Learning is typically posed as a maximum likelihood problem; that is, we try to find θ which maximizes the probability of our observed data: θML = argmax P (x1 , . . . , xn |θ) θ Then, to make a prediction we want to compute the conditional distribution of some future data: P (xn+1 , yn+1 |x1 , . . . , xn , θ) Or, for classification, if we some classes, each parameterizing a joint distribution, we want to pick the class which maximizes the probability of the observed data: argmax P (xn+1 |θc ) c CHAPTER 13. SUPERVISED LEARNING 357 13.13. PROBABILISTIC MODELING 13.13.1 358 Discriminative vs Generative learning algorithms Discriminative learning algorithms include algorithms like logistic regression, decision trees, kNN, and SVM. Discriminative approaches try to find some way of separating data (discriminating them), such as in logistic regression which tries to find a dividing line and then sees where new data lies in relation to that line. They are unconcerned with how the data was generated. Say our input features are x and y is the class. Discriminative learning algorithms learn P (y |x) directly, that is it tries to learn the probability of y directly as a function of x. To put it another way, what is the probability this new data is of class y given the features x? Generative learning algorithms instead tries to develop a model for each class and sees which model new data conforms to. Generative learning algorithms learn P (x|y ) and P (y ) instead (that is, it models the joint distribution P (x, y )). So instead they ask, if this were class y , what is the probability of seeing these new features x? You’re basically trying to figure out what class is most likely to have generated the given features x. P (y ) is the class prior/the prior probability of seeing the class y , that is the probability of class y if you don’t have any other information. It is easier to estimate the conditional distribution P (y |x) than it is the joint distribution P (x, y ), though generative models can be much stronger. With P (x, y ), it is easy to calculate the same ) conditional (P (y |x) = PP(x,y (x) ). For both discriminative and generative approaches, you will have parameters and latent variables θ which govern these distributions. We treat θ as a random variable. 13.13.2 Maximum Likelihood Estimation (MLE) Say we have some observed values x1 , x2 , . . . , xn , generated by some latent model parameterized by θ, i.e. f (x1 , x2 , . . . , xn ; θ), where θ represents a single unknown parameter or a vector of unknown parameters. If we flip this we get the likelihood of θ, L(θ; x1 , x2 , . . . , xn ), which is the probability of θ, given the observed data. 
Likelihood is just the name for the probability of observed data as a function of the parameters. The maximum likelihood estimation is the θ (parameter) which maximizes this likelihood. That is, the value of θ which generates the observed values with the highest probability. MLE can be done by computing the derivative of the likelihood and solving for zero. It is a very common way of estimating parameters. (The Expectation-Maximization (EM) algorithm is a way of computing a maximum likelihood estimate for situations where some variables may be hidden.) If the random variables associated with the values, i.e. X1 , X2 , . . . , Xn , are iid, then the likelihood is just: 358 CHAPTER 13. SUPERVISED LEARNING 359 13.13. PROBABILISTIC MODELING L(θ; x1 , x2 , . . . , xn ) = n ∏ f (xi |θ) i=1 Sometimes this is just notated L(θ). So we are looking to estimate the θ which maximizes this likelihood (this estimate is often notated θ̂, the hat typically indicates an estimator): θ̂ = argmax L(θ; x1 , x2 , . . . , xn ) θ Logarithms are used, however, for convenience (i.e. dealing with sums rather than products), so instead we are often maximizing the log likelihood (which has its maximum at the same value (i.e. the same argmax) as the regular likelihood, though the actual maximum value may be different): ℓ(θ) = n ∑ log(f (xi |θ)) i=1 Another way of explaining MLE: We have some data X = {x (1) , x (2) , . . . , x (m) } and a parametric probability distribution p(x; θ). The maximum likelihood estimate for θ is: θML = argmax p(X; θ) θ = argmax θ m ∏ p(x (i) ; θ) i=1 (Notation note: p(X; θ) is read “the probability density of X as parameterized by θ”) Though the logarithm version mentioned above is typically preferred to avoid underflow. Typically, we are more interested in the conditional probability P (y |x; θ) (i.e. the probability of y given x, parameterized by θ), in which case, given all our inputs X and all our targets Y , we have the conditional maximum likelihood estimator: θML = argmax P (Y |X; θ) θ Assuming the examples are iid, this is: θML = argmax θ CHAPTER 13. SUPERVISED LEARNING m ∑ log P (y (i) |x (i) ; θ) i=1 359 13.13. PROBABILISTIC MODELING 360 Example Say we have a coin which may be unfair. We flip it ten times and get HHHHTTTTTT (we’ll call this observed data X). We are interested in the probability of heads, π, for this coin, so we can determine if it’s unfair or not. Here we just have a binomial distribution so the parameters here are n and p (or π as we are referring to it here). We know n as it is the sample size, so that parameter is easy to “estimate” (i.e. we already know it). All that’s left is the parameter p to estimate. So we can just use MLE to make this estimation; for binomial distributions it is rather trivial Because p is the probability of a successful trial, and it’s intuitive that the most likely p just reflects the number of observed successes over the total number of observed trials: π̃MLE = argmax P (X|π) π P (y |X) ≈ P (y |π̃MLE ) Where y is the outcome of the next coin flip. For this case our MLE would be π̃MLE = 0.4 because that is most likely to have generated our 4 observed data (we saw 10 heads). Also formulated as: θ̂ = argmax p(x (1) , . . . , x (n) ) θ For a Gaussian distribution, the sample mean is the MLE. 13.13.3 Expectation Maximization The expectation maximization (EM) algorithm is a two-staged iterative algorithm. Say you have a dataset which is missing some values. How can you complete your data? The EM algorithm allows you to do so. 
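Before walking through these two stages, the coin example above can be checked numerically. The following sketch (assuming NumPy) maximizes the Bernoulli log-likelihood over a grid of candidate values of π and recovers the same estimate, π̂ ≈ 0.4:

# Minimal sketch: numerically maximizing the log-likelihood for the coin
# example above (4 heads out of 10 flips). Assumes NumPy.
import numpy as np

heads, flips = 4, 10
pis = np.linspace(0.001, 0.999, 999)     # candidate values of pi

# log L(pi) = heads*log(pi) + (flips - heads)*log(1 - pi)
# (the binomial coefficient is constant in pi, so it is dropped)
log_lik = heads * np.log(pis) + (flips - heads) * np.log(1 - pis)

pi_mle = pis[np.argmax(log_lik)]         # ~0.4, i.e. heads / flips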
The two stages work as such: 1. Begin with initial parameters θ̂(t) , t = 0. 2. The “E-step” 1. Using the current parameters θ̂(t) , compute probabilities for each possible completion of the missing data. 2. Use these probabilities to create a weighted training set of these possible completions. 3. The “M-step” 1. Use a modified version of MLE (one which can deal with weighted samples) to derive new parameter estimates, θ̂(t+1) . 360 CHAPTER 13. SUPERVISED LEARNING 361 13.13. PROBABILISTIC MODELING 4. Repeat the E and M steps until convergence. Intuitively, what EM does is tries to find the parameters θ̂ which maximizes the log probability log P (x|θ) of the observed data x, much like MLE, except does so under the conditions of incomplete data. EM will converge on a local optimum (maximum) for this log probability. Example Say we have two coins A, B, which may not be fair coins. We conduct 5 experiments in which we randomly choose one of the coins (with equal probability) and flip it 10 times. We have the following results: 1. HTTTHHTHTH 2. HHHHTHHHHH 3. HTHHHHHTHH 4. HTHTTTHHTT 5. THHHTHHHTH We are still interested in learning a parameter for each coin, θ̂A , θ̂B , describing the probability of heads for each. If we knew which coin we flipped during each experiment, this would be a simple MLE problem. Say we did know which coin was picked for each experiment: 1. B: HTTTHHTHTH 2. A: HHHHTHHHHH 3. A: HTHHHHHTHH 4. B: HTHTTTHHTT 5. A: THHHTHHHTH Then we just use MLE and get: 24 = 0.8 24 + 6 9 θ̂B = = 0.45 9 + 11 θ̂A = That is, for each coin we just compute num heads total trials . But, alas, we are missing the data of which coin we picked for each experiment. We can instead apply the EM algorithm. Say we initially guess that θ̂A(0) = 0.60, θ̂B(0) = 0.50. For each experiment, we’ll compute the probability that coin A produced those results and the same probability for coin B. Here we’ll just show the computation for the first experiment. CHAPTER 13. SUPERVISED LEARNING 361 13.13. PROBABILISTIC MODELING 362 We’re dealing with a binomial distribution here, so we are using: ( ) P (x) = n p(1 − p)n−x , p = θ̂ x ( ) The binomial coefficient is the same for both coins (the xn term) and cancels out in normalization, so we only care about the remaining factors. So we will instead just use: P (x) = p(1 − p)n−x , p = θ̂ For the first experiment we have 5 heads (and n = 10). Using our current estimates for θ̂A , θ̂B , we compute: θA5 (1 − θA )10−5 ≈ 0.0008 θB5 (1 − θB )10−5 ≈ 0.0010 0.0008 ≈ 0.44 0.0008 + 0.0010 0.0010 ≈ 0.56 0.0008 + 0.0010 So for this first iteration and for the first experiment, we estimate that the chance of the picked coin being coin A is about 0.44, and about 0.56 for coin B. Then we generate the weighted set of these possible completions by computing how much each of these coins, as weighted by the probabilities we just computed, contributed to the results for this experiment ((5H, 5T )): 0.44(5H, 5T ) = (2.2H, 2.2T ), (coin A) 0.56(5H, 5T ) = (2.8H, 2.8T ), (coin B) Then we repeat this for the rest of the experiments, getting the following weighted values for each coin for each experiment: coin A 2.2H, 2.2T 7.2H, 0.8T 5.9H, 1.5T 1.4H, 2.1T 4.5H, 1.9T coin B 2.8H, 2.8T 1.8H, 0.2T 2.1H, 0.5T 2.6H, 3.9T 2.5H, 1.1T and sum up the weighted values for each coin: coin A coin B 21.3H, 8.6T 11.7H, 8.4T 362 CHAPTER 13. SUPERVISED LEARNING 363 13.14. REFERENCES Then we use these weighted values and MLE to update θ̂A , θ̂B , i.e.: 21.3 ≈ 0.71 21.3 + 8.6 11.7 ≈ ≈ 0.58 117. 
+ 8.4 θ̂A(1) ≈ θ̂B(1) And repeat until convergence. Expectation Maximization as a Generalization of K-Means In K-Means we make hard assignments of datapoints to clusters (that is, they belong to only one cluster at a time, and that assignment is binary). EM is similar to K-Means, but we use soft assignments instead - datapoints can belong to multiple clusters in varying strengths (the “strengths” are probabilities of assignment to each cluster). When the centroids are updated, they are updated against all points, weighted by assignment strength (whereas in K-Means, centroids are updated only against their members). EM converges to approximately the same clusters as K-Means, except datapoints still have some membership to other clusters (though they may be very weak memberships). In EM, we consider that each datapoint is generated from a mixture of classes. For each K classes, we have the prior probability of that class P (C = i ) and the probability of the datapoint given that class P (x|C = i ). P (x) = K ∑ P (C = i )P (x|C = i ) i=1 These terms may be notated: π = P (C = i ) µi ∑ = P (x|C = i ) i What this is modeling here is that each centroid is the center of a Gaussian distribution, and we try to fit these centroids and their distributions to the data. 13.14 References • Review of fundamentals, IFT725. Hugo Larochelle. 2012. • Exploratory Data Analysis Course Notes. Xing Su. • Mining Massive Datasets (Coursera & Stanford, 2014). Jure Leskovec, Anand Rajaraman, Jeff Ullman. CHAPTER 13. SUPERVISED LEARNING 363 13.14. REFERENCES 364 • Machine Learning. 2014. Andrew Ng. Stanford University/Coursera. • CS188: Artificial Intelligence. Dan Klein, Pieter Abbeel. University of California, Berkeley (edX). • Evaluating Machine Learning Models. Alice Zheng. 2015. • Computational Statistics II (code). Chris Fonnesbeck. SciPy 2015. • Intro to Artificial Intelligence. CS271. Peter Norvig, Sebastian Thrun. Udacity. • MIT 6.034 (Fall 2010): Artificial Intelligence. Patrick H. Winston. MIT. • Deep Learning. Yoshua Bengio, Ian Goodfellow, Aaron Courville. • CS231n Convolutional Neural Networks for Visual Recognition, Module 1: Neural Networks Part 2: Setting up the Data and the Loss. Andrej Karpathy. • POLS 509: Hierarchical Linear Models. Justin Esarey. • Bayesian Inference with Tears. Kevin Knight, September 2009. • Learning to learn, or the advent of augmented data scientists. Simon Benhamou. • Practical Bayesian Optimization of Machine Learning Algorithms. Jasper Snoek, Hugo Larochelle, Ryan P. Adams. • What is the expectation maximization algorithm?. Chuong B Do & Serafim Batzoglou. • Gibbs Sampling for the Uninitiated. Philip Resnik, Eric Hardisty. June 2010. • Maximum Likelihood Estimation. Penn State Eberly College of Science. • Data Science Specialization. Johns Hopkins (Coursera). 2015. • Practical Machine Learning. Johns Hopkins (Coursera). 2015. • Elements of Statistical Learning. 10th Edition. Trevor Hastie, Robert Tibshirani, Jerome Friedman. • CS231n Convolutional Neural Networks for Visual Recognition, Linear Classification. Andrej Karpathy. • How does expectation maximization work?. joriki. • How does expectation maximization work in coin flipping problem. joriki. • Deep Learning Tutorial - PCA and Whitening. Chris McCormick. 364 CHAPTER 13. SUPERVISED LEARNING 365 14 Neural Nets When it comes down to it, a neural net is just a very sophisticated way of fitting a curve. 
Neural networks with at least one hidden layer are universal approximators, which means that they can approximate any (continuous) function. This approximation can be improved by increasing the number of hidden neurons in the network (but increases the risk of overfitting). A key advantage to neural networks is that they are capable of learning features independently without much human involvement. Neural networks can also learn distributed representations. Say we have animals of some type and color. We could learn representations for each of them, e.g. with a neuron for a red cat, one for a blue dog, one for a blue cat, etc. But this would mean learning many, many representations (for instance, with three types and three colors, we learn nine representations). With distributed representations, we instead have neurons that learn the different colors and other neurons which learn the different types (for instance, with three types and three colors, we have six neurons). Thus the representation of a given case, such as a blue dog, is distributed across the neurons, and ultimately much more compact. One challenge with neural networks is that it is often hard to understand what a neural network has learned - they may be black boxes which do what we want, but we can’t peer in and clearly see how it’s doing it. 14.1 Biological basis Artificial neural networks (ANNs) are (conceptually) based off of biological neural networks such as the human brain. Neural networks are composed of neurons which send signals to each other in response to certain inputs. A single neuron takes in one or more inputs (via dendrites), processes it, and fires one output (via its axon). CHAPTER 14. NEURAL NETS 365 14.2. PERCEPTRONS 366 Source: http://ulcar.uml.edu/~iag/CS/Intro-to-ANN.html Note that the term “unit” is often used instead of “neuron” when discussing artificial neural networks to dissociate these from the biological version - while there is some basis in biological neural networks, there are vast differences, so it is a deceit to present them as analogous. 14.2 Perceptrons A perceptron, first described by Frank Rosenblatt in 1957, is an artificial neuron (a computational model of a biological neuron, first introduced in 1943 by Warren McCulloch and Walter Pitts). Like a biological neuron, it has multiple inputs, processes them, and returns one output. Each input has a weight associated with it. In the simplest artificial neuron, a “binary” or “classic spin” neuron, the neuron “fires” an output of “1” if the weighted sum of these inputs is above some threshold, or “-1” if otherwise. A single-layer perceptron can’t learn XOR: A line can’t be drawn to separate the As from the Bs; that is, this is not a linearly separable problem. Single-layer perceptrons cannot represent linearly inseparable functions. 14.3 Sigmoid (logistic) neurons A sigmoid neuron is another artificial neuron, similar to a perceptron. However, while the perceptron has a binary output, the sigmoid neuron has a continuous output, σ(w · x + b), defined by a special activation function known as the sigmoid function σ (also known as the logistic function): σ(z) ≡ 366 1 . 1 + e −z CHAPTER 14. NEURAL NETS 367 14.4. ACTIVATION FUNCTIONS XOR which can also be written: 1 1 + exp(− ∑ j wj xj − b) . Note that if z = w · x + b is a large positive number, then e −z ≈ 0 and thus σ(z) ≈ 1. If z is a large negative number, then e −z → ∞ and thus σ(z) ≈ 0. So at these extremes, the sigmoid neuron behaves like a perceptron. 
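As a small illustration, the following sketch (assuming NumPy; the weights, inputs, and bias are made-up values) pushes the same weighted input through a perceptron-style threshold and through the sigmoid:

# Minimal sketch: a perceptron-style unit vs. a sigmoid unit on the same
# weighted input. Assumes NumPy; weights, inputs and bias are made up.
import numpy as np

w = np.array([0.7, -1.2, 0.3])   # weights
x = np.array([1.0, 0.5, 2.0])    # inputs
b = 0.1                          # bias

z = np.dot(w, x) + b             # z = w . x + b

perceptron_out = 1 if z > 0 else -1        # hard threshold: -1 or 1
sigmoid_out = 1.0 / (1.0 + np.exp(-z))     # smooth output in (0, 1)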
Here is the sigmoid function visualized: Which is a smoothed out step function (which is how a perceptron operates): Sigmoid neurons are useful because small changes in weights and biases will only produce small changes in output from a given neuron (rather than switching between binary output values as is the case with the step function, which is typically too drastic). 14.4 Activation functions The function that determines the output of a neuron is known as the activation function. In the binary/classic spin case, it might look like: CHAPTER 14. NEURAL NETS 367 14.4. ACTIVATION FUNCTIONS 368 The sigmoid function The step function 368 CHAPTER 14. NEURAL NETS 369 14.4. ACTIVATION FUNCTIONS weights = [...] inputs = [...] sum_w = sum([weights[i] * inputs[i] for i in range(len(inputs))]) def activate(sum_w, threshold): return 1 if sum_w > threshold else -1 Or: { output = Note that w · x = vectors. ∑ j ∑ −1 if j wj xj ≤ threshold ∑ 1 if j wj xj > threshold wj xj , so it can be notated as a dot product where the weights and inputs are In some interpretations, the “binary” neuron returns “0” or “1” instead of “-1” or “1”. An activation function can generally be described as some function: output = f (w · x + b) where b is the bias (see below). 14.4.1 Common activation functions A common activation function is the sigmoid function, which takes input and squashes it to be in [0, 1], it has the form: σ(x) = 1 1 + e −x However, the sigmoid activation function has some problems. If the activation yields values at the tails of 0 or 1, the gradient ends up being almost 0. In backpropagation, this local gradient is multiplied with the gradient of the node’s output against the total error - if this local gradient is near 0, it “kills” the gradient preventing any signal from going further backwards in the network. For this reason, when using the sigmoid activation function you must be careful of how you initialize the weights - if they are too large, you will “saturate” the network and kill the gradient in this way. Furthermore, sigmoid outputs are not zero-centered: This is undesirable since neurons in later layers of processing in a Neural Network (more on this soon) would be receiving data that is not zero-centered. This has implications on the dynamics during gradient descent, because if the data coming into a neuron is always positive (e.g. x>0 elementwise in f=wTx+b)), then the gradient on the weights CHAPTER 14. NEURAL NETS 369 14.4. ACTIVATION FUNCTIONS 370 Sigmoid activation function w will during backpropagation become either all be positive, or all negative (depending on the gradient of the whole expression f). This could introduce undesirable zig-zagging dynamics in the gradient updates for the weights. However, notice that once these gradients are added up across a batch of data the final update for the weights can have variable signs, somewhat mitigating this issue. Therefore, this is an inconvenience but it has less severe consequences compared to the saturated activation problem above. (CS231n Convolutional Neural Networks for Visual Recognition, Module 1: Modeling one neuron, Andrej Karpathy) The tanh activation function is another option; it squishes values to be in [−1, 1]. However, while its output is zero-centered, it suffers from the same activation saturation issue that the sigmoid does. 
tanh(x) e z − e −z e z + e −z Note that tanh is really just a rescaled sigmoid function: σ(x) = 1 + tanh( x2 ) 2 The Rectified Linear Unit (ReLU) is f (x) = max(0, x), that is, it just thresholds at 0. Compared to the sigmoid/tanh functions, it converges with stochastic gradient descent quickly. Though there is 370 CHAPTER 14. NEURAL NETS 371 14.4. ACTIVATION FUNCTIONS tanh activation function ReLU activation function CHAPTER 14. NEURAL NETS 371 14.4. ACTIVATION FUNCTIONS 372 not the same saturation issue as with the sigmoid/tanh functions, ReLUs can still “die” in a different sense - their weights can be updated such that the neuron never activates again, which causes the gradient through that neuron to be zero from then on, thus resulting in the same “killing” of the gradient as with sigmoid/tanh. In practice, lowering the learning rate can avoid this. Leaky ReLUs are an attempt to fix this problem. Rather than outputting 0 when x < 0, there will instead be a small negative slope (∼ 0.01) when x < 0. That is, yi = αi x when xi < 0, and αi is some fixed value. However, it does not always work well. Note that αi in this case can also be a parameter to learn instead of a fixed value. These are called parametric ReLUs. Another alternative is a randomized leaky ReLU, where αi is a random variable during training and fixed afterwards. There are also some units which defy the conventional activation form of output = f (w · x + b). One is the Maxout neuron. It’s function is max(w1T x + b1 , w2T x + b2 ), which is a generalization of the ReLU and the leaky ReLU (both are special forms of Maxout). It has the benefits of ReLU but does not suffer the dying ReLU problem, but it’s main drawback is that it doubles the number of parameters for each neuron (since there are two weight vectors and two bias units). Karpathy suggests: Use the ReLU non-linearity, be careful with your learning rates and possibly monitor the fraction of “dead” units in a network. If this concerns you, give Leaky ReLU or Maxout a try. Never use sigmoid. Try tanh, but expect it to work worse than ReLU/Maxout. source Activation Function Propagation Backpropagation 1 1+e −x2 ∂E 1 [ ∂E ∂x ]s = [ ∂y ]s (1+e x2 )(1+e −x2 ) Sigmoid ys = Tanh ys = tanh(xs ) ∂E 1 [ ∂E ∂x ]s = [ ∂y ]s cosh2 xs ReLu ys = max(0, xs ) ∂E [ ∂E ∂x ]s = [ ∂y ]s I{xs > 0} Ramp ys = min(−1, max(1, xs )) ∂E [ ∂E ∂x ]s = [ ∂y ]s I{1− < xs < 1} There is also the hard-tanh activation function, which is an approximation of tanh that is faster to compute and take derivatives of: −1 hardtanh(x) = 1 x 372 x < −1 x >1 otherwise CHAPTER 14. NEURAL NETS 373 14.4. ACTIVATION FUNCTIONS 14.4.2 Softmax function The softmax function (called such because it is like a “softened” maximum function) may be used as the output layer’s activation function. It takes the form: N fi (NETN i ) N e NETi = ∑k j N e NETj To clarify, we are summing over all the output neurons in the denominator. This function has the properties that it sums to 1 and that all of its outputs are positive, which are useful for modeling probability distributions. The cost function to use with softmax is the (categorical) cross-entropy loss function. It has the nice property of having a very big gradient when the target value is 1 and the output is almost 0. 
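The following sketch (assuming NumPy; the scores and one-hot target are made up) computes a softmax output and the categorical cross-entropy loss whose formula is restated just below:

# Minimal sketch: softmax outputs and the categorical cross-entropy loss
# for a single example. Assumes NumPy; scores and target are made up.
import numpy as np

scores = np.array([2.0, 1.0, 0.1])     # net inputs of the output layer
target = np.array([1.0, 0.0, 0.0])     # one-hot label: class 0

exp_scores = np.exp(scores - scores.max())   # shift for numerical stability
probs = exp_scores / exp_scores.sum()        # sums to 1, all positive

loss = -np.sum(target * np.log(probs))       # large when the true class gets low probability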
The categorical cross-entropy loss: Li = − ∑ ti,j log(pi,j ) j 14.4.3 Radial basis functions You can base your activation function off of radial basis functions (RBFs): f (X) = N ∑ ai p(||bi X − ci ||) i=1 where • • • • • X = input vector of attributes p = the RBF c = vector center (peak) of the RBF a = the vector coefficient/weight for each RBF b = the vector coefficient/weight for each input attribute A radial basis function (RBF) is a function which is: • symmetric about its center, which is its peak (with a value of 1) • can be in n dimensions, but always returns a single scalar value r , the distance (usually Euclidean) b/w the input vector and the RBF’s peak: r = ||x − xi || CHAPTER 14. NEURAL NETS 373 14.4. ACTIVATION FUNCTIONS 374 ϕ is used to denote a RBF. A neural network that uses RBFs as its activation functions is known as radial basis function neural network. The 1D Gaussian RBF is an example: 1D Gaussian RBF Defined as: ϕ(r ) = e −r 2 The Ricker Wavelet is another example: The Ricker Wavelet Defined as: r2 ϕ(r ) = (1 − r 2 ) · e − 2 374 CHAPTER 14. NEURAL NETS 375 14.5. FEED-FORWARD NEURAL NETWORKS 14.5 Feed-forward neural networks A feed-forward neural network is a simple neural network with an input layer, and output layer, and one or more intermediate layers of neurons. These layers are fully-connected in that every neuron of the previous layer is connected to every neuron in the following layer. Such layers are also called affine or dense layers. When we describe the network in terms of layers as a “N-layer” neural network, we leave out the input layer (i.e. a 1-layer neural network has an input and an output layer, a 2-layer one has an input, a hidden, and an output layer.). ANNs may also be described by its number of nodes (units), or, more commonly, by the number of parameters in the entire network. (CS231n Convolutional Neural Networks for Visual Recognition, Module 1: Neural Networks Part 1: Setting up the Architecture. Andrej Karpathy. https://cs231n.github.io/neural-networks-1/) This model is often called “feed-forward” because values go into the input layer and are fed into subsequent layers. Different learning algorithms can train such a network so that its weights are adjusted appropriately for a given task. It’s worth emphasizing that the structure of the network is distinct from the learning algorithm which tunes its weights and biases. The most popular learning algorithm for neural networks is backpropagation. 14.6 Training neural networks 14.6.1 Backpropagation The most common algorithm for adjusting a neural network’s weights and biases is backpropagation. Backpropagation is just the calculation of partial derivatives (the gradient) by moving backwards through the network (from output to input), accumulating them by applying the chain rule. In particular, it computes the gradient of the loss function with respect to the weights in the network (i.e. the derivatives of the loss function with respect to each weight in the network) in order to update the weights. We compute the total error for the network on the training data and then want to know how a change in an individual weight within the network affects this total error (i.e. the result of our cost function), e.g. ∂E∂wtotal . i “Backpropagation” is almost just a special term for the chain rule in the context of training neural networks. 
This is because a neural network can be thought of as a composition of functions, in which case to compute the derivative of the overall function, you just apply the chain rule for computing derivatives. To elaborate on thinking of neural network as a “composition of functions”: each layer represents a function taking in the inputs of the previous layer’s output, e.g. if the previous layer is a function that outputs a vector, g(x), then the next layer, if we call it a function f , is f (g(x)). CHAPTER 14. NEURAL NETS 375 14.6. TRAINING NEURAL NETWORKS 376 The general procedure for training a neural network with backpropagation is: • Initialize the neural network’s weights and biases. • Training data is input into the NN to the output neurons, in feed-forward style. • The error of the output is then propagated backwards (from the output layer back to the input layer). • As the error is propagated, weights and biases are adjusted (according to a learning rate, detailed below) to minimize the remaining error between the actual and desired outputs. Consider the following simple neural net: Simple neural network Here’s a single neuron expanded: Single sigmoid neuron Remember that a neuron processes its inputs by computing the dot product of its weights and inputs (i.e. the sum of its weight-input products) and then passes this resulting net input into its activation function (in this case, it is the sigmoid function). 376 CHAPTER 14. NEURAL NETS 377 14.6. TRAINING NEURAL NETWORKS Say we have passed some training data through the network and computed the total error as Etotal . total To update the weight w2,1 , for example, we are looking for the partial derivative ∂E ∂w2,1 , which by the chain rule is equal to: ∂Etotal ∂Etotal ∂o2,1 ∂i2,1 = × × ∂w2,1 ∂o2,1 ∂i2,1 ∂w2,1 Then we take this value and subtract it, multiplied by a learning rate η (sometimes notated α), from the current weight w2,1 to get w2,1 ’s updated weight, though updates are only actually applied after these update values have been computed for all of the network’s weights. If we wanted to calculate the update value for w1,1 , we do something similar: ∂Etotal ∂Etotal ∂o1,1 ∂i1,1 = × × ∂w1,1 ∂o1,1 ∂i1,1 ∂w1,1 Any activation function can be used with backprop, it just must be differentiable anywhere. The chain rule of derivatives (refresher) (adapted from the CS231n notes cited below) Refresher on derivatives: say you have a function f (x, y , z ). The derivative of f with respect to x ∂f is called a partial derivative, since it is only with respect to one of the variables, is notated ∂x and is just a function that tells you how much f changes due to x at any point. The gradient is just a vector of these partial derivatives, so that there is a partial derivative for each variable (i.e. here it would be a vector of the partial derivative of f wrt x, and then wrt y , and then wrt z). ∂f ∂f As a simple example, consider the function f (x, y ) = xy . The derivatives here are just ∂x = y , ∂y =x ∂f What does this mean? Well, take ∂x = y . This means that, at any given point, increasing x by a infinitesimal amount will change the output of the function by y times the amount that x changed. So if y = −3, then any small change in x will decrease f by that amount times −3. Now consider the function f (x, y , z ) = (x + y )z. We can derive this by declaring q = x + y and then re-writing f to be f = qz . We can compute the gradient of f in this form (note that it is the same ∂q ∂f ∂f as f (x, y ) = xy from before): ∂q = z, ∂z = q. 
The gradient of q is also simple: ∂q ∂x = 1, ∂y = 1. We can combine these gradients to get the gradient of f wrt to x, y , z instead of wrt to q, z as we have now. We can get the missing partial derivatives wrt to x and y by using the chain rule, which just requires that we multiply the appropriate partials: ∂f ∂q ∂f ∂f ∂q ∂f = , = ∂x ∂q ∂x ∂y ∂q ∂y In code (adapted from the CS231 notes cited below) # set some inputs CHAPTER 14. NEURAL NETS 377 14.6. TRAINING NEURAL NETWORKS 378 x = -2; y = 5; z = -4 # perform the forward pass q = x + y # q becomes 3 f = q * z # f becomes -12 # perform the backward pass (backpropagation) in reverse order: # first backprop through f = q * z dfdz = q # df/dz = q, so gradient on z becomes 3 dfdq = z # df/dq = z, so gradient on q becomes -4 dqdx = 1. dqdy = 1. # now backprop through q = x + y dfdx = dqdx * dfdq # dq/dx = 1. And the multiplication here is the chain rule! dfdy = dqdy * dfdq # dq/dy = 1 So essentially you can decompose any function into smaller, simpler functions, compute the gradients for those, then use the chain rule to aggregate them into the original function’s gradient. The details Consider the neural network: A neural network Where: • Li is layer i 378 CHAPTER 14. NEURAL NETS 379 • • • • 14.6. TRAINING NEURAL NETWORKS Nji is the jth node in layer i N is the number of layers k is the number of outputs, e.g. classes, or k = 1 for regression ni is the number of nodes in layer i Given training data: (X (1) , Y (1) ) (X (2) , Y (2) ) .. . (X (m) , Y (m) ) Where X (i) ∈ R1×d and • d is the dimensionality of the input • m is the number of training examples Thus X is the input matrix, X ∈ Rm×d . For a node Nji : • • • • • bji is the bias for the node (scalar), i.e. bji ∈ R wji is the weights for the node, wji ∈ R1×ni−1 fj i is the activation function for the node NETij is the net input for the node, NETij = Wji · OUTi−1 + bji , NETij ∈ R OUTij is the output for the node, OUTij = fj i (NETij ), OUTij ∈ R For a layer Li : • bi is the bias vector for the layer, bi ∈ Rni ×1 , i.e. b1i i b 2 bi = ... bni i • W i is the weight matrix for the layer, W i ∈ Rni ×ni−1 , i.e. W1i i W 2 Wi = ... Wni i CHAPTER 14. NEURAL NETS 379 14.6. TRAINING NEURAL NETWORKS 380 • f i is the activation function for the layer, since generally f i = f1i = f2i = · · · = fnii • NETi is the net input for the layer: NETi = W i · OUTi−1 + bi , NETi ∈ Rni ×1 • OUTi is the output for the layer: OUTi = f i (NETi ), OUTi ∈ Rni ×1 Note that OUTN = hθ (X). The feed-forward step is a straightforward algorithm: OUT^0 = X for i in 1...N OUT^i = f^i(W^i \cdot OUT^{i-1}+b^i) With backpropagation, we are interested in the gradient ∇J for each iteration. ∇J includes the ∂J ∂J components ∂W i and ∂b i (that is, how the cost function changes with respect to the weights and j j biases in the network). The main advantage of backpropagation is that it allows us to compute this gradient efficiently. There are other ways we could do it. For instance, we could manually calculate the partial derivatives of the cost function with respect to each individual weight, which, if we had w weights, would require computing the cost function w times, which requires w forward passes. Naturally, given a complex network with many, many weights, this becomes extremely costly to compute. The beauty of backpropagation is that we can compute these partial derivatives (that is, the gradient), with just a single forward pass. 
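Extending the small chain-rule example above to a single sigmoid neuron with a squared-error loss (plain Python; the inputs, weights, and target are made-up values):

# Minimal sketch: backprop through one sigmoid neuron with squared error.
# Plain Python; inputs, weights, bias and target are made-up values.
import math

x1, x2 = 0.5, -1.0           # inputs
w1, w2, b = 0.3, 0.8, 0.1    # parameters
y = 1.0                      # target

# forward pass
net = w1 * x1 + w2 * x2 + b          # net input
out = 1.0 / (1.0 + math.exp(-net))   # sigmoid output
loss = 0.5 * (out - y) ** 2          # squared error (with the 1/2 factor)

# backward pass (chain rule)
dloss_dout = out - y
dout_dnet = out * (1.0 - out)        # derivative of the sigmoid
dloss_dnet = dloss_dout * dout_dnet

dloss_dw1 = dloss_dnet * x1          # since dnet/dw1 = x1
dloss_dw2 = dloss_dnet * x2
dloss_db = dloss_dnet                # since dnet/db = 1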
Some notation: • J is the cost function for the network (what it is depends on the use case) • δ i is the error for layer i • ⊙ is the elementwise product (“Hadamard” or “Schur” product) The error for a layer i is how much the cost function changes wrt to that layer’s net input, i.e. ∂J δ i = ∂NET i. For the output layer, this is straightforward (by applying the chain rule): ∂J ∂J ∂OUTN δ = = ∂NETN ∂OUTN ∂NETN N 380 CHAPTER 14. NEURAL NETS 381 14.6. TRAINING NEURAL NETWORKS Since OUTN = f N (NETN ), then ∂OUTN ∂NETN = (f N )′ (NETN ). Thus we have: δN = ∂J (f N )′ (NETN ) ∂OUTN ∂J Note that for ∂OUT N , we are computing the derivative of the cost function J with respect to each training example’s corresponding output, and then we average them (in some situations, such as when the total number of training examples is not fixed, their sum is used). It is costly to do this across all training examples if you have a large training set, in which case, the minibatch stochastic variant of gradient descent may be more appropriate. (TODO this may need clarification/revision) For the hidden layer prior to the output, LN−1 , we would need to connect that layer’s net input, NETN−1 , to the cost function J: δ N−1 = ∂J ∂J ∂OUTN ∂NETN ∂OUTN−1 = ∂NETN−1 ∂OUTN ∂NETN ∂OUTN−1 ∂NETN−1 We have already calculated the term ∂J ∂OUTN N ∂OUT ∂NETN δ N−1 = δ N Since OUTN−1 = f N−1 (NETN−1 ), then as δ N , so this can be restated: ∂NETN ∂OUTN−1 ∂OUTN−1 ∂NETN−1 ∂OUTN−1 ∂NETN−1 = (f N−1 )′ (NETN−1 )′ . Similarly, since NETN = W N · OUT N−1 + bN , then ∂NETN ∂OUTN−1 = W N. Thus: δ N−1 = W N δ N ⊙ (f N−1 )′ (NETN−1 ) This is how we compute δ i for all i ̸= N, i.e. we push back (backpropagate) the next (“next” in the forward sense) layer’s error, δ i+1 , to Li to get δ i . So we can generalize the previous equation: δ i = W i+1 δ i+1 ⊙ (f i )′ (NETi ), i ̸= N We are most interested in updating weights and biases, rather than knowing the errors themselves. That is, we are most interested in the quantities: ∂J ∂J , ∂W i ∂bi CHAPTER 14. NEURAL NETS 381 14.6. TRAINING NEURAL NETWORKS 382 For any layer Li . These are relatively easy to derive. We want to update the weights such that the error is lowered with the new weights. Thus we compute the gradient of the error with respect to the weights and biases to learn in which way the error is increasing and by how much. Then we move in the opposite direction by that amount (typically weighted by a learning rate). ∂J ∂J ∂NETi = ∂bi ∂NETi ∂bi ∂J ∂J ∂NETi = ∂W i ∂NETi ∂W i We previously showed that δ i = ∂J , ∂NETi so here we have: i ∂J i ∂NET = δ ∂bi ∂bi i ∂J i ∂NET =δ ∂W i ∂W i Then, knowing that NETi = W i · OUTi−1 + bi , we get: ∂NETi =1 ∂bi ∂NETi = OUTi−1 i ∂W Thus: ∂J = δi ∂bi ∂J = δ i OUTi−1 ∂W i Then we can use these for gradient descent. A quick bit of notation: δ j,i refers to layer i ’s error for the jth training example; similarly, OUTj,i refers to layer i ’s output for the jth training example. η ∑ j,i δ (OUTj,i−1 )T m j η ∑ j,i bi → bi − δ m j Wi → Wi − i−1 ∂J i ∂J i So, to clarify, we are computing ∂b for each training example and computing i = δ , ∂W i = δ OUT their average. As mentioned before, in some cases you may only take their sum, which just involves the removal of the m1 term, so you are effectively just scaling the change. 382 CHAPTER 14. NEURAL NETS 383 14.6. TRAINING NEURAL NETWORKS Considerations Backpropagation is not a guarantee of training at all nor of quick training. 
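Before turning to these practical issues, the layer-wise equations above can be collected into a short implementation sketch. It assumes NumPy, sigmoid activations, quadratic cost, and made-up layer sizes and data; the transpose on the next layer's weights when backpropagating the error is written explicitly:

# Minimal sketch: one feed-forward pass and one backpropagation pass for a
# small fully-connected network with sigmoid activations and quadratic cost.
# Assumes NumPy; layer sizes, input and target are made-up values.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

rng = np.random.default_rng(0)
sizes = [3, 4, 2]                     # 3 inputs -> 4 hidden -> 2 outputs
Ws = [rng.normal(0, 0.1, (sizes[i + 1], sizes[i])) for i in range(len(sizes) - 1)]
bs = [np.zeros((sizes[i + 1], 1)) for i in range(len(sizes) - 1)]

x = rng.normal(size=(3, 1))           # one training example (column vector)
y = np.array([[1.0], [0.0]])          # its target

# feed-forward, keeping NET^i and OUT^i for every layer
outs, nets = [x], []
for W, b in zip(Ws, bs):
    nets.append(W @ outs[-1] + b)
    outs.append(sigmoid(nets[-1]))

# output-layer error: delta^N = (OUT^N - y) * (f^N)'(NET^N)
delta = (outs[-1] - y) * sigmoid_prime(nets[-1])
grads_W, grads_b = [], []
for i in reversed(range(len(Ws))):
    grads_W.insert(0, delta @ outs[i].T)   # dJ/dW^i = delta^i (OUT^{i-1})^T
    grads_b.insert(0, delta)               # dJ/db^i = delta^i
    if i > 0:
        delta = (Ws[i].T @ delta) * sigmoid_prime(nets[i - 1])

eta = 0.5                             # learning rate
Ws = [W - eta * gW for W, gW in zip(Ws, grads_W)]
bs = [b - eta * gb for b, gb in zip(bs, grads_b)]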
Possible issues include: • network paralysis: if the weights become very large, the neurons’ OUTs may become very large, where the derivative of the activation function is very small, so weights are not really updated and get “stuck” at large values. • local minima: statistical training methods can be used (such as simulated annealing), to avoid local minima but increase training time • step size: if it is too small, training is too slow, if it is too large, paralysis or instability (no convergence) are possible • stability: that the network does not mess up its learning of something else to learn another thing. For instance, say it learns good weights for one input, but to learn good weights for another input, it “overwrites” or “forgets” what it learned about the prior input. Note that in high dimensions, local minima are usually not a problem. More typically, there are saddle points, which slow down training but can be escaped in time. This is because with many dimensions, it is unlikely that a point is a minimum is all dimensions (if we consider that a point is a minimum in one dimensions with probability p, then it has probability p n to be a minimum in all n dimensions); it is, however, likely that it is a local minimum in some of the dimensions. As training nears the global minimum, p increases, so if you do end up at a local minimum, it will likely be close enough to the global minimum. 14.6.2 Statistical (stochastic) training Statistical (or “stochastic”) training methods, contrasted with deterministic training methods (such as backpropagation as described abov), involve some randomness to avoid local minima. They generally work by randomly leaving local minima to possibly find the global minimum. The severity of this randomness decreases over time so that a solution is “settled” upon (this gradual “cooling” of the randomness is the key part of simulated annealing). Simulated annealing applied as a training method to a neural network is called Boltzmann training (neural networks trained in this way are called Boltzmann machines): 1. set T (the artificial temperature) to a large value 2. apply inputs, calculate outputs and objective function 3. make random weight changes, recalculate network output and change in objective function 4a. if objective function improves, keep weight changes 4b. if the objective function worsens, accept the change according to the probability drawn from Boltzmann distribution, P (c), select a random variable r from a uniform distribution in [0, 1]; if P (c) > r , keep the change, otherwise, don’t. CHAPTER 14. NEURAL NETS 383 14.6. TRAINING NEURAL NETWORKS 384 P (c) = exp( −c ) kT Where: • c the change in the objective function • k a constant analogous to the Boltzmann’s constant in simulated annealing, specific for the current problem • T the artificial temperature • P (c) the probability of the change c in the objective function Steps 3 and 4 are repeated for each of the weights in the network as T is gradually decreased. The random weight change can be selected in a few ways, but one is just choosing it from a Gaussian 2 distribution, P (w ) = exp( −w T 2 ), where P (w ) is the probability of a weight change of size w and T is the artificial temperature. Then you can use Monte Carlo simulation to generate the actual weight change, ∆w . Boltzmann training uses the following cooling rate, which is necessary for convergence to a global minimum: T (t) = T0 log(1 + t) Where T0 is the initial temperature, and t is the artificial time. 
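A minimal sketch of the accept/reject step and this cooling schedule (plain Python; T0, k, and the proposed change in the objective function are made-up placeholder values):

# Minimal sketch: the Boltzmann accept/reject rule and logarithmic cooling
# schedule described above. Plain Python; T0, k and c are made-up values.
import math, random

T0 = 10.0    # initial artificial temperature
k = 1.0      # problem-specific constant

def temperature(t):
    return T0 / math.log(1 + t)          # T(t) = T0 / log(1 + t)

def accept_change(c, T):
    # c = change in the objective function (positive means it got worse)
    if c <= 0:
        return True                      # improvement: always keep
    p = math.exp(-c / (k * T))           # Boltzmann probability P(c)
    return p > random.random()           # keep the worsening move if P(c) > r

# e.g. a worsening of 2.0 in the objective at artificial time t = 5:
keep = accept_change(2.0, temperature(5))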
The problem with Boltzmann training is that it can be very slow (the cooling rate as computed above is very low). This can be resolved by using the Cauchy distribution instead of the Boltzmann distribution; the former has fatter tails so has a higher probability of selecting large step sizes. Thus the cooling rate can be much quicker: T (t) = T0 1+t The Cauchy distribution is: P (x) = T (t) T (t)2 + x 2 where P (x) is the probability of a step of size x. This can be integrated, which makes selecting random weights much easier: xc = ρT (t) tan(P (x)) 384 CHAPTER 14. NEURAL NETS 385 14.6. TRAINING NEURAL NETWORKS Where ρ is the learning rate coefficient and xc is the weight change. Here we can just select a random number from a uniform distribution in (− π2 , π2 ), then substitute this for P (x) and solve for x in the above, using the current temperature. Cauchy training still may be slow so we can also use a method based on artificial specific heat (in annealing, there are discrete energy levels where phase changes occur, at which abrupt changes in the “specific heat” occur). In the context of artificial neural networks, we define the (pseudo)specific heat to be the average rate of change of temperature with the objective function. The idea is that there are parts where the objective function is sensitive to small changes in temperature, where the average value of the objective function makes an abrupt change, so the temperature must be changed slowly here so as not to get stuck in a local minima. Where the average value of the objective function changes little with temperature, large changes in temperature can be used to quicken things. Still, Cauchy training may be much slower than backprop, and can have issues of network paralysis (because it is possible to have very large random weight changes), esp. if a nonlinearity is used as the activation function (see the bit on network paralysis and the sigmoid function above). Cauchy training may be combined with backprop to get the best of both worlds - it simply involves computing both the backprop and Cauchy weight updates and applying their weighted sum as the update. Then, the objective function’s change is computed, and like with Cauchy training, if there is an improvement, the weight change is kept, otherwise, it is kept with a probability determined by the Boltzmann distribution. The weighted sum of the individual weight updates is controlled by a coefficient η, such that the sum is η[α∆Wt−1 + (1 − α)δOUT] + (1 − η)xc , so that if η = 0, the training is purely Cauchy, and if η = 1, it becomes purely backprop. There is still the issue of the possibility of retaining a massive weight change due to the Cauchy distribution’s infinite variance, which creates the possibility of network paralysis. The recommended approach here is to detect saturated neurons by looking at their OUT values - if it is approaching the saturation point (positive or negative), apply some squashing function to its weights (note that this squashing function is not restricted to the range [−1, 1] and in fact may work better with a larger range). This potently reduces large weights while only attenuating smaller ones, and maintains symmetry across weights. 14.6.3 Learning rates The amount weights and biases are adjusted is determined by a learning rate η (sometimes called a delta rule or delta function). This often involves some constant which manages the momentum of learning µ (see below for more on momentum). 
This learning constant can help “jiggle” the network out of local optima, but you want to take care that it isn’t set so high that the network will also jiggle out of the global optima. As a simple example: # LEARNING_CONSTANT is defined elsewhere def adjust_weight(weight, error, input): return weight + error * input * LEARNING_CONSTANT CHAPTER 14. NEURAL NETS 385 14.6. TRAINING NEURAL NETWORKS 386 In some cases, a simulated annealing approach is used, where the learning constant may be tempered (made smaller, less drastic) as the network evolves, to avoid jittering the network out of the global optima. Adaptive learning rates Over the course of training, it is often better to gradually decrease the learning rate as you approach an optima so that you don’t “overshoot” it. Separate adaptive learning rates The appropriate learning rate can vary across parameters, so it can help to have different adaptive learning rates for each parameter. For example, the magnitudes of gradients are often very different across layers (starting small early on, growing larger further on). The fan-in of a neuron (number of inputs) also has an effect, determining the size of “overshoot” effects (the more inputs there are, the more weights are changed simultaneously, all to adjust the same error, which is what can cause the overshooting). So what you can do is manually set a global learning rate, then for each weight multiply this global learning rate by a local gain, determined empirically per weight. One way to determine these learning rates is as follows: • start with a local gain gij = 1 for each weight wij • increase the local gain if the gradient for that weight does not change sign • use small additive increases and multiplicative decreases: gij (t − 1) + 0.05 ∂E ∂E if( ∂w (t) ∂w (t − 1)) > 0 ij ij 0.95gij (t − 1) otherwise gij (t) = This ensures that big gains decay rapidly when oscillations start. Another tip: limit the gains to line in some reasonable range, e.g. [0.1, 10] or [0.01, 100] Note that these adaptive learning rates are meant for full batch learning or for very big mini-batches. Otherwise, you may encounter gradient sign changes that are just due to sampling error of a minibatch. These adaptive learning rates can also be combined with momentum by using agreement in sign between the current gradient for a weight and the velocity for that weight. Note that adaptive learning rates deal only with axis-aligned effects. 14.6.4 Training algorithms There are a variety of gradient descent algorithms used for training neural networks. Some of the more popular ones include the following. 386 CHAPTER 14. NEURAL NETS 387 14.6. TRAINING NEURAL NETWORKS Momentum Momentum is a technique that can be combined with gradient descent to improve its performance. Conceptually, it applies the idea of velocity and friction to the error surface (imagine a ball rolling around the error surface to find a minimum). We incorporate a matrix of velocity values V , with the same shape as the matrix of weights and biases for the network (for simplicity, we will roll the weights and matrices together into a matrix W ). To do so, we break our gradient descent update rule (W → W ′ = W − η∇J) into two separate update rules; one for updating the velocity matrix, and another for updating the weights and biases: V → V ′ = µV − η∇J W → W′ = W + V ′ Another hyperparameter, µ ∈ [0, 1], is also introduced - this controls the “friction” of the system (µ = 1 is no friction). 
It is known as the momentum coefficient, and it is tuned to prevent “overshooting” the minimum. You can see that if µ is set to 0, we get the regular gradient descent update rule. Nesterov momentum A variation on regular momentum is Nesterov momentum: V → V ′ = µV − η∇J(θ + µV ) W → W′ = W + V ′ Here we add the velocity to the parameters before computing the gradient. Using gradient descent with this momentum is also called the Nesterov accelerated gradient. Adagrad Adagrad is an enhancement of Nesterov momentum that keeps track of squared gradients over time. This allows it to identify frequently updated parameters and infrequently updated parameters - as a result, learning rates can be adapted per parameter over time (e.g. higher learning rates are assigned to infrequently updated parameters). This makes it quite useful for sparse data, and also means that the learning rate does not need to be manually tuned. More formally, for each parameter θi we have gt,i = ∇J(θi ), so we re-write the update for a parameter θi at time t to be: θt+1,i = θt,i − η · gt,i CHAPTER 14. NEURAL NETS 387 14.6. TRAINING NEURAL NETWORKS 388 The learning rate η is then modified at each time step t by a diagonal matrix Gt ∈ Rd×d . Each diagonal element i , i in Gt is the sum of squared gradients of parameter θi up to time t. In particular, we divide η by the square root of this matrix (empirically, the square root improves performance): η θt+1,i = θt,i − √ · gt,i Gt,ii + ϵ An additional smoothing term ϵ is included to prevent division by zero (e.g. 1e-8). As vector operations, this is written: η θt+1 = θt − √ ⊙ gt Gt + ϵ Note that as training progresses, the denominator term Gt will grow very large (the sum of squared gradients accumulate), such that the learning rate eventually becomes very small and learning virtually ceases. Adadelta Adadelta is an improvement on Adagrad designed to deal with its halting learning rate. Whereas Adagrad takes the sum of all past squared gradients for a parameter, Adadelta takes only the past w past squared gradients. This is implemented as follows. We keep a running average E[g 2 ]t (γ is like a momentum term, usually around 0.9): E[g 2 ]t = γE[g 2 ]t−1 + (1 − γ)gt2 And then update the Adagrad equation to replace the matrix Gt with this running (exponentially decaying) average: η ∆θt = − √ gt E[g 2 ]t + ϵ Note that this running average is the same as the root mean squared (RMS) error criterion of the gradient, so this can be re-written: ∆θt = − η RMS[g]t There is one more enhancement that is part of Adadelta. The learning rate η‘s units (as in “units of measurement”, not as in “hidden units”) do not match the parameters’ units; this is true for all 388 CHAPTER 14. NEURAL NETS 389 14.6. TRAINING NEURAL NETWORKS training methods shown up until now. This is resolved by defining another exponentially decaying average, this one of squared parameter updates: E[∆θ2 ]t = γE[∆θ2 ]t−1 + (1 − γ)∆θt2 The RMS error of this term is also taken: √ RMS[∆θ]t = E[∆θ2 ]t + ϵ Thus the Adadelta update is: ∆θt = − RMS[∆θ]t gt RMS[g]t Note that Adadelta without the numerator RMS term, that is: ∆θt = − η RMS[g]t is known as RMSprop; for RMSprop typical values are γ = 0.9, η = 0.001. Adam Adam, short for “Adaptive Moment Estimation”, is, like Adagrad, Adadelta, and RMSprop, an adaptive learning rate algorithm. 
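Before getting into the details of Adam, the Adagrad and RMSprop updates described above can be written compactly. The sketch below assumes NumPy and uses a made-up toy cost only to show the update rules:

# Minimal sketch: per-parameter Adagrad and RMSprop updates as described
# above. Assumes NumPy; the toy cost and its gradient are made up.
import numpy as np

eta, eps, gamma = 0.01, 1e-8, 0.9

def adagrad_update(theta, grad, G):
    # G accumulates *all* past squared gradients, so the effective
    # learning rate for each parameter only ever shrinks
    G = G + grad ** 2
    theta = theta - eta / np.sqrt(G + eps) * grad
    return theta, G

def rmsprop_update(theta, grad, Eg2):
    # Eg2 is an exponentially decaying average of squared gradients,
    # so learning does not grind to a halt as it can with Adagrad
    Eg2 = gamma * Eg2 + (1 - gamma) * grad ** 2
    theta = theta - eta / np.sqrt(Eg2 + eps) * grad
    return theta, Eg2

# made-up usage on the toy cost J(theta) = ||theta||^2 (gradient = 2*theta)
theta = np.array([1.0, -2.0, 0.5])
G = np.zeros_like(theta)
for t in range(1000):
    theta, G = adagrad_update(theta, 2.0 * theta, G)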
Like Adadelta and RMSprop, Adam keeps track of an exponentially decaying average of past squared gradients (here, it is vt ), but it also keeps track of an exponentially decaying average of past (non-squared) gradients mt , similar to momentum: mt = β1 mt−1 + (1 − β1 )gt vt = β2 vt−1 + (1 − β2 )gt2 mt and vt are estimates of the first moment (the mean) and second moment (the uncentered variance) of the gradients, respectively. Note that the β terms are decay rates (typically β1 = 0.9, β2 = 0.999) and that mt , vt are initialized to zero vectors. As such, mt , vt tend to be a biased towards zero, so bias-corrected versions of each are computed: mt 1 − β1t vt v̂t = 1 − β2t m̂t = CHAPTER 14. NEURAL NETS 389 14.6. TRAINING NEURAL NETWORKS 390 Then the Adam update rule is simply: η θt+1 = θt − √ m̂t v̂t + ϵ 14.6.5 Batch Normalization Batch normalization is a normalization method for mini-batch training, which can improve training time (it allows for higher learning rates), act as a regularizer, and reduce the importance of proper parameter initialization. It is applied to intermediate representations in the network. For a mini-batch x (of size m), the sample mean and variance for each feature k is computed: x̄k = m 1 ∑ xi,k m i=1 σk2 = m 1 ∑ (xi,k − x̄k )2 m i=1 Each feature k is then standardized as follows: xk − x̄k x̂k = √ σk2 + ϵ where ϵ is a small positive constant to improve numerical stability. Standardizing intermediate representations in this way can weaken the representational power of the layer, so two additional learnable parameters γ and β are introduced to scale and/or shift the data. Altogether, the batch normalization function is as follows: BNxk = γk x̂k + βk When γk = σk and βk = x̄k , we recover the original representation. Given a layer with some activation function ϕ, which would typically be defined as ϕ(W x + b), we can redefine it with batch normalization: ϕ(BN(W x)) The bias is dropped because its effect is cancelled by the standardization. During test time, we must use x̂k and σk2 as computed over the training data; the final values are usually achieved by keeping a running average of these statistics over the mini-batches during training. Refer to Batch Normalized Recurrent Neural Networks (César Laurent, Gabriel Pereyra, Philémon Brakel, Ying Zhang, Yoshua Bengio) for more details. 390 CHAPTER 14. NEURAL NETS 391 14.6.6 14.6. TRAINING NEURAL NETWORKS Cost (loss/objective/error) functions We have some cost function, which is a function of our parameters, typically notated J(θ). For regression, this is often the mean squared error (MSE), also known as the quadratic cost: J(θ) = m 1 ∑ (y (i) − hθ (X (i) ))2 m Note that hθ represents the output of the entire network. So for a single example, the cost function is: (y (i) − hθ (X (i) ))2 For deriving convenience, we’ll include a 12 term. Including this term just scales the cost function, which doesn’t impact the outcome, and for clarity, we’ll substitute f N (NETN ) for hθ (X (i) ), since they are equivalent. (y (i) − f N (NETN ))2 2 Deriving with respect to W N and bN gives us the following for individual examples: ∂J = (f N (NETN ) − y )(f N )′ (NETN )X (i) ∂W N ∂J = (f N (NETN ) − y )(f N )′ (NETN ) ∂bN Note that these are dependent on the derivative of the output layer’s activation function, (f N )′ (NET N ). This can cause training to become slow in the case of activation functions like the sigmoid function. This is because the derivative near the sigmoid’s tails (i.e. 
where it outputs values close to 0 or 1) is very low, because the sigmoid flattens out at its tails. Thus, when the output layer's activation has this property and outputs values near 0 or 1 (in the case of the sigmoid), the entire partial derivative shrinks, leading to small updates and slow learning. When slow learning of this sort occurs (that is, the kind caused by activation functions outputting near their minimum or maximum), it is called saturation, and it is a common problem with neural networks.

For binary classification, a common cost function is the cross-entropy cost, also known as "log loss" or "logistic loss":

J(θ) = -(1/m) Σ_{i=1}^{m} Σ_{j=1}^{k} [ y^(i) ln h_θ(X^(i)) + (1 - y^(i)) ln(1 - h_θ(X^(i))) ]

where m is the total number of training examples and k is the number of output neurons.

The partial derivatives of the cross-entropy cost with respect to W^N and b^N are (for brevity, we notate f^N(NET^N) as simply f(n)):

∂J/∂W^N = Σ_{j=1}^{k} [ (y - f(n)) / (f(n)(1 - f(n))) ] f'(n) X
∂J/∂b^N = Σ_{j=1}^{k} [ (y - f(n)) / (f(n)(1 - f(n))) ] f'(n)

This has the advantage that for some activation functions f, such as the sigmoid, the activation function's derivative f' cancels out, thus avoiding the training slowdown that can occur with the MSE. As mentioned before, however, saturation is only an issue for some activation functions (like the sigmoid). It isn't a problem, for instance, with linear activation functions, in which case the quadratic cost is appropriate (though neural nets with linear activation functions are limited in what they can learn). Thus we have:

δ^N = (∂J/∂OUT^N) (f^N)'(NET^N)

Log-likelihood cost function

The log-likelihood cost function is defined, for a single training example, as:

-ln f_y^N(NET_y^N)

That is, given an example that belongs to class y, we take the natural log of the value output by the output node corresponding to class y (typically this is the y-th node, since you'd have one output node per class). If f_y^N(NET_y^N) is close to 1, the resulting cost is low; the further it is from 1, the larger the cost. This assumes that the output nodes' activation function outputs probability-like values (as is the case with the softmax function).

This cost function's partial derivatives with respect to W^N and b^N work out to be:

∂J/∂W^N = f^{N-1}(n) (f^N(n) - y)
∂J/∂b^N = f^N(n) - y

For brevity, we've notated f^N(NET^N) as simply f^N(n), and similarly for f^{N-1}(n); for the latter, n = NET^{N-1}.

Note that with softmax output activations, this cost function avoids the saturation problem. Thus softmax output activations and the log-likelihood cost are a good pairing for problems requiring probability-like outputs (such as classification problems).

Common loss functions

| Loss function | Propagation | Backpropagation |
|---|---|---|
| Square | y = ½(x - d)² | ∂E/∂x = (x - d)^T ∂E/∂y |
| Log, c = ±1 | y = log(1 + e^{-cx}) | ∂E/∂x = -c/(1 + e^{cx}) ∂E/∂y |
| Hinge, c = ±1 | y = max(0, m - cx) | ∂E/∂x = -c · I{cx < m} ∂E/∂y |
| LogSoftMax, c = 1…k | y = log(Σ_k e^{x_k}) - x_c | [∂E/∂x]_s = (e^{x_s}/Σ_k e^{x_k} - δ_{sc}) ∂E/∂y |
| MaxMargin, c = 1…k | y = [max_{k≠c}(x_k + m) - x_c]_+ | [∂E/∂x]_s = (δ_{s,k*} - δ_{sc}) I{E > 0} ∂E/∂y |

(Here k* = argmax_{k≠c}(x_k + m) and I{·} is the indicator function.)
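To close out this discussion of cost functions, here is a small numerical illustration of the saturation slowdown described above and why the cross-entropy cost avoids it for sigmoid outputs. The specific numbers are toy assumptions chosen only to make the effect visible.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A single sigmoid output neuron with a badly saturated pre-activation
z = 8.0          # large net input -> output near 1
y = 0.0          # true label
a = sigmoid(z)   # ~0.9997

# Quadratic (MSE) cost: dJ/dz = (a - y) * sigma'(z); the sigma'(z) = a(1 - a)
# factor is tiny near the tails, so the update is tiny even though the error is large
grad_quadratic = (a - y) * a * (1 - a)

# Cross-entropy cost: the sigma'(z) factor cancels, so dJ/dz = (a - y)
# and the neuron still receives a large corrective signal
grad_cross_entropy = a - y

print(f"output a            = {a:.5f}")
print(f"quadratic dJ/dz     = {grad_quadratic:.6f}")      # ~0.0003
print(f"cross-entropy dJ/dz = {grad_cross_entropy:.6f}")  # ~0.9997

The two gradients differ exactly by the factor σ'(z) = a(1 - a), which is what vanishes when the sigmoid saturates.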
14.6.7 Weight initialization

What are the best values to initialize weights and biases to? Given normalized data, we could reasonably expect roughly half the weights to end up negative and roughly half positive. As a result, it may seem intuitive to initialize all weights to zero. But you should not - this causes every neuron to have the same output, which causes them to have the same gradients during backpropagation, which causes them all to receive the same parameter updates. Thus none of the neurons will differentiate.

Alternatively, we could set each neuron's initial weights to a random vector drawn from a standard multidimensional normal distribution (mean 0, standard deviation 1), scaled by some small value, e.g. 0.001, so that the weights are kept very small but still non-zero. This process is known as symmetry breaking: the random initializations allow the neurons to differentiate themselves during training.

However, this can become problematic. Consider that the net input to a neuron is:

NET = W · X + b

The following extends to the general case, but for simplicity, consider an input X that is all ones, with dimension d. Then NET is a sum of d + 1 (plus one for the bias) standard normally distributed independent random variables. The sum of n normally distributed independent random variables is:

N(Σ_{i=1}^{n} μ_i, Σ_{i=1}^{n} σ_i²)

That is, it is also normally distributed. Thus NET will still have a mean of 0, but its standard deviation will be √(d + 1). If, for example, d = 100, this leaves us with a standard deviation of about 10. This is quite large, and implies that NET may take on large values due to how we initialized our weights. If NET takes on large values, we may run into saturation problems with an activation function such as the sigmoid, which then leads to slow training. Thus, poor weight initialization can lead to slow training. This is most problematic for deep networks, since the effect may reduce the gradient signal that flows backwards by too much (a weaker version of the gradient "killing" effect).

As the number of inputs to a neuron grows, so too does the variance of its output. This can be controlled for (calibrated) by scaling the weight vector by the square root of the neuron's "fan-in" (its number of inputs): divide the random vector sampled from the standard multidimensional normal distribution by √n, where n is the number of the neuron's inputs. For ReLUs, it is recommended you instead scale by √(2/n). (Karpathy's CS231n notes provide more detail on why this is.)

An alternative to this fan-in scaling for the uncalibrated-variance problem is sparse initialization: set all weights to 0, then break symmetry by randomly connecting every neuron to some fixed number (e.g. 10) of neurons below it, setting those weights to values sampled from the standard normal distribution as before.

Biases are commonly initialized to zero, though if using ReLUs you can set them to a small value like 0.01 so that all the ReLUs fire at the start and are included in the gradient backpropagation update.

Elsewhere it is recommended that ReLU weights be sampled from a zero-mean Gaussian distribution with standard deviation √(2/d_in). Elsewhere it is recommended that you sample your weights uniformly from [-b, b], where:

b = √(6 / (H_k + H_{k+1}))

where H_k and H_{k+1} are the sizes of the hidden layers before and after the weight matrix.

14.6.8 Shuffling & curriculum learning

Generally you should shuffle your data every training epoch so the network does not become biased towards a particular ordering.
However, there are cases in which your network may benefit from a meaningful ordering of input data; this approach is called curriculum learning. 394 CHAPTER 14. NEURAL NETS 395 14.6.9 14.6. TRAINING NEURAL NETWORKS Gradient noise Adding noise from a Gaussian distribution to each update, i.e. gt,i = gt,i + N(0, σt2 ) with variance annealed with the following schedule: σt2 = η (1 + t)γ has been shown to make “networks more robust to poor initialization and helps training particularly deep and complex networks. They suspect that the added noise gives the model more chances to escape and find new local minima, which are more frequent for deeper models.” (An overview of gradient descent optimization algorithms, Sebastian Ruder) 14.6.10 Adversarial examples Adding noise to input, such as in the accompanying figure, can throw off a classifier. Few strategies are robust against these tricks, but one approach is to generate these adversarial examples and include them as part of the training set. Adversarial example source 14.6.11 Gradient Checking When you write code to compute the gradient, it can be very difficult to debug. Thus it is often useful to check the gradient by numerically approximating the gradient and comparing it to the computed gradient. Say our implemented gradient function is g(θ). We want to check that g(θ) = ∂J(θ) ∂θ . We choose some ϵ, e.g. ϵ = 0.0001. It should be a small value, but not so small that we run into floating point precision errors. CHAPTER 14. NEURAL NETS 395 14.7. NETWORK ARCHITECTURES 396 Then we can numerically approximate the gradient at some scalar value θ: J(θ + ϵ) − J(θ − ϵ) 2ϵ When θ is a vector, as is more often the case, we instead compute: J(θ(i+) − J(θ(i−) ) 2ϵ Where: • θ(i+) = θ + (ϵ × ei ) • θ(i−) = θ − (ϵ × ei ) • ei is the i th is the basis vector (i.e. it is 0 everywhere except at the i th element, where it is 1) 14.6.12 Training tips Start training with small, unequal weights to avoid saturating the network w/ large weights. If all the weights start equal, the network won’t learn anything. • Normalize real-valued data (subtract mean, divide by standard deviation (see part on data preprocessing)) • Decrease the learning rate during training • Use minibatches for a more stable gradient (e.g. use stochastic gradient descent) • Use momentum to get through plateaus 14.6.13 Transfer Learning The practice of transfer learning involves taking a neural net trained for another task and applying it to a different task. For instance, if using an image classification net trained for one classification task, you can use that same network for another, truncating the output layer, that is, take the vectors from the second-to-last layer and use those as feature vectors for other tasks. 14.7 Network architectures The architecture of a neural network describes how its layers are structured - e.g. how many layers there are, how many neurons in each, and how they are connected. Neural networks are distinguished by their architecture. The general structure of a neural network is input layer -> 0 or more hidden layers -> output layer. 396 CHAPTER 14. NEURAL NETS 397 14.8. OVERFITTING Neural networks always have one input layer, and the size of that input layer is equal to the input dimensions (i.e. one node per feature), though sometimes you may have an additional bias node. Neural networks always have one output layer, and the size of that output layer depends on what you’re doing. For instance, if your neural network will be a regressor (i.e. 
for a regression problem), then you'd have a single output node (unless you're doing multivariate regression). The same goes for binary classification. With softmax (more than two classes), however, you have one output node per class label, with each node outputting the probability that the input belongs to the class associated with that node.

If your data is linearly separable, then you don't need any hidden layers (and you probably don't need a neural network either - a linear or generalized linear model may be plenty). Neural networks with additional hidden layers become difficult to train; networks with multiple hidden layers are the subject of deep learning (detailed below). For many problems, one hidden layer suffices, and you may not see any performance improvement from adding additional hidden layers. A rule of thumb for deciding the size of a hidden layer is that it should be between the input size and the output size (for example, the mean of their sizes).

14.8 Overfitting

Because neural networks can have so many parameters, it can be quite easy for them to overfit, so it is something to always keep an eye out for. This is especially a problem for large neural networks, which have huge numbers of parameters.

As the network grows in number of layers and size, the network capacity increases, which is to say it is capable of representing more complex functions. Simpler networks have fewer local minima, which are easier to converge to but tend to perform worse (they have higher loss). There is a great deal of variance across these local minima, so the outcome is quite sensitive to the random initialization - sometimes you land in a good local minimum, sometimes not. More complex networks have more local minima, but these tend to perform better, and there is less variance in how they perform. Higher-capacity networks run a greater risk of overfitting, but this overfitting can be (preferably) mitigated by other methods such as L2 regularization, dropout, and input noise. So don't let overfitting be the sole reason for going with a simpler network if a larger one seems appropriate.

More complex network, more complex functions

Here are regularization examples for the same data from the previous image, with the neural net with 20 hidden neurons:

Regularization strength

As you can see, regularization is effective at counteracting overfitting.

Another simple, but possibly expensive, way of reducing overfitting is increasing the amount of training data - it's unlikely to overfit many, many examples. However, this is seldom a practical option.

Generally, the methods for preventing overfitting include:

• Get more data, if possible
• Limit your model's capacity so that it can't fit the idiosyncrasies of the data you have. With neural networks, this can be accomplished by:
  – limiting the number of hidden layers and/or the number of units per layer
  – starting with small weights and stopping learning early (so the weights can't get too large)
  – weight decay: penalizing large weights using penalties on their squared values (L2) or absolute values (L1)
  – adding Gaussian noise (i.e. x_i + N(0, σ_i²)) to the inputs
• Average many different models:
  – use different models with different forms, or
  – train models on different subsets of the training data ("bagging")
• Use a single neural network architecture, but learn different sets of weights, and average the predictions across these different sets of weights

14.8.1 Regularization

Regularization techniques are used to prevent neural networks from overfitting.

L2 Regularization

L2 regularization is the most common form of regularization. We penalize the squared magnitude of all parameters (weights) as part of the objective function, i.e. we add Σ λw² to the objective function (this additional term is called the regularization term, and λ is an additional hyperparameter, the regularization parameter). It is common to include a factor of ½, i.e. use Σ ½λw², so that the gradient of this term with respect to w is just λw instead of 2λw. This discourages the network from relying heavily on a few weights and encourages it to use all weights a little. L2 regularization is sometimes called weight decay since the added regularization term penalizes large weights, favoring smaller weights.

So a regularized cost function J, built from the original unregularized cost function J0, is simply:

J = J0 + (λ/2m) Σ_w w²

This affects the partial derivative of the cost function with respect to the weights in a simple way (again, biases are not included, so their partial derivative does not change):

∂J/∂w = ∂J0/∂w + (λ/m) w

So your update rule would be:

w → w′ = w − (ηλ/m) w − (η/m) Σ_{i=1}^{m} ∂J_i/∂w

Note that biases are typically not included by convention; regularizing them usually does not have an impact on the network's generalizability.

L1 Regularization

Similar to L2 regularization, except that the regularization term added to the objective function is Σ λ|w|; that is, the sum of the absolute values of the weights, with a regularization parameter λ. The main difference between L1 and L2 regularization is that L1 regularization shrinks weights by a constant amount, whereas L2 regularization shrinks weights by an amount proportional to the weights themselves. This is made clearer by comparing the update rules derived from gradient descent, below. L1 regularization has the effect of causing weight vectors to become sparse, such that neurons only use a few of their inputs and ignore the rest as "noise". Generally L2 regularization is preferred to L1.

For L1, the partial derivative of the cost function with respect to the weights is:

∂J/∂W^N = ∂J0/∂W^N + (λ/m) sign(W^N)

This leads to the following update rule:

w → w′ = w − (ηλ/m) sign(w) − (η/m) Σ_{i=1}^{m} ∂J_i/∂w

Note that we take sign(0) = 0. Compare this with the update rule for L2 regularization:

w → w′ = w − (ηλ/m) w − (η/m) Σ_{i=1}^{m} ∂J_i/∂w

In L2 regularization, we subtract a term weighted by w, whereas in L1 regularization, the subtracted term depends only on the sign of w.

Elastic net regularization

This is just the combination of L1 and L2 regularization, such that the term introduced to the objective function is Σ (λ1|w| + λ2 w²).

Max norm constraints

This involves setting an absolute upper bound on the magnitude of the weight vectors; that is, after updating the parameters/weights, clamp every weight vector so that it satisfies ||w||₂ < c, where c is some constant (the maximum magnitude).

Dropout

Dropout is a regularization method which works well with the others mentioned so far (L1, L2, max-norm). It does not involve modifying cost functions.
Rather, the network itself is modified. During training, we specify a probability p. At the start of each training epoch, we only keep a neuron active with that probability p, otherwise we set its output to zero. If the neuron’s output is set to 0, that has the effect of temporarily “removing” that neuron for that training iteration. At the end of the epoch, all neurons are restored. This dropout is applied only at training time and applied per-layer (that is, it is applied after each layer, see the code example below). This prevents the network from relying too much on certain neurons. One way to think about this is that, for each training step, a sub-network is sampled from the full network, and only those parameters are updated. Then on the next step, a different sub-sample is taken and updated, and so on. To put it another way, dropping out neurons in this way has the effect of training multiple neural networks simultaneously. If we have multiple networks overfit to different training data, they are unlikely to all overfit in the same way. So their average should provide better results. This has the additional advantage that neurons must learn to operate in the absence of other neurons, which can have the effect of the network learning more robust features. That is, the neurons of the network should be more resilient to the absence of some information. A network after dropout is applied to each layer in a training iteration source At test time, all neurons are active (i.e. we don’t use dropout at test time). There will be twice as many hidden neurons active as there were in training, so all weights are halved to compensate. We must scale the activation functions by p to maintain the same expected output for each neuron. Say x is the output of a neuron without dropout. With dropout, the neuron’s output has a chance p of being set to 0, so its expected output becomes px (more verbosely, it has 1 − p chance of 402 CHAPTER 14. NEURAL NETS 403 14.8. OVERFITTING becoming 0, so its output is px + (1 − p)0, which simplifies to px). Thus we must scale the outputs (i.e. the activation functions) by p to keep the expected output consistent. This scaling can be applied at training time, which is more efficient - this technique is called inverted dropout. For comparison, here is an implementation of regular dropout and an implementation of inverted dropout (source from: https://cs231n.github.io/neural-networks-2/) # Dropout p = 0.5 # probability of keeping a unit active. higher = less dropout def train_step(X): ””” X contains the data ””” # forward pass for example 3-layer neural network H1 = np.maximum(0, np.dot(W1, X) + b1) U1 = np.random.rand(*H1.shape) < p # first dropout mask H1 *= U1 # drop! H2 = np.maximum(0, np.dot(W2, H1) + b2) U2 = np.random.rand(*H2.shape) < p # second dropout mask H2 *= U2 # drop! out = np.dot(W3, H2) + b3 # backward pass: compute gradients... (not shown) # perform parameter update... (not shown) def predict(X): # ensembled forward pass H1 = np.maximum(0, np.dot(W1, X) + b1) * p # NOTE: scale the activations H2 = np.maximum(0, np.dot(W2, H1) + b2) * p # NOTE: scale the activations out = np.dot(W3, H2) + b3 # Inverted dropout p = 0.5 # probability of keeping a unit active. higher = less dropout def train_step(X): # forward pass for example 3-layer neural network H1 = np.maximum(0, np.dot(W1, X) + b1) U1 = (np.random.rand(*H1.shape) < p) / p # first dropout mask. Notice /p! H1 *= U1 # drop! 
H2 = np.maximum(0, np.dot(W2, H1) + b2) U2 = (np.random.rand(*H2.shape) < p) / p # second dropout mask. Notice /p! H2 *= U2 # drop! out = np.dot(W3, H2) + b3 CHAPTER 14. NEURAL NETS 403 14.9. HYPERPARAMETERS 404 # backward pass: compute gradients... (not shown) # perform parameter update... (not shown) def predict(X): # ensembled forward pass H1 = np.maximum(0, np.dot(W1, X) + b1) # no scaling necessary H2 = np.maximum(0, np.dot(W2, H1) + b2) out = np.dot(W3, H2) + b3 Regularization recommendations It is most common to use a single, global L2 regularization strength that is cross-validated. It is also common to combine this with dropout applied after all layers. The value of p = 0.5 is a reasonable default, but this can be tuned on validation data. https: //cs231n.github.io/neural-networks-2/ 14.8.2 Artificially expanding the training set In addition to regularization, training on more data can help prevent overfitting. This, unfortunately, is typically not a practical option. However, the training set can be artificially expanded by taking existing training data and modifying it in a way we’d expect to see in the real world. For instance, if we were training a network to recognize handwritten digits, we may take our examples and rotate them slightly, since this could plausibly happen naturally. A related technique is training on adversarial examples (detailed elsewhere), in which training examples are modified to be deliberately hard for the network to classify, so that it can be trained on more ambiguous/difficult examples. The most common approach to dealing with overfitting is to apply some kind of regularization. 14.9 Hyperparameters There are many hyperparameters to set with neural networks, such as: • architecture decisions – – – – number of layers number of units per layer type of unit etc • weight penalty • learning rate • momentum 404 CHAPTER 14. NEURAL NETS 405 14.9. HYPERPARAMETERS • whether or not to use dropout • etc 14.9.1 Choosing hyperparameters TODO See: https://cs231n.github.io/neural-networks-3/#anneal Not only are there many hyperparameters for neural networks; it can also be very difficult to choose good ones. You could do a naive grid search and just try all possible combinations of hyperparameters, which is infeasible because it blows up in size. You could randomly sample combinations as well, but this still has the problem of repeatedly trying hyperparameter values which may have no effect. Instead, we can apply machine learning to this problem and try and learn what hyperparameters may perform well based on the attempts thus far. In particular, we can try and predict regions in the hyperparameter space that might do well. We’d want to also be able to be explicit about the uncertainty in our prediction. We can use Gaussian process models to do so. The basic assumption of these models is that similar inputs give similar outputs. However, what does “similar” mean? Is 200 hidden units “similar” to 300 hidden units or not? Fortunately, such models can also learn this scale of similarity for each hyperparameter. These models predict a Gaussian distribution of values for each hyperparameter (hence the name). A method for applying this: • keep track of the best hyperparameter combination so far • pick a new combination of hyperparameters such that the expected improvement of the best combination is big So we might try a new combination, and it might not do that well, but we won’t have replaced our current best. 
This method for selecting hyperparameters is called Bayesian (hyperparameter) optimization, and is a better approach than by picking hyperparameters by hand (less prone to human error). 14.9.2 Tweaking hyperparameters A big challenge in designing a neural network is calibrating its hyperparameters. From the start, it may be difficult to intuit what hyperparameters need tuning. There are so many to choose from: network architecture, number of epochs, cost function, weight initialization, learning rate, etc. There are a few heuristics which may help. When the learning rate η is set too high, you typically see constant oscillation in the error rate as the network trains. This is because with too large a learning rate, you may miss the minimum in CHAPTER 14. NEURAL NETS 405 14.10. DEEP NEURAL NETWORKS 406 the error surface by “jumping” too far. Thus once you see this occurring, it’s a hint to try a lower learning rate. Learning rates which are too low tend to have a slow decrease in error over training. You can try higher learning rates if this seems to be the case. The learning rate does not need to be fixed. When starting out training, you may want a high learning rate to quickly get close to a minimum. But once you get closer, you may want to decrease the learning rate to carefully identify the best minimum. The specification of how the learning rate decreases is called the learning rate schedule. Some places recommend using a learning rate in the form: ηt = η0 (1 + η0 λt)−1 Where η0 is the initial learning rate, ηt is the learning rate for the tth example, and λ is another hyperparameter. For the number of epochs, we can use a strategy called “early stopping”, where we top once some performance metric (e.g. classification accuracy) appears to stop improving. More precisely, “stop improving” can mean when the performance metric doesn’t improve for some n epochs. However, neural networks sometimes plateau for a little bit and then keep on improving. In which case, adopting an early stopping strategy can be harmful. You can be somewhat conservative and set n to a higher value to play it safe. 14.10 Deep neural networks A deep neural network is simply a neural network with more than one hidden layer. Deep learning is the field related to deep neural networks. These deep networks can perform much better than shallow networks (networks with just one hidden layer) because they can embody a complex hierarchy of concepts. Many problems can be broken down into subproblems, each of which can be addressed by a separate neural network. Say for example we want to know whether or not a face is in an image. We could break that down (decompose it) into subproblems like: • • • • is there an eye? is there an ear? is there a nose? etc. We could train a neural network on each of these subproblems. We could even break these subproblems further (e.g. “Is there an eyelash?”, “Is there an iris?”, etc) and train neural networks for those, and so on. 406 CHAPTER 14. NEURAL NETS 407 14.10. DEEP NEURAL NETWORKS Then if we want to identify a face, we can aggregate these networks into a larger network. This kind of multi-layered neural net is a deep neural network. Multilayer nns must have nonlinear activation functions, otherwise they are equivalent to a single layer network aggregating its weights. That is, a 2 layer network has weight vectors W1 and W2 and input X. 
The network computes (XW1 )W2 , which is equivalent to X(W1 W2 ), so the network is equivalent to a single layer network with weight vectors W1 W2 Training deep neural networks (that is, neural networks with more than one hidden layer) is not as straightforward as it is with a single hidden layer - a simple stochastic gradient descent + backpropagation approach is not as effective or quick. This is because of unstable gradients. This has two ways of showing up: • Vanishing gradients, in which the gradient gets smaller moving backwards through the hidden layers, such that earlier layers learn very slowly (and may not learn at all). • Exploding gradients, in which the gradient gets much larger moving backwards through the hidden layers, such that earlier layers cannot find good parameters. These unstable gradients occur because gradients in earlier layers are the products of the later layers (refer to backpropagation for details, but remember that the δ i for layer i is computed from δ i+1 ). Thus if these later terms are mostly < 1, we will have a vanishing gradient. If these later terms are > 1, they can get very large and lead to an exploding gradient. 14.10.1 Unstable gradients Certain neural networks, such as RNNs, can have unstable gradients, in which gradients may grow exponentially (an exploding gradient) or shrink exponentially until it reaches zero (a vanishing gradient). With exploding gradients, the minimum is not found because, with such a large gradient, the steps don’t effectively search the space. With vanishing gradients, the minimum is not found because a gradient of zero means the space isn’t searched at all. Unstable gradients can occur as a result of drastic changes in the cost surface, as illustrated in the accompanying figure (from Pascanu et al via http://peterroelants.github.io/posts/rnn_ implementation_part01/). In the figure, the large jump in cost leads to a large gradient which causes the optimizer to make an exaggerated step. There are methods for dealing with unstable gradients, including: • • • • Gradient clipping (e.g. limiting g to g = Hessian-Free Optimization Momentum Resilient backpropagation (Rprop) CHAPTER 14. NEURAL NETS t ||g||2 if ||g||2 > t, where t is some clipping threshold) 407 14.10. DEEP NEURAL NETWORKS 408 Resilient backpropgation (Rprop) Normally, weights are updated by the size of the gradient (typically scaled by some learning rate). However, as demonstrated above, this can lead to an unstable gradient. Resilient backpropagation ignores the size of the gradient and only considers its sign and then uses two hyperparameters, η − , η + (η + > 1) to determine the size of the update. η − . If the sign of the gradient changes in an iteration, the weight update ∆ is multiplied by η − , i.e. ∆ = ∆η − . If the gradient’s sign doesn’t change, the weight update ∆ is multiplied by η + , i.e. ∆ = ∆η + . If the gradient’s sign changes, this usually indicates that we have passed through a local minima. Then the weight is updated by this computed value in the opposite direction of its gradient: W → W ′ = W − sign( ∂J )∆ ∂W Typically, η + = 1.2, η − = 0.5. This is essentially separate adaptive learning rates but ignoring the size of the gradient and only look at the sign. That is, we increase weights multiplicatively by η + if the last two gradient signs agree, otherwise, we decrease the step size multiplicatively by η − . As with separate adaptive learning rates, we generally want to limit the range of step sizes so that it can’t be too small or too large. 
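Here is a minimal sketch of the sign-based Rprop step described above. This is illustrative only; the step-size bounds and variable names are assumptions, not values prescribed by the original notes.

import numpy as np

eta_plus, eta_minus = 1.2, 0.5       # typical values given above
delta_min, delta_max = 1e-6, 50.0    # assumed bounds on the per-weight step size

def rprop_step(W, grad, prev_grad, delta):
    # Only the sign of the gradient is used, never its magnitude
    sign_change = np.sign(grad) * np.sign(prev_grad)
    # Grow the step where the sign agrees, shrink it where the sign flipped
    delta = np.where(sign_change > 0, delta * eta_plus, delta)
    delta = np.where(sign_change < 0, delta * eta_minus, delta)
    delta = np.clip(delta, delta_min, delta_max)
    # Move each weight opposite to its gradient's sign by its own step size
    W = W - np.sign(grad) * delta
    return W, delta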
Rprop is meant for full batch learning or for very large mini-batches. To use this technique with mini-batches, see Rmsprop. 14.10.2 Rmsprop Rmsprop is the mini-batch version of Rprop. It computes a moving average, MA, of the squared gradient for each parameter: MA = λMA + (1 − λ)( ∂J 2 ) ∂W Then normalizes the gradient by dividing by the square root of this moving average: ∂J 1 √ ∂W MA 408 CHAPTER 14. NEURAL NETS 409 14.11. CONVOLUTIONAL NEURAL NETWORKS (CNNS) Rmsprop can be used with momentum as well (i.e. update the velocity with this modified gradient). The basic idea behind rmsprop is to adjust the learning rate per-parameter according to the (smoothed) sum of the previous gradients. Intuitively this means that frequently occurring features get a smaller learning rate (because the sum of their gradients is larger), and rare features get a larger learning rate. http://www.wildml.com/2015/10/ recurrent-neural-network-tutorial-part-4-implementing-a-grulstm-rnn-with-python-and-theano/ 14.11 Convolutional Neural Networks (CNNs) In a regular neural network, the relationship between a pixel and one that is next to it is the same as its relationship with a pixel far away - the structural information of the image is totally lost. Convolutional nets are capable of encoding this structural information about the image; as a result, they are especially effective with image-based tasks. Convolutional nets are based on three ideas: • local receptive fields • shared weights • pooling 14.11.1 Local receptive fields A regular neural network is fully-connected in that every node from a layer i is connected to each node in the layer i + 1. This is not the case with convolutional nets. Typically we think of a layer as a line of neurons. With convolutional nets, it is more useful to think of the neurons arranged in a grid. (Note: the following images are from http://neuralnetworksanddeeplearning.com/chap6.html TODO replace the graphics) We do not fully connect this input layer to the hidden layer (which we’ll call a convolutional layer). Rather, we connect regions of neurons to neurons in the hidden layer. These regions are local receptive fields, local to the neuron at their center (they may more simply be called windows). We can move across local receptive fields one neuron at a time, or in greater movements. These movements are called the stride length. These windows end up learning to detect salient features, but are less sensitive to where exactly they occur. For instance, for recognizing a human face, it may be important that we see an eye in one region, but it doesn’t have to be in a particular exact position. A filter (also called a kernel) function is applied to each window to transform it into another vector (which is then passed to a pooling layer, see below). CHAPTER 14. NEURAL NETS 409 14.11. CONVOLUTIONAL NEURAL NETWORKS (CNNS) 410 A layer as a grid Local receptive fields map to a neuron in the hidden layer 410 CHAPTER 14. NEURAL NETS 411 14.11. CONVOLUTIONAL NEURAL NETWORKS (CNNS) Moving across fields at a stride length of 1 One architectural decision with CNNs is the use of wide convolution or narrow convolution. When you reach the edges of your input (say, the edges of an image), do you stop there or do you pad the input with zeros (or some other value) so we can fit another window? Padding the input is wide convolution, not padding is narrow convolution. Note that, as depicted above, narrow convolution will yield a smaller feature map of size: input shape - filter shape + 1. 
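As a concrete illustration of sliding a filter across local receptive fields, here is a minimal sketch of a narrow ("valid") convolution over a 2D input. It is a naive toy implementation meant only to show the windowed weighted sum and the resulting output size, not how libraries actually compute convolutions.

import numpy as np

def conv2d_valid(image, kernel, stride=1):
    # Slide `kernel` over `image` with no padding (narrow convolution)
    ih, iw = image.shape
    kh, kw = kernel.shape
    # Narrow convolution: output is (input shape - filter shape + 1) per dimension
    oh = (ih - kh) // stride + 1
    ow = (iw - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            window = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(window * kernel)  # weighted sum over the receptive field
    return out

# A 5x5 input and a 3x3 filter give a 3x3 feature map (5 - 3 + 1 = 3)
image = np.random.rand(5, 5)
kernel = np.random.rand(3, 3)
print(conv2d_valid(image, kernel).shape)  # (3, 3)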
Note that this hyperparameter is sometimes called “border mode”. A border mode of “valid” is equivalent to a narrow convolution. There are a few different ways of handling padding for a wide convolution. Border modes of “half” (also called “same”) and “full” correspond to different padding strategies. Say we have a filter of size r × c (where r is rows and c is columns). For a border mode of “half”/“same”, we pad the input with a symmetric border of r //2 rows and c//2 columns (where // indicates integer division). When r and c are both odd, the feature map has the same shape as the input. There is also a “full” border mode which pads the input with a symmetric border of r − 1 rows and c − 1 columns. This is equivalent to applying the filter anywhere it overlaps with a pixel and yields a feature map of size: input shape + filter shape - 1. For example, say we have the following image: xxx xxx xxx Say we have a 3x3 filter. For a border mode of “half”/“same”, we the padded image would look like (padding is indicated with o): CHAPTER 14. NEURAL NETS 411 14.11. CONVOLUTIONAL NEURAL NETWORKS (CNNS) 412 ooooo oxxxo oxxxo oxxxo ooooo For a border mode of “full”, the padded image would instead be: ooooooo ooooooo ooxxxoo ooxxxoo ooxxxoo ooooooo ooooooo 14.11.2 Shared weights Another change here is that the hidden layer has one set of weights and a bias that is shared across the entire layer (these weights and biases are accordingly referred to as shared weights and the shared bias, and together, they define a filter or a kernel). As a result, if we have receptive fields of m × m size, the output of the i, jth neuron in the hidden layer looks like: f (b + m−1 ∑ m−1 ∑ Wk,l OUT0i+k,j+l ) k=0 l=0 Where W ∈ Rm×m is the array of shared weights and OUT0x,y is the output of the input neuron at position x, y . Another way of writing the above is: f (b + W ∗ OUT0 ) Where ∗ is the convolution operator, which is like a blurring/mixing of functions. In this context, it is basically a weighted sum. The consequence of this sharing of weights and biases is that this layer detects the same feature across different receptive fields. For example, this layer could detect vertical edges anywhere in the image. If an edge shows up in the upper-right part of the image, the corresponding input neuron for that receptive field will fire. If an edge shows up in the lower-left part of the image, the corresponding input neuron for that receptive field will also fire, due to the fact that they all share weights and a bias. 412 CHAPTER 14. NEURAL NETS 413 14.11. CONVOLUTIONAL NEURAL NETWORKS (CNNS) As a result of this property, this mapping between layers is often called a feature map. Technically, the kernel/filter output a feature map, which is to say they are not the same thing, but in practice the terms “kernel” and “filter” are often used interchangeably with “feature map”. For example, say we have a 3x3 filter: 0, 0, 0 0, 0, 0 0, 0, 0 Each position in the filter is a weight to be learned; here we have initialized them to 0 (not necessarily the best choice, but this is just an example). Each position in the filter lines up with a pixel as the filter slides across the image (as depicted above) Let’s say that the weights learned by the filter end up being the following: −1, −1, −1 −1, 10, −1 −1, −1, −1 Then say we place the filter over the following patch of pixels: 0, 0, 0 0, 255, 0 0, 0, 0 We want to combine (i.e. 
mix) the filter and the pixel values to produce a single pixel value (which will be a single pixel in the resulting feature map). We do so by convolving them as the sum of the element-wise product: (−1×0)+(−1×0)+(−1×0)+(−1×0)+(10×255)+(−1×0)+(−1×0)+(−1×0)+(−1×0) Pixel-by-pixel the feature map is produced in this way. We may include multiple feature maps/filters, i.e. have the input connect to many hidden layers of this kind (this is typically how it’s done in practice). Each layer would learn to detect a different feature. Another benefit to sharing weights and biases across the layer is that it introduces some resilience to overfitting - the sharing of weights means that the layer cannot favor peculiarities in particular parts of the training data; it must take the whole example into account. As a result, regularization methods are seldom necessary for these layers. CHAPTER 14. NEURAL NETS 413 14.11. CONVOLUTIONAL NEURAL NETWORKS (CNNS) 14.11.3 414 Pooling layers In addition to convolutional layers there are also pooling layers, which often accompany convolutional layers (often one per convolutional layer) and follow after them. Pooling layers produced a condensed version of the feature map they are given (for this reason, this process is also known as subsampling, so pooling layers are sometimes called subsampling layers). For example, a 2 × 2 neuron region of the feature map may be represented with only one neuron in the pooling layer. Mapping from feature map to a pooling layer There are a few different strategies for how this compression works. A common one is max-pooling, in which the pooling neuron just outputs the maximum value of its inputs. In some sense, maxpooling asks its region: was your feature present? And activates if it was. It isn’t concerned with where in that region the feature was, since in practice, its precise location doesn’t matter so much as its relative positioning to other features (especially with images). Another pooling technique is L2 pooling. Say a pooling neuron has an m × m input region of neurons coming from the layer i . Then it’s output is: v um−1 m−1 u∑ ∑ t (OUTij,k )2 j=0 k=0 Another pooling technique is average-pooling in which the average value of the input is output. There is also the k-max pooling method, which takes the top k values in each dimension, instead of just the top value as is with max-pooling. The result is a matrix rather than a vector. 14.11.4 Network architecture Generally, we have many feature maps (convolutional layers) and pooling layer pairs grouped together; conceptually it is often easier to think of these groups themselves as layers (called “convolutionalpooling layers”). 414 CHAPTER 14. NEURAL NETS 415 14.11. CONVOLUTIONAL NEURAL NETWORKS (CNNS) An example convolutional network The output layer is fully-connected (i.e. every neuron from the convolutional-pooling layer are connected to every neuron in the output layer). Often it helps to include another (or more) fully-connected layer just prior to the output layer. This can be thought of as aggregating and considering all the features coming from the convolutional-pooling layer. It is also possible to insert additional convolutional-pooling layers (this practice is called hierarchical pooling). Conceptually, these take the features output by the previous convolutional-pooling layer and extract higher-level features. The way these convolutional-pooling layers connect to each other is a little different. 
Each of this new layer’s input neurons (that is, the neurons in its first set of convolutional layers) takes as its input all of the outputs (within its local receptive field) from the preceding convolutional-pooling layer. For example, if the preceding convolutional-pooling layer has 20 layers in it, and we have receptive fields of size 5 × 5, then each of the input neurons for the new convolutional-pooling layer would have 20 × 5 × 5 inputs. 14.11.5 Training CNNs Backpropagation is slightly different for a convolutional net because the typical backpropagation assumes fully-connected layers. TODO add this 14.11.6 Convolution kernels CNNs learn a convolution kernel and (for images) apply it to every pixel across the image: CHAPTER 14. NEURAL NETS 415 14.12. RECURRENT NEURAL NETWORKS (RNNS) 416 Convolution kernel example source 14.12 Recurrent Neural Networks (RNNs) A recurrent neural network is a feedback neural network, that is, it is a neural net where the outputs of neurons are fed back into their inputs. They have properties which give them advantages over feed-forward NNs for certain problems. In particular, RNNs are well-suited for handling sequences as input. With machine learning, data is typically represented in vector form. This works for certain kinds of data, such as numerical data, but not necessarily for other kinds of data, like text. We usually end up coercing text into some vector representation (e.g. TF-IDF) and end up losing much of its structure (such as the order of words). This is ok for some tasks (such as topic detection), but for many others we are throwing out important information. We could use bigrams or trigrams or so on to preserve some structure but this becomes unmanageably large (we end up with very high-dimension vectors). Recurrent neural networks are able to take sequences as input, i.e. iterate over a sequence, instead of fixed-size vectors, and as such can preserve the sequential structure of things like text and have a stronger concept of “context”. Basically, an RNN takes in each item in the sequence and updates a hidden representation (its state) based on that item and the hidden representation from the previous time step. If there is no previous hidden representation (i.e. we are looking at the first item in the sequence), we can initialize it as either all zeros or treat the initial hidden representation as another parameter to be learned. Another way of putting this is that the core difference of an RNN from a regular feedforward network 416 CHAPTER 14. NEURAL NETS 417 14.12. RECURRENT NEURAL NETWORKS (RNNS) is that the output of a neuron is a function of its inputs and of its past state, e.g. OUTt = f (OUTt−1 Wr + Xt Wx ) Where Wr are the recursive weights. 14.12.1 Network architecture In the most basic RNN, the hidden layer have two inputs: the input from the previous layer, and the layer’s own output from the previous time step (so it loops back onto itself): Simple RNN network, with hidden nodes looping source This simple network can be visualized over time as well: Simple RNN network, with hidden nodes looping over time source Say we have a hidden layer L1 of size 3 and another hidden layer L2 of size 2. In a regular NN, the input to L2 is of size 3 (because that’s the output size of L1 ). In an RNN, L2 would have 3+2 inputs, 3 from L1 , and 2 from its own previous output. CHAPTER 14. NEURAL NETS 417 14.12. 
RECURRENT NEURAL NETWORKS (RNNS) 418 This simple feedback mechanism offers a kind of short-term memory - the network “remembers” the output from the previous time step. It also allows for variable-sized inputs and outputs - the inputs can be fed in one at a time and combined by this feedback mechanism. 14.12.2 RNN inputs The input item can be represented with one-hot encoding, i.e. each term is to a vector of all zeroes and one 1. For example, if we had the vocabulary the, mad, cat, the terms might be respectively represented as [1, 0, 0], [0, 1, 0], [0, 0, 1]. Another way to represent these terms is with an embedding matrix, in which each term is mapped to some index of the matrix which points to some n-dimensional vector representation. So the RNN learns vector representations for each term. Convolutional neural networks, and feed-forward neural networks in general, treat an input the same no matter when they are given it. For RNNs, the hidden representation is like (short-term) “memory” for the network, so context is taken into account for inputs; that is, an input will be treated differently depending on what the previous input(s) was/were. 14.12.3 Training RNNs (Note that RNNs train very slowly on CPUs; they train significantly faster on GPUs.) RNNs are trained using a variant of backpropagation called backpropagation through time, which just involves unfolding the RNN a certain number of time steps, which results in what is essentially a regular feedforward network, and then applying backpropagation: ∂E ∂E ∂OUTt ∂E = = Wr ∂OUTt−1 ∂OUTt ∂OUTt−1 ∂OUTt which starts with: ∂E ∂E = ∂y ∂OUTn Where OUTn is the output of the last layer. The gradients of the cost function wrt to the weights is computed by summing the weight gradients in each layer: n ∑ ∂E ∂E Xt = ∂Wx k=0 ∂OUTt n ∑ ∂E ∂E = OUTt−1 ∂Wr k=1 ∂OUTt 418 CHAPTER 14. NEURAL NETS 419 14.12. RECURRENT NEURAL NETWORKS (RNNS) This summing of the weight gradients at each time step is the main difference from regular feedforward networks, aside from that BPTT is basically just backpropagation on an RNN unrolled up to some time step t. However, if working with long sequences, this is effectively like training a deep network with many hidden layers (i.e. this is equivalent to an unrolled RNN), which can be difficult (due to vanishing or exploding gradients). In practice, it’s common to truncate the backpropagation by running it for only to a few time steps back. The vanishing gradient problem in RNNs means long-term dependencies won’t be learned - the effect of earlier steps “vanish” over time steps (this is the same problem of vanishing gradients in deep feedforward networks, given that an RNN is basically a deep neural net). Exploding gradients are more easily dealt with - it’s obvious when they occur (you’ll see NaNs, for instance), and you can clip them at some maximum value, which can be quite effective (refer to this paper) Some strategies for dealing with vanishing gradients: • vanishing gradients are sensitive to weight initialization, so proper weight initialization can help avoid them • ReLUs can work better as the nonlinear activation functions since they are not bounded by 1 as the sigmoid and tanh nonlinearities are Generally, however, Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) architectures are used instead of vanilla RNNs, which were designed for mitigating vanishing gradients (for the purpose of better learning long-range dependencies). 14.12.4 LSTMs This short-term memory of (vanilla) RNNs may be too short. 
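For contrast with the LSTM unit described next, here is a minimal sketch of the vanilla recurrent step OUT_t = f(OUT_{t-1} W_r + X_t W_x) from above, with tanh as the nonlinearity. The sizes, initialization, and variable names are illustrative assumptions.

import numpy as np

def rnn_step(x_t, out_prev, Wx, Wr, b):
    # One vanilla RNN step: the new output depends on the current input
    # and on the output from the previous time step
    return np.tanh(x_t @ Wx + out_prev @ Wr + b)

# Illustrative sizes: 10-dimensional inputs, 4 hidden units
Wx = np.random.randn(10, 4) * 0.01
Wr = np.random.randn(4, 4) * 0.01
b = np.zeros(4)

out = np.zeros(4)                   # initial hidden state (all zeros)
for x_t in np.random.randn(7, 10):  # a sequence of 7 input vectors
    out = rnn_step(x_t, out, Wx, Wr, b)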
RNNs may incorporate long short-term memory (LSTM) units instead, which just computes hidden states in a different way. With an LSTM unit, we have memory stored and passed through a more involved series of steps. This memory is modified in each step, with something being added and something being removed at each step. The result is a neural network that can handle longer-term context. These LSTM units have a three gates (in contrast to the single activation function vanilla RNNs have): • write (input) - controls the amount of current input to be remembered • read (output) - controls the amount of memory given as output to the next stage • erase (forget) - controls what part of the memory is erased or kept in the current time step TODO include Chris Olah’s LSTM diagrams: http://colah.github.io/posts/2015-08-Understanding-LSTMs/ These gates are sigmoid functions combined with a pointwise multiplication operation. They are called gates because they tune how much of their input is passed on (i.e. sigmoids give a value in CHAPTER 14. NEURAL NETS 419 14.12. RECURRENT NEURAL NETWORKS (RNNS) 420 A LSTM unit source [0, 1], which can be thought as the percent of input to pass on). The parameters for these gates are learned. The input gate determines how much of the input is let through, the forget gate determines how much of the previous state is let through. We compute a new “memory” (i.e. the LSTM unit’s internal state) from the outputs of these gates. The output gate determines how much of this new memory to output as the hidden state. In more detail: The forget gate controls what is removed (“forgotten”) from the cell state. The input to the forget gate is the concatenation of the cell’s output from the previous step, OUTt−1 and the current input to the cell, Xt . The gate computes a value in [0, 1] (with the sigmoid function) for each value in the previous cell state Ct−1 ; the resulting value determines how much of that value to keep (1 means keep it all, 0 means forget all of it). So we are left with a vector of values in [0, 1], which we then pointwise multiply with the existing cell state to get the updated cell state. The output of a forget gate f at step t is: ˙ ft = sigmoid(Wf [OUT t−1 , Xt ] + bf ) Then our intermediate value of Ct is Ct′ = ft Ct−1 . Where Wf , bf are the forget gate’s weight vector and bias, respectively. 420 CHAPTER 14. NEURAL NETS 421 14.12. RECURRENT NEURAL NETWORKS (RNNS) The input gate controls what information gets stored in the cell state. This gate also takes as input the concatenation of OUTt−1 and Xt . We will denote its output at step t as it . Like the forget gate, this is a vector of values in [0, 1] which determine how much information gets through - 0 means none, 1 means all of it. A tanh function takes the same input and outputs a vector of candidate values, C̃t . We pointwise multiple this candidate value vector with the input gate’s output vector to get the vector that is passed to the cell state. This resulting vector is pointwise added to the updated cell state. ˙ it = sigmoid(Wi [OUT t−1 , Xt ] + bi ) ˙ C̃t = tanh(WC [OUT t−1 , Xt ] + bC ) Thus our final updated value of Ct is Ct = Ct′ + it C̃t . We don’t output this cell state Ct directly. Rather, we have yet another gate, the output gate (sometimes called a read gate) that outputs another vector with values in [0, 1], ot , which determines how much of the cell state is outputted. This gate again takes in as input the concatenation of OUTt−1 and Xt . 
So the output of the output gate is just: ot = sigmoid(Wo [OUTt−1 , Xt ] + bo ) To get the final output of the cell, we pass the cell state Ct through tanh and then pointwise multiply that with the output of the output gate: OUTt = ot tanh(Ct ) An RNN is can be thought of as an LSTM in which all input and output gates are 1 and all forget gates are 0, with an additional activation function (e.g. tanh) afterwards (LSTMs do not have this additional activation function). There are many variations of LSTMs (see this paper for empirical comparisons between some of them), the most common of which is the Gated Recurrent Unit (GRU). GRUs A gated recurrent unit (GRU) is a simpler LSTM unit; it includes only two gates (also both sigmoid functions) - the reset gate r and the update gate z. The reset gate determines how to mix the current input and the previous state and the update gate determines how much of the previous state to retain. A vanilla RNN is a GRU architecture in which all reset gates are 1 and all update gates are 0 (with an additional activation function; like LSTMs don’t have this additional nonlinearity). GRUs don’t CHAPTER 14. NEURAL NETS 421 14.12. RECURRENT NEURAL NETWORKS (RNNS) 422 have internal states like LSTM units do; there is no output gate so there is no need for an internal state. The cell state and its output are also merged as its hidden state, ht : ht−1 = OUTt−1 ˙ t−1 , Xt ]) zt = sigmoid(Wz [h ˙ t−1 , Xt ]) rt = sigmoid(Wr [h ˙ t ht−1 , Xt ]) h̃t = tanh(W [r ht = (1 − zt )ht−1 + zt h̃t OUTt = ht Peephole connections This LSTM variant just passes on the previous cell state, Ct−1 , to the forget and input gates, and the new cell state, Ct , to the output gate, that is, all that is changed is that: ˙ t−1 , OUTt−1 , Xt ] + bf ) ft = sigmoid(Wf [C ˙ t−1 , OUTt−1 , Xt ] + bi ) it = sigmoid(Wi [C ˙ t−1 , OUTt−1 , Xt ] + bo ) ot = sigmoid(Wo [C Update gates In this LSTM variant, the forget and input gates are combined into a single update gate. The value ft is computed the same, but it is instead just: it = 1 − ft Essentially, we just update enough information to replace what was forgotten. 14.12.5 BI-RNNs Bidirectional RNNs (BI-RNNs) are a variation on RNNs in which the RNN can not only look into the past, but it can also look into the “future”. The BI-RNN has two states, sif (the forward state) and sib (the backward state). The forward state sif is based on x1 , x2 , . . . , xi , whereas the backward state sib is based on xn , xn−1 , . . . , xi . These states are managed by two different RNNs, one which is given the sequence x1:n and the other is fed xn:1 (that is, the input in reverse). The output at position i is the concatenation of these RNNs’ output vectors, i.e. yi = [yif ; yib ]. 422 CHAPTER 14. NEURAL NETS 423 14.13. UNSUPERVISED NEURAL NETWORKS 14.12.6 Attention mechanisms In people, “attention” is a mechanism by which we focus on one particular element of our environment, such that our perception of the focused element is in high-fidelity/resolution, whereas surrounding elements are at a lower resolution. Attention mechanisms in recurrent neural networks emulate this behavior. This amounts to a weighted sum across input states (typically weights are normalized to sum to 1); higher weights indicate more “focus” or attention. For instance, consider neural machine translation models. Their basic form consists of two RNNs, one which takes an input sentence (the encoder) and one which produces the translated output sentence (the decoder). 
The encoder takes the input sentence, produces a sentence embedding (i.e. a single vector meant to encapsulate the sentence’s meaning), then the decoder takes that embedding and outputs the translated sentence. Representing a sentence as a single embedding is challenging, especially since earlier parts of the sentence may be forgotten. There are some architectures such as the bidirectional variant that help with this, but attention mechanisms can help so that the decoder has access to the full spread of inputs and can “focus” more on translating individual parts when appropriate. This means that instead of taking a single sentence embedding, each output word is produced through this weighted combination of all input states. Note that these attention weights are stored for each step since each step the model distributes its attention differently. This can add up quickly. Attention mechanisms can be thought of as an addressing system for selecting locations in memory (e.g. an array) in a weighted fashion. 14.13 Unsupervised neural networks The most basic one is probably the autoencoder, which is a feed-forward neural net which tries to predict its own input. While this isn’t exactly the world’s hardest prediction task, one makes it hard by somehow constraining the network. Often, this is done by introducing a bottleneck, where one or more of the hidden layers has much lower dimensionality than the inputs. Alternatively, one can constrain the hidden layer activations to be sparse (i.e. each unit activates only rarely), or feed the network corrupted versions of its inputs and make it reconstruct the clean ones (this is known as a denoising autoencoder). [https://www.metacademy.org/roadmaps/rgrosse/deep_learning] 14.13.1 Autoencoders Autoencoders are a feedforward neural network used for unsupervised learning. Autoencoders extract meaningful features by trying to output a reproduction of its input. That is, the output layer is the same size as its input layer, and it tries to reconstruct its input at the output layer. CHAPTER 14. NEURAL NETS 423 14.13. UNSUPERVISED NEURAL NETWORKS 424 Generally the output of an autoencoder is notated x̂. The first half (i.e. from the input layer up to the hidden layer) of the autoencoder architecture is called the encoder, and the latter half (i.e. from the hidden layer to the output layer) is called the decoder. Often the weights of the decoder, W ∗, are just the transpose of the weights of the encoder W , i.e. W ∗ = W T . We refer to such weights as tied weights. Essentially what happens is the hidden layer learns a compressed representation of the input (given that it is a smaller size than the input/output layers, this is called an undercomplete hidden layer, the learned representation is called an undercomplete representation), since it needs to be reconstructed by the decoder back to its original form. That is, the network needs to find some way of representing the input with less information. In some sense, we do this already with language, where we may represent a photo with a word (or a thousand words with a photo). Undercomplete hidden layers do a good job compressing data similar to its training set, but bad for other inputs. On the other hand, the hidden layer may be larger than the input/output layers, in which case it is called an overcomplete hidden layer and the learned representation of the input is an overcomplete representation. 
There's no compression as a result, and there's no guarantee that anything meaningful will be learned (since the network can essentially just copy the input). However, overcomplete representations are appealing as a concept because if we are using the autoencoder to learn features for us, we may want to learn many features. So how can we learn useful overcomplete representations?

Sparse autoencoders

Using a hidden layer size smaller than your input is tricky - encoding a lot of information into fewer bits is quite challenging. Rather counterintuitively, a larger hidden layer helps, where some hidden units are randomly turned off during a training iteration - that way, the output isn't a mere copy of the input, and learning is easier since there is more "room" to represent the input. Such an autoencoder is called a sparse autoencoder.

In effect, what an autoencoder is learning is some higher-level representation of its input. In the case of an image, it may go from pixels to edges. We can stack these sparse autoencoders on top of each other, so that higher and higher-level representations are learned. The sparse autoencoder that goes from pixels to edges can feed into another one that learns how to go from edges to shapes, for example.

Denoising autoencoders

A denoising autoencoder is a way of learning useful overcomplete representations. The general idea is that we want the encoder to be robust to noise (that is, to be able to reconstruct the original input even in the presence of noise). So instead of inputting x, we input x̃, which is just x with noise added (sometimes called a corrupted input), and the network tries to reconstruct the noiseless x as its output.

There are many ways this noise can be added, but two popular approaches are:

• for each component of an input, set it to 0 with probability v
• add Gaussian noise (mean 0, and some variance; this variance is a hyperparameter)

Loss functions for autoencoders

Say our neural network is f (x) = x̂.

For binary inputs, we can use cross-entropy (more precisely, the sum of Bernoulli cross-entropies):

l(f (x)) = − ∑k [ xk log(x̂k ) + (1 − xk ) log(1 − x̂k ) ]

For real-valued inputs, we can use the sum of squared differences (i.e. the squared Euclidean distance):

l(f (x)) = (1/2) ∑k (x̂k − xk )²

And we use a linear activation function at the output.

Loss function gradient in autoencoders

Note that if you are using tied weights, the gradient ∇W l(f (x (t) )) is the sum of two gradients; that is, it is the sum of the gradients for W ∗ and W T .

Contractive autoencoders

A contractive autoencoder is another way of learning useful overcomplete representations. We do so by adding an explicit term to the loss that penalizes uninteresting solutions (i.e. that penalizes just copying the input). Thus we have a new loss function, extended from an existing loss function:

l(f (x (t) )) + λ||∇x(t) h(x (t) )||²F

Where λ is a hyperparameter, ∇x(t) h(x (t) ) is the Jacobian of the encoder (the encoder is represented as h(x (t) )), and ||A||F is the Frobenius norm:

||A||F = sqrt( ∑i=1..m ∑j=1..n |aij |² )

Where A is an m × n matrix. To put it another way, the Frobenius norm is the square root of the sum of the absolute squares of a matrix's elements; in this case, the matrix is the Jacobian of the encoder.
Intuitively, the term we're adding to the loss (the squared Frobenius norm of the Jacobian) increases the loss if the encoder h(x (t) ) has non-zero partial derivatives with respect to the input; this essentially means we want to encourage the encoder to throw away information (i.e. we don't want the encoder's output to change with changes to the input; i.e. we want the encoder to be invariant to the input). We balance this out with the original loss function which, as usual, encourages the encoder to keep good information (information that is useful for reconstructing the original input).

By combining these two conflicting priorities, the result is that the encoder keeps only the good information (the latter term encourages it to throw all information away, the former term encourages it to keep only the good stuff). The λ hyperparameter lets us tweak which of these terms to prioritize.

Contractive vs denoising autoencoders

Both perform well and each has its own advantages.

Denoising autoencoders are simpler to implement in that they are a simple extension of regular autoencoders and do not require computing the Jacobian of the hidden layer.

Contractive autoencoders have a deterministic gradient (since no sampling is involved; i.e. no random noise), which means second-order optimizers can be used (conjugate gradient, L-BFGS, etc), and they can be more stable than denoising autoencoders.

Deep autoencoders

Autoencoders can have more than one hidden layer, but they can be quite difficult to train (e.g. with small initial weights, the gradient dies). They can be trained with unsupervised layer-by-layer pre-training (stacking RBMs), or care can be taken in weight initialization.

Shallow autoencoders for pre-training

A shallow autoencoder is just an autoencoder with one hidden layer. In particular, we can create a deep autoencoder by stacking (shallow) denoising autoencoders. This typically works better than pre-training with RBMs. Alternatively, (shallow) contractive autoencoders can be stacked, and they also work very well for pre-training.

14.13.2 Sparse Coding

The sparse coding model is another unsupervised neural network. The general problem is that for each input x (t) , we want to find a latent representation h(t) such that:

• h(t) is sparse (has many zeros)
• we can reconstruct the original input x (t) as well as possible

Formally:

min_D (1/T ) ∑t=1..T min_h(t) [ (1/2)||x (t) − Dh(t) ||²₂ + λ||h(t) ||₁ ]

Note that Dh(t) is the reconstruction x̂ (t) , so the term ||x (t) − Dh(t) ||²₂ is the reconstruction error. D is the matrix of weights; in the context of sparse coding it is called a dictionary matrix, and it is equivalent to an autoencoder's output weight matrix. The term ||h(t) ||₁ is a sparsity penalty, to encourage h(t) to be sparse, by penalizing its L1 norm.

We constrain the columns of D to be of norm 1 because otherwise D could just grow large, allowing h(t) to become small (i.e. sparse). Sometimes the columns of D are constrained to be no greater than norm 1 instead of being exactly 1.

14.13.3 Restricted Boltzmann machines

Restricted Boltzmann machines (RBMs) are a type of neural network used for unsupervised learning; they try to extract meaningful features. Such methods are useful when we have a small supervised training set but perhaps abundant unlabeled data.
We can train an RBM (or another unsupervised learning method) on the unlabeled data to learn useful features to use with the supervised training set - this approach is called semisupervised learning. 14.13.4 Deep Belief Nets Deep belief networks are a generative neural network. Given some feature values, a deep belief net can be run “backwards” and generate plausible inputs. For example, if you train a DBN on handwritten digits, it can be used to generate new images of handwritten digits. Deep belief nets are also capable of unsupervised and semi-supervised learning. In an unsupervised setting, DBNs can still learn useful features. CHAPTER 14. NEURAL NETS 427 14.14. OTHER NEURAL NETWORKS 14.14 14.14.1 428 Other neural networks Modular Neural Networks So say we have trained a neural net which has learned our function W , and given a word input, it outputs us the word’s high-dimensional vector representation. We can re-use this network in a modular fashion so that we construct a larger neural net which can take a fixed-size set of words as input. For example, the following network takes in five words, from which we get their representations, which are then passed into another network R to yield some output s. A modular neural network (Bottou (2011)) 14.14.2 Recursive Neural Networks Using modular neural networks like above is limiting in the fact that we can only accept a fixed number of inputs. We can get around this by adding an association module A, which takes two representations and merges them. As you can see, it can take either a reputation from a word (via a W module) or from a phrase (via another A module). We probably don’t want to merge words linearly though. Instead we might want to group words in some way: 428 CHAPTER 14. NEURAL NETS 429 14.14. OTHER NEURAL NETWORKS Using association modules (Bottou (2011)) A recursive neural network (Bottou (2011)) CHAPTER 14. NEURAL NETS 429 14.14. OTHER NEURAL NETWORKS 430 This kind of model is a “recursive neural network” (sometimes “tree-structured neural network”) because it has modules feeding into modules of the same type. 14.14.3 Nonlinear neural nets In typical NNs, the architecture of the network is specified before hand and is static - neurons don’t change connections. In a nonlinear neural net, however, the connections between neurons becomes dynamic, so that new connections may form and old connections may break. This is more like how the human brain operates. But so far at least, these are very complex and difficult to train. 14.14.4 Neural Turing Machines A Neural Turing Machine is a neural network enhanced with external addressable memory (and a means of interfacing with it). Like a Turing machine, it can simulate any arbitrary procedure - in fact, given an input sequence and a target output sequence, it can learn a procedure to map between the two on its own, trainable via gradient descent (as the entire thing is differentiable). The basic architecture of NTMs is that there is a controller (which is a neural network, typically an RNN, e.g. LSTM, or a standard feedforward network), read/write heads (the write “head” actually consists of two heads, an erase and an add head, but referred to as a single head), and a memory matrix Mt ∈ RN×M . Each row (of which there are N, each of size M) in the memory matrix is referred to as a memory “location”. 
Unlike a normal Turing machine, the read and write operations are "blurry" in that they interact in some way with all elements in memory (normal Turing machines address one element at a time). There is an attentional "focus" mechanism that constrains the memory interaction to a smaller portion - each head outputs a weighting vector which determines how much it interacts (i.e. reads or writes) with each location.

At time t, the read head emits a (normalized) weighting vector over the N locations, wt . From this we get the M-length read vector rt :

rt = ∑i wt (i )Mt (i )

At time t, the write head emits a weighting vector wt (note that the write and read heads each emit their own wt that is used in the context of that head) and an erase vector et whose M elements lie in the range (0, 1). Using these vectors, the memory vectors Mt−1 (i ) (i.e. locations) from the previous time-step are updated:

M̃t (i ) = Mt−1 (i )[1 − wt (i )et ]

Where 1 is a row vector of all ones and the multiplication against the memory location is pointwise. Thus a memory location is erased (all elements set to zero) if wt (i ) and et are all ones, and if either is all zeros, then the memory is unchanged.

The write head also produces an M-length add vector at , which is added to the memory after the erase step:

Mt (i ) = M̃t (i ) + wt (i )at

So, how are these weight vectors wt produced for each head? For each head, two addressing mechanisms are combined to produce its weighting vectors:

• content-based addressing: focus attention on locations similar to the controller's outputted values
• location-based addressing: conventional lookup by location

Content-based addressing

Each head produces a length-M key vector kt . kt functions as a lookup key; we want to find an entry in Mt most similar to kt . A similarity function K (e.g. cosine similarity) is applied to kt against all entries in Mt . The similarity value is multiplied by a "key strength" βt > 0, which can attenuate the focus of attention. Then the resulting vector of similarities is normalized by applying softmax. The resulting weighting vector is wtc :

wtc (i ) = exp(βt K(kt , Mt (i ))) / ∑j exp(βt K(kt , Mt (j)))

Location-based addressing

The location-based addressing mechanism is used to move across memory locations iteratively (i.e. given a current location, move to this next location; this is called a rotational shift) and for random-access jumps.

Each head outputs a scalar interpolation gate gt in the range (0, 1). This is used to blend the old weighting outputted by the head, wt−1 , with the new weighting from the content-based addressing system, wtc . The result is the gated weighting wtg :

wtg = gt wtc + (1 − gt )wt−1

If the gate is zero, the content weighting is ignored and only the previous weighting is used.

(TODO not totally clear on this part) Next, the head also emits a shift weighting st which specifies a normalized distribution over the allowed integer shifts. For example, if shifts between -1 and 1 are allowed, st has three elements describing how much the shifts of -1, 0, and 1 are performed. One way of doing this is by adding a softmax layer of the appropriate size to the controller. Then we apply the rotation specified by st to wtg :

w̃t (i ) = ∑j=0..N−1 wtg (j)st (i − j)

If the shift weighting isn't "sharp", it can cause the weightings to disperse over time.
For example, with permitted shifts of -1, 0, and 1 and st = [0.1, 0.8, 0.1], a single point of focus gets slightly blurred across the three points. To counter this, each head also emits a scalar γt ≥ 1 that is used to (re)sharpen the final weighting:

wt (i ) = w̃t (i )^γt / ∑j w̃t (j)^γt

Refer to the paper for example uses.

14.15 Neuroevolution

Neuroevolution is the process of applying evolutionary algorithms to neural networks to learn their parameters (weights) and/or architecture (topology). Neuroevolution is flexible in its application; it may be used for supervised, unsupervised, and reinforcement learning tasks. An example application is state or action value evaluation, e.g. for game playing.

With neuroevolution, an important choice is the genetic representation (genotype) of the neural network. For instance, if the architecture is fixed by the user, the weights can just be genetically represented as a vector of real numbers. Then the standard genetic algorithm (i.e. fitness, mutation, crossover, etc) can be applied. This simple representation of weights as a vector is called conventional neuroevolution (CNE).

However, because the performance of a neural net is so dependent on topology, evolving the topology in addition to the weights can lead to better performance. One such method is NeuroEvolution of Augmenting Topologies (NEAT), of which there are many variations (e.g. RBF-NEAT, Cascade-NEAT).

In a direct encoding, the parameters are mapped one-to-one onto the vector; that is, each weight is mapped to one number in the vector. However, there may be an advantage to using indirect encodings, in which information in one part of the vector may be linked to another part. This compacts the genetic representation in that not every value must be represented (some are shared, mapping to multiple connections).

A Compositional Pattern Producing Network (CPPN) is a neural network which functions as a pattern generator. CPPNs typically include different activation functions (such as sine, for repeating patterns, or Gaussian, to create symmetric patterns). Although they were originally designed to produce two-dimensional patterns (e.g. images), CPPNs may be used to evolve indirectly encoded neural networks - they "exploit geometric domain properties to compactly describe the connectivity pattern of a large-scale ANN" (Risi & Togelius). The CPPN itself may be evolved using NEAT - this approach is called HyperNEAT.

One form of indirect encoding is the developmental approach, in which the network develops new connections as the game is being played.

In non-deterministic games, the fitness function may be noisy (since the same action can lead to different scores). One way around this is to average the performance over many independent plays.

For complex problems, it is sometimes too difficult to evolve the network directly on that problem. Instead, staging (also called incremental evolution) may be preferred, where the network is evolved on simpler problems that gradually increase towards the original complex task. Similarly, transfer learning may be useful here as well.

A challenge in evolving competitive AI is that there may not be a good enough opponent to play against and learn from. A method called competitive coevolution can be used, in which the fitness of one AI player depends on how it performs against another AI player drawn from the same or from another population.
A similar method called cooperative coevolution, where fitness is instead based on its performance in collaboration with other players, may make more sense in other contexts. It may be adapted more generally by applying it at the individual neuron level - that is, each neuron’s fitness depends on how well it works with the other neurons in the network. The CoSyNE neuroevolution algorithm is based on this. In many cases, there is no single performance metric that can be used; rather, performance is evaluated based on many different dimensions. The simplest way around this is to combine these various metrics in some way - e.g. as a linear combination - but another way is cascading elitism, where “each generation contains separate selection events for each fitness function, ensuring equal selection pressure” (Riesi & Togelius). There is another class of algorithms called multiobjective evoluationary algorithm (MOEA) where multiple fitness functions are specified. These algorithms try to satisfy all their given objectives (fitness functions) and can also manage conflicts between objectives by identifying (mapping) them and deciding on tradeoffs. When a solution is found where no objective can be further improved without worsening another, the solution is said to be on the Pareto Front. One such MOEA is NSGA-II (Non-dominated Sorting Genetic Algorithm). There exist interactive evolution approaches in which a human can set or modify objectives during evolution, or even act as the fitness function themselves. Other ways humans can intervene include shaping, where the human can shape the environment to influence training, and demonstration, in which the human takes direct control and the network learns from that example. CHAPTER 14. NEURAL NETS 433 14.16. GENERATIVE ADVERSARIAL NETWORKS 14.16 434 Generative Adversarial Networks Generative models are typically trained with maximum-likelihood estimation which can become intractable (due to the normalization/partition term). Generative adversarial networks (GAN) are a method for training generative models with neural networks, trained with stochastic gradient descent instead of MLE. Sampling from the model is achieved by inputting noise; the outputs of the networks are the samples. A conditional generative adversarial network (cGAN) is an extension which allows the model to condition on external information. Note that denoising autoencoders have been used to achieve something similar. Denoising autoencoders learn to reconstruct empirical data X from noised inputs X̃ and can be sampled from by using a Markov chain, alternating between sampling reconstructed values P (X|X̃) and noise C(X̃|X), which eventually reaches a stationary distribution which matches the empirical density model established by the training data. (this method under the category of generative stochastic networks). GANs in contrast, have a much simpler sampling project (they don’t require a Markov chain), they require only noise input. A GAN has two components: • the generator G, which attempts to generate fraudulent, but convincing, samples • the discriminator D, which tries to distinguish fraudulent samples from genuine ones These two are pitted against each other in an adversarial game. As such, the objective function here is a minimax value function: min max(Ex∼pdata (x) [log D(x)] + Ez∼pz (z) [log(1 − D(G(z)))]) G D Breaking this down: 1. Train the discriminator to maximize the probability of the training data 2. 
Train the discriminator to minimize the probability of the data sampled from the generator. At the same time, train the generator on the opposite objective (maximize the probability that the discriminator assigns to its own samples).

They are trained in alternation using stochastic gradient descent.

This paper incorporates conditioning into this general GAN framework. Some condition y is established for generation; this restricts the generator in its output and the discriminator in its expected input.

• Z is the noise space used to seed the generative model. Z = Rdz where dz is a hyperparameter. Values z ∈ Z are sampled from a noise distribution pz (z) (it can be, for example, a simple Gaussian noise model).
• Y is an embedding space used to condition the generative model on some external information, drawn from the training data. Y = RdY where dY is a hyperparameter. Using the conditional information provided in the training data, we can define a density model py (y ).
• X is the data space which represents an image output from the generator or input to the discriminator. Each input is associated with some conditional data y , so we have a density model pdata (x, y ).

We have two functions:

• G : (Z × Y ) → X is the generative model/generator which takes noise data z ∈ Z along with an embedding y ∈ Y and produces an output x ∈ X.
• D : (X × Y ) → [0, 1] is the discriminative model/discriminator which takes an input x and condition y and predicts the probability under condition y that x came from the empirical data distribution rather than from the generative model.

The generator G implicitly defines a conditional density model pg (x|y ). We combine this density model with the existing conditional density py (y ) to yield the joint model pg (x, y ). The task is to parameterize G so that it replicates the empirical density model pdata (x, y ). The conditional GAN objective function becomes:

minG maxD ( Ex,y∼pdata(x,y) [log D(x, y )] + Ey∼py, z∼pz(z) [log(1 − D(G(z, y ), y ))] )

The conditional data y is sampled from either the training data or an independent distribution.

In terms of cost functions: we have a batch of training data {(xi , yi )}, i = 1, . . . , n, and zi drawn from the noise prior. The cost equation for the discriminator D is a simple logistic cost expression (to give a positive label to input truly from the data distribution and a negative label to counterfeit examples):

JD = −(1/2n) ( ∑i=1..n log D(xi , yi ) + ∑i=1..n log(1 − D(G(zi , yi ), yi )) )

The cost equation for G is (to maximize the probability the discriminator assigns to samples from G, i.e. to trick the discriminator):

JG = −(1/n) ∑i=1..n log D(G(zi , yi ), yi )

Note that a "maximally confused" discriminator would output 0.5 for both true and counterfeit examples.

Note that we have to be careful how we draw the conditional data y . We can't just use conditional samples from the data itself because the generator may just learn to reproduce true input based on the conditional input. Instead, we build a kernel density estimate py (y ) (called a Parzen window estimate) using the conditional values in the training data. We use a Gaussian kernel and cross-validate the kernel width σ using a held-out validation set. Then we draw samples from this density model to use as conditional inputs.
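As a concrete illustration of this last step, here is a minimal sketch (not the paper's actual code) of fitting a Parzen window estimate to the conditional values and sampling from it, assuming scikit-learn is available; the array Y_train is a hypothetical stand-in for the conditional embeddings in the training data.

import numpy as np
from sklearn.neighbors import KernelDensity
from sklearn.model_selection import GridSearchCV

# Hypothetical conditional values y drawn from the training data
# (one row per training example's conditioning embedding).
Y_train = np.random.randn(1000, 8)

# Cross-validate the Gaussian kernel width (sigma), as described above.
grid = GridSearchCV(KernelDensity(kernel="gaussian"),
                    {"bandwidth": np.logspace(-1, 1, 10)},
                    cv=5)
grid.fit(Y_train)
parzen = grid.best_estimator_

# Draw conditional inputs y for the generator from the estimated density p_y(y).
y_samples = parzen.sample(64)  # e.g. one y per element of a minibatch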
14.16.1 Training generative adversarial networks

We have:

• x = the data
• pz (z), a prior for drawing noise samples
• pg , which is the generator's distribution that we learn
• G(z; θg ), the generator function (i.e. the generator neural network), which takes as input a noise sample z, parametrized by θg , mapping to the space of x (that is, it outputs a fraudulent sample from x)
• D(x; θd ), the discriminator function (i.e. the discriminator neural network), which takes as input the output from G, and outputs a scalar which is the estimated probability that the input came from x rather than from pg .

Together, D and G play a two-player minimax game with the value function V (G, D):

minG maxD ( Ex∼pdata(x) [log D(x)] + Ez∼pz(z) [log(1 − D(G(z)))] )

We simultaneously train D to maximize the probability of assigning the correct labels and train G to minimize log(1 − D(G(z))). In particular, we want to train D more quickly to be a more discerning discriminator, which causes G to be a better counterfeiter. However, we don't want to train D to completion first because it would result in overfitting (and is computationally prohibitive). Rather, we train D for k steps (k is a hyperparameter), then train G for one step, and repeat.

Another problem is that early on G is bad at creating counterfeits, and D can recognize them as such easily - this causes log(1 − D(G(z))) to saturate. So instead of training G to minimize log(1 − D(G(z))), we can train it to maximize log(D(G(z))).

The basic algorithm is:

m = minibatch_size
pz = noise_prior
px = data_distribution
for i in range(epochs):
    for j in range(k):
        # sample m noise samples from the noise prior
        z = sample_minibatch(m, pz)
        # sample m examples from the data
        x = sample_minibatch(m, px)
        # update the discriminator by ascending its stochastic gradient
        update_discriminator(m, z, x)
    # sample m noise samples from the noise prior
    z = sample_minibatch(m, pz)
    # update the generator by descending its stochastic gradient
    update_generator(m, z)

Where update_discriminator has the gradient:

∇θd (1/m) ∑i=1..m [ log D(x (i) ) + log(1 − D(G(z (i) ))) ]

and update_generator has the gradient:

∇θg (1/m) ∑i=1..m log(1 − D(G(z (i) )))

The paper used momentum for the gradient updates.

14.17 References

• Neural Computing: Theory and Practice (1989). Philip D. Wasserman.
• MIT 6.034 (Fall 2010): Artificial Intelligence. Patrick H. Winston. MIT.
• CS231n Convolutional Neural Networks for Visual Recognition, Module 1: Neural Networks Part 2: Setting up the Data and the Loss. Andrej Karpathy.
• Understanding LSTM Networks. Chris Olah. August 27, 2015.
• Crash Introduction to Artificial Neural Networks. Ivan Galkin.
• Deep Learning. Yoshua Bengio, Ian Goodfellow, Aaron Courville.
• The Nature of Code. Daniel Shiffman.
• Neural Networks and Deep Learning. Michael A. Nielsen. Determination Press, 2015.
• Neural Networks. Christos Stergiou & Dimitrios Siganos.
• A Step by Step Backpropagation Example. Matt Mazur. March 17, 2015.
• Gradient Descent with Backpropagation. Brandon B. July 31, 2015.
• A Primer on Neural Network Models for Natural Language Processing. Yoav Goldberg. October 5, 2015.
• Neural Networks for Machine Learning. Geoff Hinton. 2012. University of Toronto (Coursera).
• CS231n Convolutional Neural Networks for Visual Recognition, Module 1: Neural Networks Part 1: Setting up the Architecture. Andrej Karpathy.
• CS231n Convolutional Neural Networks for Visual Recognition, Module 1: Backpropagation, Intuitions. Andrej Karpathy. • Neural Turing Machines. Alex Graves, Greg Wayne, Ivo Danihelka. 2014. • Composing Music with Recurrent Neural Networks. Daniel Johnson. August 3, 2015. • Neural Networks. Hugo Larochelle. 2013. Université de Sherbrooke. • General Sequence Learning using Recurrent Neural Networks. Alec Radford. • Recurrent Neural Networks Tutorial, Part 3 – Backpropagation Through Time and Vanishing Gradients. Denny Britz. October 8, 2015. • Recurrent Neural Network Tutorial, Part 4 – Implementing a GRU/LSTM RNN with Python and Theano. Denny Britz. October 27, 2015. • How to implement a recurrent neural network Part 1. Peter Roelants. • Debugging: Gradient Checking. Stanford UFLDL. • A Basic Introduction to Neural Networks. ai-junkie. • Neural Networks in Plain English. • Understanding Natural Language with Deep Neural Networks Using Torch. Soumith Chintala. • 26 Things I Learned in the Deep Learning Summer School. Marek Rei. • Conv Nets: A Modular Perspective. Chris Olah. • Understanding Convolutions. Chris Olah. • Deep Learning, NLP, and Representations. Chris Olah. • How to choose the number of hidden layers and nodes in feedforward neural network. gung, doug. • comp.ai.neural-nets FAQ. Warren S. Sarle. • Fundamentals of Deep Learning. Nikhil Buduma. 2015. • CS231n Convolutional Neural Networks for Visual Recognition, Module 1: Setting up the data and the model. Andrej Karpathy. • CS231n Convolutional Neural Networks for Visual Recognition, Module 1: Modeling one neuron. Andrej Karpathy. • Deep Learning Glossary. WildML (Denny Britz). • Batch Normalized Recurrent Neural Networks. César Laurent, Gabriel Pereyra, Philémon Brakel, Ying Zhang, Yoshua Bengio. • An overview of gradient descent optimization algorithms. Sebastian Rudr. • Neuroevolution in Games: State of the Art and Open Challenges. Sebastian Risi, Julian Togelius. November 3, 2015. • Neuroevolution: from architectures to learning. Dario Floreano, Peter Dürr, Claudio Mattiussi. • Conditional generative adversarial nets for convolutional face generation. Jon Gauthier. • Generative Adversarial Nets. Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, Youshua Bengio. • Must Know Tips/Tricks in Deep Neural Networks. Xiu-Shen Wei. 438 CHAPTER 14. NEURAL NETS 439 • • • • 14.17. REFERENCES Understanding Convolution in Deep Learning. Tim Dettmers. theano conv documentation Attention and Memory in Deep Learning and NLP. Denny Britz. Chris Olah & Shan Carter, “Attention and Augmented Recurrent Neural Networks”, Distill, 2016. CHAPTER 14. NEURAL NETS 439 14.17. REFERENCES 440 440 CHAPTER 14. NEURAL NETS 441 15 Model Selection Model selection is the process of choosing between different machine learning approaches - e.g. SVM, logistic regression, etc - or choosing between different hyperparameters or sets of features for the same machine learning approach - e.g. deciding between the polynomial degrees/complexities for linear regression. The choice of the actual machine learning algorithm (e.g. SVM or logistic regression) is less important than you’d think - there may be a “best” algorithm for a particular problem, but often its performance is not much better than other well-performing approaches for that problem. There may be certain qualities you look for in an model: • • • • • Interpretable - can we see or understand why the model is making the decisions it makes? 
• Simple - easy to explain and understand
• Accurate
• Fast (to train and test)
• Scalable (it can be applied to a large dataset)

Though there are generally trade-offs amongst these qualities.

15.1 Model evaluation

In order to select amongst models, we need some way of evaluating their performance. You can't evaluate a model's hypothesis function with the cost function because minimizing the error can lead to overfitting.

A good approach is to take your data and split it randomly into a training set and a test set (e.g. a 70%/30% split). Then you train your model on the training set and see how it performs on the test set.

For linear regression, you might do things this way:

• Learn parameter θ from the training data by minimizing the training error J(θ).
• Compute the test set error, using the squared error (mtest is the test set size):

Jtest (θ) = (1/(2mtest)) ∑i=1..mtest (hθ (xtest (i) ) − ytest (i) )²

For logistic regression, you might do things this way:

• Learn parameter θ from the training data by minimizing the training error J(θ).
• Compute the test set error (mtest is the test set size):

Jtest (θ) = −(1/mtest) ∑i=1..mtest [ ytest (i) log hθ (xtest (i) ) + (1 − ytest (i) ) log(1 − hθ (xtest (i) )) ]

• Alternatively, you can use the misclassification error ("0/1 misclassification error", read "zero-one"), which is just the fraction of examples that your hypothesis has mislabeled:

err(hθ (x), y ) = 1 if hθ (x) ≥ 0.5 and y = 0, or if hθ (x) < 0.5 and y = 1; 0 otherwise

test error = (1/mtest) ∑i=1..mtest err(hθ (xtest (i) ), ytest (i) )

A better way of splitting the data is to not split it only into training and testing sets, but to also include a validation set. A typical ratio is 60% training, 20% validation, 20% testing. So instead of just measuring the test error, you would also measure the validation error.

Validation is used mainly to tune hyperparameters - you don't want to tune them on the training set because that can result in overfitting, nor do you want to tune them on your test set because that results in an overly optimistic estimation of generalization. Thus we keep a separate set of data for the purpose of validation, that is, for tuning the hyperparameters - the validation set.

You can use these errors to identify what kind of problem you have if your model isn't performing well:

• If your training error is large and your validation/test set error is large, then you have a high bias (underfitting) problem.
• If your training error is small and your validation/test set error is large, then you have a high variance (overfitting) problem.

Because the test set is used to estimate the generalization error, it should not be used for "training" in any sense - this includes tuning hyperparameters. You should not evaluate on the test set and then go back and tweak things - this will give an overly optimistic estimation of generalization error.
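A minimal sketch of this splitting-and-evaluation workflow in plain numpy; the 60/20/20 ratio matches the text, while the helper names and the generic hypothesis function h are illustrative, not from any particular library.

import numpy as np

def train_val_test_split(X, y, seed=0):
    # Randomly split the data 60/20/20 into train/validation/test sets.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_train, n_val = int(0.6 * len(X)), int(0.2 * len(X))
    tr, va, te = np.split(idx, [n_train, n_train + n_val])
    return (X[tr], y[tr]), (X[va], y[va]), (X[te], y[te])

def squared_error_cost(h, X, y):
    # J(theta) for regression: squared error with the 1/(2m) convention.
    m = len(y)
    return np.sum((h(X) - y) ** 2) / (2 * m)

def misclassification_error(h, X, y):
    # 0/1 error for a classifier h that returns probabilities.
    preds = (h(X) >= 0.5).astype(int)
    return np.mean(preds != y)

# Usage: fit on the training set, tune hyperparameters on the validation set,
# and report the error on the test set exactly once at the end.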
Some ways of evaluating a model's performance on (some of) your known data are:

• hold out (just set aside some portion of the data for validation; this is less reliable if the amount of data is small, such that the held-out portion is very small)
• k-fold cross-validation (better than hold out for small datasets)
  – the training set is divided into k folds
  – iteratively take k − 1 folds for training and validate on the remaining fold
  – average the results
  – there is also "leave-one-out" cross-validation, which is k-fold cross-validation where k = n (n is the number of datapoints)
• bootstrapping
  – new datasets are generated by sampling with replacement (uniformly at random) from the original dataset
  – then train on the bootstrapped dataset and validate on the unselected data
• jackknife resampling
  – essentially equivalent to leave-one-out cross-validation, since leave-one-out is basically sampling without replacement

15.1.1 Validation vs Testing

Validation refers to the phase where you are tuning your model and its hyperparameters. Once you do that, you want to test this model on a new set of data it has not seen yet (i.e. data which has not been used in cross-validation or bootstrapping or whatever method you used). This is to simulate the model's performance on completely new data and see how it does, which is the most important quality of a model.

15.2 Evaluating regression models

The main techniques for evaluating regression models are:

• mean absolute error
• median absolute error
• (root) mean squared error
• coefficient of determination (R²)

15.2.1 Residuals

A residual ei is the difference between the observed and predicted outcome, i.e.:

ei = yi − ŷi

This can also be thought of as the vertical distance between an observed data point and the regression line.

Fitting a line by least squares minimizes ∑i=1..n ei²; that is, it minimizes the mean squared error (MSE) between the line and the data. But there always remains some error from the fit line; this remaining error is the residual.

Alternatively, the mean absolute error or median absolute error can be used instead of the mean squared error.

The ei can be interpreted as estimates of the regression errors ϵi , since we can only compute the true errors if we know the true model parameters.

We can measure the quality of a linear model, which is called goodness of fit. One approach is to look at the variation of the residuals. You can also use the coefficient of determination (R²), explained previously, which measures the variance explained by the least squares line.

Residual (error) variation

Residual variation measures how well a regression line fits the data points.

The average squared residual (the estimated residual variance) is the same as the mean squared error, i.e. σ̂² = (1/n) ∑i=1..n ei². However, to make this estimator unbiased, you're more likely to see:

σ̂² = (1/(n − 2)) ∑i=1..n ei²

That is, with the degrees of freedom taken into account (here for intercept and slope, which both have to be estimated). The square root of this estimated variance, σ̂, is the root mean squared error (RMSE).

Coefficient of determination

The total variation is equal to the residual variation (variation after removing the predictor) plus the systematic/regression variation (the variation explained by the regression model):

∑i=1..n (Yi − Ȳ )² = ∑i=1..n (Yi − Ŷi )² + ∑i=1..n (Ŷi − Ȳ )²
R² (0 ≤ R² ≤ 1) is the percent of total variability that is explained by the regression model, that is:

R² = regression variation / total variation = ∑i=1..n (Ŷi − Ȳ )² / ∑i=1..n (Yi − Ȳ )² = 1 − residual variation / total variation = 1 − ∑i=1..n (Yi − Ŷi )² / ∑i=1..n (Yi − Ȳ )²

R² can be a misleading summary of model fit since deleting data or adding terms will inflate it.

TODO combine the below

15.2.2 Coefficient of determination

Example of error

For a line y = mx + b, the error of a point (xn , yn ) against that line is:

yn − (mxn + b)

Intuitively, this is the vertical difference between the point on the line at xn and the actual point at xn .

The squared error of the line is the sum of the squares of all of these errors:

SEline = ∑i (yi − (mxi + b))²

To get a best-fit line, you want to minimize this squared error. That is, you want to find the m and b which minimize SEline . This works out as:

m = (x̄ ȳ − xy‾) / ((x̄)² − x²‾)
b = ȳ − mx̄

Here a bar over an expression denotes the mean of that expression's values: x̄ and ȳ are the means of the xi and yi , xy‾ is the mean of the products xi yi , and x²‾ is the mean of the squares, i.e. x²‾ = (x1² + x2² + · · · + xn²)/n.

The line that these values yield is the regression line.

Note that you can alternatively calculate the regression line slope m with the covariance and variance:

m = Cov(x, y ) / Var(x)

We can calculate the total variation in y , SEȳ , as:

SEȳ = ∑i (yi − ȳ )²

And then we can calculate the percentage of the total variation in y described by the regression line:

1 − SEline / SEȳ

This is known as the coefficient of determination or R-squared. The closer R-squared is to 1, the better a fit the line is.

15.3 Evaluating classification models

Important quantities:

• Sensitivity: TP / (TP + FN)
• Specificity: TN / (TN + FP)
• Positive predictive value: TP / (TP + FP)
• Negative predictive value: TN / (TN + FN)
• Accuracy: (TP + TN) / (TP + FP + TN + FN)

15.3.1 Area under the curve (AUC)

This method is for binary classification and multilabel classification.

In binary classification you may choose some cutoff above which you assign a sample to one class, and below which you assign a sample to the other class. Depending on your cutoff, you will get different results - there is a trade-off between the true and false positive rates.

You can plot a Receiver Operating Characteristic (ROC) curve, which has for its y-axis P (TP) and for its x-axis P (FP). Every point on the curve corresponds to a cutoff value. That is, the ROC curve visualizes a sweep through all the cutoff thresholds, so you can see the performance of your classifier across all cutoff thresholds, whereas other metrics (such as the F-score and so on) only tell you the performance for one particular cutoff. By looking at all thresholds at once, you get a more complete and honest picture of how your classifier is performing, in particular, how well it is separating the classes. It is insensitive to the bias of the data's classes - that is, if there are way more or way fewer of the positive class than of the negative class (other metrics may be deceptively favorable or punishing in such unbalanced circumstances).

The area under the curve (AUC) is used to quantify how good the classification algorithm is. In general, an AUC above 0.8 is considered "good". An AUC of 0.5 (a straight line) is equivalent to random guessing.

ROC curves

So ROC curves (and the associated AUC metric) are very useful for evaluating binary classification.
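A minimal sketch of computing an ROC curve and its AUC, assuming scikit-learn is available; the labels and scores here are made up purely for illustration.

import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 1])
y_score = np.array([0.9, 0.3, 0.8, 0.45, 0.2, 0.55, 0.7, 0.1, 0.35, 0.6])

# One (false positive rate, true positive rate) point per threshold.
fpr, tpr, thresholds = roc_curve(y_true, y_score)

# Area under that curve: ~0.5 for random guessing, 1.0 for perfect separation.
print(roc_auc_score(y_true, y_score))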
Note that ROC curves can be extended to classification of three or more classes by using the one-vs-all approach (see the section on classification).

TODO incorporate the explanation below as well:

AUC is a metric for binary classification and is especially useful when dealing with high-bias data, that is, where one class is much more common than the other. Using accuracy as a metric falls apart in high-bias datasets: for example, say you have 100 training examples, one of which is positive, the rest of which are negative. You could develop a model which just labels everything negative, and it would have 99% accuracy. So accuracy doesn't really tell you enough here.

Many binary classifiers output some continuous value (0-1), rather than class labels; there is some threshold (usually 0.5) above which one label is assigned, and below which the other label is assigned. Some models may work best with a different threshold. Changing this threshold leads to a trade-off between true positives and false positives - for example, decreasing the threshold will yield more true positives, but also more false positives. AUC runs over all thresholds and plots the true vs false positive rates. This curve is called a receiver operating characteristic curve, or ROC curve. A random classifier would give you equal false and true positives, which leads to an AUC of 0.5; the curve in this case would be a straight line. The better the classifier is, the more area under the curve there is (so the AUC approaches 1).

15.3.2 Confusion Matrices

This method is suitable for binary or multiclass classification.

For classification, evaluation often comes in the form of a confusion matrix. The core values are:

• True positives (TP): samples classified as positive which were labeled positive
• True negatives (TN): samples classified as negative which were labeled negative
• False positives (FP): samples classified as positive which were labeled negative
• False negatives (FN): samples classified as negative which were labeled positive

A few other metrics are computed from these values:

• Accuracy: How often is the classifier correct? ((TP + TN) / total)
• Misclassification rate (or "error rate"): How often is the classifier wrong? ((FP + FN) / total = 1 − accuracy)
• Recall (or "sensitivity" or "true positive rate"): How often are positive-labeled samples predicted as positive? (TP / number of positive-labeled examples)
• False positive rate: How often are negative-labeled samples predicted as positive? (FP / number of negative-labeled examples)
• Specificity (or "true negative rate"): How often are negative-labeled samples predicted as negative? (TN / number of negative-labeled examples)
• Precision: How many of the predicted positive samples are correctly predicted? (TP / (TP + FP))
• Prevalence: How many labeled-positive samples are there in the data? (number of positive-labeled examples / number of examples)
• F-score: The weighted average of recall and precision • Cohen’s Kappa: a measure of how well the classifier performs compared against if it had just guessed randomly, that is a high Kappa score happens when there is a big difference between the accuracy and the null error rate. • ROC Curve: (see the section on this) 15.3.3 Log-loss This method is suitable for binary, multiclass, and multilabel classification. Log-loss is an accuracy metric that can be used when the classifier output is not a class but a probability, as is the case with logistic regression. It penalizes the classifier based on how far off it is, e.g. if it predicts 1 with probability of 0.51 but the correct class is 0, it is less “wrong” than if it had predicted class 1 with probability 0.95. For a binary classifier, log-loss is computed: − N 1∑ yi log(ŷi ) + (1 − yi ) log(1 − ŷi ) n i Log-loss is the cross-entropy b/w the distribution of the true labels and the predictions. It is related to relative entropy (that is, Kullback-Leilber divergence). Intuitively, the way this works is the yi terms “turn on” the appropriate parts, e.g. when yi = 1 then the term yi log(ŷi ) is activated and the other is 0. The reverse is true when yi = 0. Because log(1) = 0, we get the best loss (0) when the term within the log operation is 1; i.e. when yi = 1 we want ŷi to equal 1, so the loss comes down to log(ŷi ), but when yi = 0, we want ŷi = 0, so the loss in that case comes down to log(1 − ŷi ). 15.3.4 F1 score The F1 score, also called the balanced F-score or F-measure, is the weighted average of precision and recall: F1 = 2 CHAPTER 15. MODEL SELECTION precision × recall precision + recall 449 15.4. METRIC SELECTION 450 The best score is 1 and the worst is 0. It can be used for binary, multiclass, and multilabel classification (for the latter two, use the weighted average of the F1 score for each class). 15.4 Metric selection When it comes to skewed classes (or high bias data), metric selection is more nuanced. For instance, say you have a dataset where only 0.5% of the data is in category 1 and the rest is in category 0. You run your model and find that it categorized 99.5% of the data correctly! But because of the skew in that data, your model could just be: classify each example in category 0, and it would achieve that accuracy. Note that the convention is to set the rare class to 1 and the other class to 0. That is, we try to predict the rare class. Instead, you may want to use precision/recall as your evaluation metric. 1T 0T 1P True positive False positive 0P False negative True negative Where 1T/0T indicates the actual class and 1P/0P indicates the predicted class. Precision is the number of true positives over the total number predicted as positive. That is, what fraction of the examples labeled as positive actually are positive? true positives true positives + false positives Recall is the number of true positives over the number of actual positives. That is, what fraction of the positive examples in the data were identified? true positives true positives + false negatives So in the previous example, our simple classifier would have a recall of 0. There is a trade-off between precision and recall. Say you are using a logistic regression model for this classification task. Normally, the category threshold in logistic regression is 0.5, that is, predict class 1 if hθ (x) ≥ 0.5 and predict class 0 if hθ (x) < 0.5. 450 CHAPTER 15. MODEL SELECTION 451 15.5. 
HYPERPARAMETER SELECTION But you may want to only classify an example as 1 if you’re very confidence. So you may change the threshold to 0.9 to be stricter about your classifications. In this case, you would increase precision, but lower recall since the model may not be confident enough about some of the more ambiguous positive examples. Conversely, you may want to lower the threshold to avoid false negatives, in which case recall increases, but precision decreases. So how do you compare precision/recall values across algorithms to determine which is best? You can condense precision and recall into a single metric: the F1 score (also just called the F-score, which is the harmonic mean of the precision and recall): F1 score = 2 PR P +R Although more data doesn’t always help, it generally does. Many algorithms perform significantly better as they get more and more data. Even relatively simple algorithms can outperform more sophisticated ones, solely on the basis of having more training data. If your algorithm doesn’t perform well, here are some things to try: • • • • • • Get more training examples (can help with high variance problems) Try smaller sets of features (can help with high variance problems) Try additional features (can help with high bias problems) Try adding polynomial features (x12 , x22 , x1 x2 , etc) (can help with high bias problems) Try decreasing the regularization parameter λ (can help with high bias problems) Try increasing the regularization parameter λ (can help with high variance problems) 15.5 Hyperparameter selection Another part of model selection is hyperparameter selection. Hyperparameter tuning is often treated as an art, i.e. without a reliable and practical systematic process for optimizing them. However, there are some automated methods that can be useful, including: • • • • grid search random search evolutionary algorithms Bayesian optimization Random search and grid search don’t perform particularly well but are worth being familiar with. CHAPTER 15. MODEL SELECTION 451 15.5. HYPERPARAMETER SELECTION 15.5.1 452 Grid search Just searching through combinations of different hyperparameters and seeing which combination performs the best. Generally hyperparameters are searched over specific intervals or scales, depending on the particular hyperparameter. It may be 10, 20, 30, etc or 1e-5, 1e-4, 1e-3, etc. It is easy to parallelize but quite brute-force. 15.5.2 Random search Surprisingly, randomly sampling from the full grid often works just as well as a complete grid search, but in much less time. Intuitively: if we want the hyperparameter combination leading to the top 5% of performance, then any random hyperparameter combination from the grid has a 5% chance of leading to that result. If we want to successfully find such a combination 95% of the time, how many random combinations do we need to run through? If we take n hyperparameter combinations, the probability that all n are outside of this 5% of top combinations is (1 − 0.05)n , so the probability that at least one is in the 5% is just 1 − (1 − 0.05)n . If we want to find one of these combinations 95% of the time, that is, we want the probability that at least one of them to be what we’re looking for to be 95%, then we just set 1 − (1 − 0.05)n = 0.95, and thus n ≥ 60, so we need to try only 60 random hyperparamter combinations at minimum to have a 95% chance of finding at least one hyperparameter combination that yields top 5% performance for the model. 
15.5.3 Bayesian Hyperparameter Optimization We can use Bayesian optimization to select good hyperparameters for us. We can sample hyperparameters from a Gaussian process (the prior) and use the result as observations to compute a posterior distribution. Then we select the next hyperparameters to try by optimizing the expected improvement over the current best result or the Gaussian process upper confidence bound (UCB). In particular, we choose an acquisition function to construct a utility function from the model posterior - this is what we use to decide what next set of hyperparameters to try. Basic idea: Model the generalization performance of an algorithm as a smooth function of its hyperparameters and then try to find the maxima. It has two parts: • Exploration: evaluate this function on sets of hyperparameters where the outcome is most uncertain • Exploitation: evaluate this function on sets of hyperparameters which seem likely to output high values Which repeat until convergence. 452 CHAPTER 15. MODEL SELECTION 453 15.5. HYPERPARAMETER SELECTION This is faster than grid search by making “educated” guesses as to where the optimal set of hyperparameters might be, as opposed to brute-force searching through the entire space. One problem is that computing the results of a hyperparameter sample can be very expensive (for instance, if you are training a large neural network). We use a Gaussian process because its properties allow us to compute marginals and conditionals in closed form. Some notation for the following: • f (x) is the function drawn from the Gaussian process prior, where x is the set of hyperparameters • observations are in the form {xn , yn }N n=1 , where yn ∼ N (f (xn ), v ) and v is the variance of noise introduced into the function observations • the acquisition function is a : X → R+ , where X is the hyperparameter space • the next set of hyperparameters to try is xnext = argmaxx a(x) • the current best set of hyperparameters is xbest • Φ() denotes the cumulative distribution function of the standard normal A few popular choices of acquisition functions include: • probability of improvement: with a Gaussian process, this can be computed analytically as: aPI (x; {xn , yn }θ) = Φ(γ(x)) f (xbest − µ(x; {xn , yn }, θ) γ(x) = σ(x; {xn , yn }, θ) • expected improvement: under a Gaussian process, this also has a closed form: aEI (x; {xn , yn }, θ) = σ(x; {xn , yn }, θ)(γ(x)Φ(γ(x)) + N (γ(x); 0, 1)) • Gaussian process upper confidence bound: use upper confidence bounds (when maximizing, otherwise, lower confidence bounds) to construct acquisition functions that minimize regret over the course of their optimization: aLCB (x; {xn , yn }, θ) = µ(x; {xn , yn }, θ) − κσ(x; {xn , yn }, θ) Where κ is tunable to balance exploitation against exploration. Some difficulties with Bayesian optimization of hyperparameters include: CHAPTER 15. MODEL SELECTION 453 15.5. HYPERPARAMETER SELECTION 454 • often unclear what the appropriate choice for the covariance function and its associated hyperparameters (these hyperparameters are distinct from the ones the method is optimizing; i.e. these are in some sense “hyper-hyperparameters”) • the function evaluation can be a time-consuming optimization procedure. One method is to optimize expected improvement per second, thereby taking wall clock time into account. That way, we prefer to evaluate points that are not only likely to be good, but can also be evaluated quickly. 
However, we don’t know the duration function c(x) : X → R+ , but we can use this same Gaussian process approach to model c(x) alongside f (x). Furthermore, we can parallelize these Bayesian optimization procedures (refer to paper) 15.5.4 Choosing the Learning Rate α You can plot out a graph with the number of gradient descent iterations on the x-axis and the values of minθ J(θ) on the y-axis and visualize how the latter changes with the number of iterations. At some point, that curve will flatten out; that’s about the number of iterations it took for gradient descent to converge on your particular problem. Decreasing error over number of iterations You could use an automatic convergence test which just declares convergence if J(θ) decreases by less than some threshold value in an iteration, but in practice that threshold value may be difficult to determine. You would expect this curve to be similar to the one above. mi nθ J(θ) should decrease with the 454 CHAPTER 15. MODEL SELECTION 455 15.6. CASH number of iterations, if gradient descent is working correctly. If not, then you should probably be using a smaller learning rate (α). But again, don’t make it too small or convergence will be slow. 15.6 CASH The particular problem is called the CASH problem (Combined Algorithm Selection and Hyperparameter optimization problem). It can be formalized as such: • A = {A(1) , . . . , A(R) } is a set of algorithms – the algorithm A(j) ’s hyperparameters has the domain Λ(j) • Dtrain = {(x1 , y1 ), . . . , (xn , yn )} be a training set (1) (K) (1) (K) – it is split into K cross-validation folds Dvalid , . . . , Dvalid and Dtrain , . . . , Dtrain (i) (i) (i) (j) • the loss L(A(j) achieves on Dvalid when trained on λ , Dtrain , Dvalid ) is the loss an algorithm A (i) Dtrain with hyperparameters λ we want to find the joint algorithm and hyperparameter settings that minimizes this loss: ∗ A , λ∗ ∈ argmin A(j) ∈A,λ∈Λ(j) K 1 ∑ (i) (i) L(A(j) λ , Dtrain , Dvalid ) K i=1 Approaches to this problem include the aforementioned Bayesian optimization methods. Meta-learning is another approach, in which machine learning is applied machine learning itself, that is, to algorithm and hyperparameter selection (and additionally feature preprocessing). The input data are different machine learning tasks and datasets, the output is a well-performing algorithm and hyperparameter combination. In meta-learning we learn “meta-features” to identify similar problems for which a algorithm and hyperparameter combination is good for. These meta-features can include things like the number of datapoints, features, and classes, the data skewness, the entropy of the targets, etc. Meta-learning can be combined with Bayesian optimization - it can be used to roughly identify good algorithm and hyperparameter choices, and Bayesian optimization can be used to fine-tune these choices. This approach of using meta-learning to support Bayesian optimization is called “warmstarting”. As Bayesian optimization searches for hyperparameters it may come across many well-performing hyperparameters that it discards because they are not the best. However, they can be saved to construct an (weighted) ensemble model, which usually outperforms individual models. The ensemble selection method seems to work best for constructing the ensemble: • start with an empty ensemble CHAPTER 15. MODEL SELECTION 455 15.7. 
REFERENCES 456 • iteratively, up to a specified ensemble size – add a model that maximizes ensemble validation performance Models are unweighted, but models can be added multiple time so the end result is a weighted ensemble. 15.7 References • Review of fundamentals, IFT725. Hugo Larochelle. 2012. • Exploratory Data Analysis Course Notes. Xing Su. • Mining Massive Datasets (Coursera & Stanford, 2014). Jure Leskovec, Anand Rajaraman, Jeff Ullman. • Machine Learning. 2014. Andrew Ng. Stanford University/Coursera. • CS188: Artificial Intelligence. Dan Klein, Pieter Abbeel. University of California, Berkeley (edX). • Evaluating Machine Learning Models. Alice Zheng. 2015. • Computational Statistics II (code). Chris Fonnesbeck. SciPy 2015. • Intro to Artificial Intelligence. CS271. Peter Norvig, Sebastian Thrun. Udacity. • MIT 6.034 (Fall 2010): Artificial Intelligence. Patrick H. Winston. MIT. • Deep Learning. Yoshua Bengio, Ian Goodfellow, Aaron Courville. • CS231n Convolutional Neural Networks for Visual Recognition, Module 1: Neural Networks Part 2: Setting up the Data and the Loss. Andrej Karpathy. • POLS 509: Hierarchical Linear Models. Justin Esarey. • Bayesian Inference with Tears. Kevin Knight, September 2009. • Learning to learn, or the advent of augmented data scientists. Simon Benhamou. • Practical Bayesian Optimization of Machine Learning Algorithms. Larochelle, Ryan P. Adams. Jasper Snoek, Hugo • What is the expectation maximization algorithm?. Chuong B Do & Serafim Batzoglou. • Gibbs Sampling for the Uninitiated. Philip Resnik, Eric Hardisty. June 2010. • Maximum Likelihood Estimation. Penn State Eberly College of Science. • Data Science Specialization. Johns Hopkins (Coursera). 2015. 456 CHAPTER 15. MODEL SELECTION 457 15.7. REFERENCES • Practical Machine Learning. Johns Hopkins (Coursera). 2015. • Elements of Statistical Learning. 10th Edition. Trevor Hastie, Robert Tibshirani, Jerome Friedman. • Model evaluation: quantifying the quality of predictions. scikit-learn. • Efficient and Robust Automated Machine Learning. Matthias Feurer, Aaron Klein, Katharina Eggensperger, Jost Tobias Springenberg, Manuel Blum, Frank Hutter. — prerequisites: CHAPTER 15. MODEL SELECTION 457 15.8. BAYES NETS 15.8 458 458 bayes nets CHAPTER 15. MODEL SELECTION 459 16 Bayesian Learning Bayesian machine learning is a particular set of approaches to probabilistic machine learning (for other probabilistic models, see Supervised Learning). Bayesian learning treats model parameters as random variables - in Bayesian learning, parameter estimation amounts to computing posterior distributions for these random variables based on the observed data. Bayesian learning typically involves generative models - one notable exception is Bayesian linear regression, which is a discriminative model. 16.1 Bayesian models Bayesian modeling treats those two problems as one. We first have a prior distribution over our parameters (i.e. what are the likely parameters?) P (θ). From this we compute a posterior distribution which combines both inference and learning: P (y1 , . . . , yn , θ|x1 , . . . , xn ) = P (x1 , . . . , xn , y1 , . . . , yn |θ)P (θ) P (x1 , . . . , xn ) Then prediction is to compute the conditional distribution of the new data point given our observed data, which is the marginal of the latent variables and the parameters: P (xn+1 |x1 , . . . , xn ) = ∫ P (xn+1 |θ)P (θ|x1 , . . . 
, xn )dθ Classification then is to predict the distributions of the new datapoint given data from other classes, then finding the class which maximizes it: CHAPTER 16. BAYESIAN LEARNING 459 16.1. BAYESIAN MODELS 460 P (xn+1 |x1c , . . . , xnc ) 16.1.1 ∫ = P (xn+1 |θc )P (θc |x1c , . . . , xnc )dθc Hidden Markov Models HMMs can be thought of as clustering over time; that is, each state is a “cluster”. The data points and latent variables are sequences, and πk becomes the transition probability given the state (cluster) k. θk∗ becomes the emission distribution for x given state k. 16.1.2 Model-based clustering • model data from heterogeneous unknown sources • K unknown sources (clusters) • each cluster/source is modeled using a parametric model (e.g. a Gaussian distribution) For a given data point i , we have: zi |π ∼ Discrete(π) Where zi is the cluster label for which data point i belongs to. This is the latent variable we want to discover. π is the mixing proportions which is the vector of probabilities for each class k, that is: π = (πi , . . . , πK )|α ∼ Dirichlet( α α ,..., ) K K That is, πk = P (zi = k). We also model each data point xi as being drawn from a source (cluster) like so, where F is however we are modeling the cluster (e.g. a Gaussian), parameterized by θz∗i , that is some parameters for the zi -labeled cluster: xi |zi , θk∗ ∼ F (θz∗i ) (Note that the star, as in θ∗ , is used to denote the optimal solution for θ.) For this approach we have two priors over parameters of the model: • For the mixing proportions, we typically use a Dirichlet prior (above) because it has the nice property of being a conjugate prior with multinomial distributions. • For each cluster k we use some prior H, that is θk∗ |H ∼ H. Graphically, this is: 460 CHAPTER 16. BAYESIAN LEARNING 461 16.1. BAYESIAN MODELS Model-based clustering plate model 16.1.3 Naive Bayes The main assumption of Naive Bayes is that all features are independent effects of the label. This is a really strong simplifying assumption but nevertheless in many cases Naive Bayes performs well. Naive Bayes is also statistically efficient which means that it doesn’t need a whole lot of data to learn what it needs to learn. If we were to draw it out as a Bayes’ net: Y → F1 Y → F2 ... Y → Fn Where Y is the label and F1 , F2 , . . . , Fn are the features. The model is simply: P (Y |F1 , . . . , Fn ) ∝ P (Y ) ∏ P (Fi |Y ) i This just comes from the Bayes’ net described above. The Naive Bayes learns P (Y, f1 , f2 , . . . , fn ) which we can normalize (divide by P (f1 , . . . , fn )) to get the conditional probability P (Y |f1 , . . . , fn ): ∏ P (y1 , f1 , . . . , fn ) P (y1 ) i P (fi |y1 ) ∏ P (y2 , f1 , . . . , fn ) P (y2 ) i P (fi |y2 ) P (Y, f1 , . . . , fn ) = = .. .. . . P (yk , f1 , . . . , fn ) CHAPTER 16. BAYESIAN LEARNING P (yk ) ∏ i P (fi |yk ) 461 16.2. INFERENCE IN BAYESIAN MODELS 462 So the parameters of Naive Bayes are P (Y ) and P (Fi |Y ) for each feature. 16.2 Inference in Bayesian models 16.2.1 Maximum a posteriori (MAP) estimation A Bayesian alternative to MLE, we can estimate probabilities using maximum a posteriori estimation, where we instead choose a probability (a point estimate) that is most likely given the observed data: π̃MAP = argmax P (π|X) π = argmax π P (X|π)P (π) P (X) = argmax P (X|π)P (π) π P (y |X) ≈ P (y |π̃MAP ) So unlike MLE, MAP estimation uses Bayes’ Rule so the estimate can use prior knowledge (P (π)) about what we expect π to be. 
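As a concrete (toy) illustration of the difference between the MLE and the MAP point estimate, assuming a Bernoulli likelihood and a Beta prior (the prior pseudo-counts below are made up for illustration):

```python
import numpy as np

# Observations: 1 = heads, 0 = tails (three heads out of four flips).
X = np.array([1, 1, 1, 0])

# MLE: the pi that maximizes P(X | pi) under a Bernoulli model.
pi_mle = X.mean()

# MAP: add a Beta(a, b) prior on pi; the posterior is Beta(a + heads, b + tails)
# and its mode gives the MAP estimate: (a + heads - 1) / (a + b + n - 2).
a, b = 2.0, 2.0                     # assumed prior pseudo-counts
heads, n = X.sum(), len(X)
pi_map = (a + heads - 1) / (a + b + n - 2)

print("MLE:", pi_mle)               # 0.75  -- follows the data exactly
print("MAP:", pi_map)               # ~0.67 -- pulled toward the prior mean of 0.5
```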
Again, this may be done with log-likelihoods: θMAP = argmax p(θ|x) = argmax log p(x|θ) + log p(θ) θ θ 16.3 Maximum A Posteriori (MAP) Likelihood function L(θ) is the probability of the data D as a function of the parameters θ. This often has very small values so typically we work with the log-likelihood function instead: ℓ(θ) = log L(θ) The maximum likelihood criterion simply involves choosing the parameter θ to maximize ℓ(θ). This can (sometimes) be done analytically by computing the derivative and setting it to zero and yields the maximum likelihood estimate. MLE’s weakness is that if you have only a little training data, it can overfit. This problem is known as data sparsity. For example, you flip a coin twice and it happens to land on heads both times. Your maximum likelihood estimate for θ (probability that the coin lands on heads) would be 1! We can then try to generalize this estimate to another dataset and test it by measuring the log-likelihood on the test set. If a tails shows up at all in the test set, we will have a test log-likelihood of −∞. 462 CHAPTER 16. BAYESIAN LEARNING 463 16.3. MAXIMUM A POSTERIORI (MAP) We can instead use Bayesian techniques for parameter estimation. In Bayesian parameter estimation, we treat the parameters θ as a random variable as well, so we learn a joint distribution p(θ, D). We first require a prior distribution p(θ) and the likelihood p(D|θ) (as with maximum likelihood). We want to compute p(θ|D), which is accomplished using Bayes’ rule: p(θ|D) = ∫ p(θ)p(D|θ) p(θ′ )p(D|θ′ )dθ′ Though we work with only the numerator for as long as possible (i.e. we delay normalization until it’s necessary): p(θ|D) ∝ p(θ)p(D|θ) The more data we observe, the less uncertainty there is around the parameter, and the likelihood term comes to dominate the prior - we say that the data overwhelm the prior. We also have the posterior predictive distribution p(D′ |D), which is the distribution over future observables given past observations. This is computed by computing the posterior over θ and then marginalizing out θ: p(D′ |D) = ∫ p(θ|D)p(D′ |θ)dθ The normalization step is often the most difficult, since we must compute an integral over potentially many, many parameters. We can instead formulate Bayesian learning as an optimization problem, allowing us to avoid this integral. In particular, we can use maximum a-posteriori (MAP) approximation. Whereas with the previous Bayesian approach (the “full Bayesian” approach) we learn a distribution over θ, with MAP approximation we simply get a point estimate (that is, a single value rather than a full distribution). In particular, we get the parameters that are most likely under the posterior: θ̂MAP = argmax p(θ|D) θ = argmax p(θ, D) θ = argmax p(θ)p(D|θ) θ = argmax log p(θ) + log p(D|θ) θ Maximizing log p(D|θ) is equivalent to MLE, but now we have an additional prior term log p(θ). This prior term functions somewhat like a regularizer. In fact, if p(θ) is a Gaussian distribution centered at 0, we have L2 regularization. CHAPTER 16. BAYESIAN LEARNING 463 16.3. MAXIMUM A POSTERIORI (MAP) 16.3.1 464 Markov Chain Monte Carlo (MCMC) Motivation With MLE and MAP estimation we get only a single value for π, and this collapsing into a single value loses information - what if we instead considered the entire distribution of values for π, i.e. P (π|X)? As it stands, with MLE and MAP we only get an approximation of P (y |X). 
But with the distribution P (π|X) we could directly compute its expected value: E[P (y |X)] = ∫ P (y |π)P (π|X)dπ And with Bayes’ Rule we have: P (π|X) = P (X|π)P (π) P (X|π)P (π) =∫ P (X) π P (X|π)P (π)dπ So we have two integrals here, and unfortunately integrals can be hard (sometimes impossible) to compute. With MCMC we can get the values we need without needing to calculating the integrals. Monte Carlo methods Monte Carlo methods are algorithms which perform probabilistic simulations to give you some value. For example: Say you have a square and a circle inscribed within it, so that they are co-centric and the circle’s diameter is equal to the length of a side of the square. You take some rice and uniformly scatter it in the shapes at random. You can count the total number of grains of rice in the circle (C) and do the same for rice in the square (S). The ratio CS approximates the ratio of the area of the circle to the area of the square. The area of the circle and for the square can be thought of as integrals (adding an infinite number of infinitesimally small points), so what you have effectively done is approximate the value of integrals. MCMC In the example the samples were uniformly distributed, but in practice they can be drawn from other distributions. If we collect enough samples from the distribution we can compute pretty much anything we would want to know about the distribution - mean, standard deviation, etc. For example, we can compute the expected value: 464 CHAPTER 16. BAYESIAN LEARNING 465 16.3. MAXIMUM A POSTERIORI (MAP) N 1 ∑ f (z (t) ) N→∞ N t=1 E[f (z)] = lim Since we don’t sample infinite points, we sample as many as we can for an approximation: E[f (z)] ≈ T 1∑ f (z (t) ) T t=1 How exactly then is the sampling of z (0) , . . . , z (T ) according to a given distribution accomplished? We treat the sampling process as a walk around a sample space and the walk proceeds as a Markov chain; that is, the choice of the next sample depends on only the current state, based on a transition probability Ptrans (z (t+1) |z (t) ). So the general walking algorithm is: • Randomly initialize z (0) • for t = 1 to T do: – z (t+1) := g(z (t) ) Where g is just a function which returns the next sample based on Ptrans and the current sample. Gibbs Sampling Gibbs sampling is an MCMC algorithm, where z is a point/vector [z1 , . . . , zk ] and k > 1. So here the samples are vectors of at least two terms. You don’t select an entire sample at once, what you do is make a separate probabilistic choice for each dimension, where the choice is dependent on the other k − 1 dimensions, using the newest values for each. For example, say k = 3 so you have vectors in the form [z1 , z2 , z3 ]. • First you pick a new value z1(t+1) based on z2(t) and z3(t) . • Then you pick a new value z2(t+1) based on z1(t+1) and z3(t) . • Then you pick a new value z3(t+1) based on z1(t+1) and z2(t+1) . Gibbs Sampling (more) Now that we have the generative model, we can use it to calculate the probability of some set of group assignments for our data points. But how do we learn what a good set of group assignments is? We can use Gibbs Sampling, that is: • Take the set of data points, and randomly initialize group assignments. CHAPTER 16. BAYESIAN LEARNING 465 16.4. NONPARAMETRIC MODELS 466 • Pick a point. 
Fix the group assignments of all the other points, and assign the chosen point a new group (which can be either an existing cluster or a new cluster) with a CRP-ish probability (as described in the models above) that depends on the group assignments and values of all the other points. • We will eventually converge on a good set of group assignments, so repeat the previous step until happy. 16.4 Nonparametric models First: a parametric model is one in which the capacity is fixed and does not increase with the amount of training data. For example, a linear classifier, a neural network with fixed number of hidden units, etc. The amount of parameters is finite, and the particular amount is determined before any data is observed (e.g. with linear regression, we decide the number of parameters that will be used, rather than learning it from the data). Another way of thinking of it is: a parametric model tries to come up with some function from the data, then the data is thrown out. You use that learned function in place of the data for future predictions. A nonparametric model doesn’t throw out the data, it keeps it around for later predictions; as a result, as more data becomes available, you don’t need to create an updated model like you would with the parametric approach. 16.4.1 What is a nonparametric model? • counterintuitively, it does not mean a model without parameters. Rather, it means a model with a very large number of parameters (e.g. infinite). Here, “nonparametric” refers more to “not a parametric model”, not “without parameters”. • could also be defined as a parametric model where the number of parameters increases with the data, instead of fixing the number of parameters (that is, the number of things we can learn) as is the case with parametric models. I.e. the capacity of the model increases with the amount of training data. • can also be defined as a family of distributions that is dense in some large space relevant to the problem at hand. – For example, with a regression problem, the space of possible solutions may be all continuous functions, which is infinite-dimensional (if you have infinite cardinality). A nonparametric model can span this infinite space. To expand and visualize the last point, consider the regression problem example. This is the space of continuous functions, where f ∗ is the function we are looking for. With a parametric model, we have a finite number of parameters, so we can only cover a fraction of this space (the square). 466 CHAPTER 16. BAYESIAN LEARNING 467 16.4. NONPARAMETRIC MODELS Space of continuous functions Space of continuous functions w/ parametric model Space of continuous functions w/ nonparametric model CHAPTER 16. BAYESIAN LEARNING 467 16.4. NONPARAMETRIC MODELS 468 However, with a nonparametric model, we can have infinite parameters and cover the entire space. We apply some assumptions, e.g. favoring simpler functions over complex ones, so we can apply a prior to the space which assigns more mass to simpler functions (the darker parts in the accompanying figure). But every part of the space still has some mass. It is possible to create a nonparametric model by nesting a parametric learning algorithm inside another parametric learning algorithm. The outer learning algorithm learns the number of parameters, whereas the inner learning algorithm performs as it normally would (learning the parameters themselves). 
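A rough sketch of that nesting idea (an illustration, not a formal construction): the outer loop chooses the polynomial degree, i.e. the number of parameters, on held-out data, while the inner least-squares fit learns the parameters themselves.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, 60)
y = np.sin(x) + 0.2 * rng.standard_normal(60)       # unknown "true" function

# Split into a training set (inner algorithm) and a validation set (outer algorithm).
x_tr, y_tr, x_va, y_va = x[:40], y[:40], x[40:], y[40:]

best_degree, best_err = None, np.inf
for degree in range(1, 11):
    coeffs = np.polyfit(x_tr, y_tr, degree)         # inner: fit the parameters
    err = np.mean((np.polyval(coeffs, x_va) - y_va) ** 2)
    if err < best_err:                               # outer: choose how many parameters
        best_degree, best_err = degree, err

print("chosen degree:", best_degree, "validation MSE:", best_err)
```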
16.4.2 An example

An example of a nonparametric model is nearest neighbor regression, in which we simply store the training set, then, for a given new point, identify the closest stored point and return its associated target value. That is:

ŷ = y_i, where i = argmin_i ||X_i − x||₂²

Another example is wrapping a parametric algorithm inside another parametric algorithm - where the number of parameters of the inner algorithm is a parameter that the outer parametric algorithm learns.

16.4.3 Parametric models vs nonparametric models

Parametric models are relatively rigid; once you choose a model, there are some limitations to what forms that model can take (i.e. how it can fit to the data), and the only real flexibility is in the parameters which can be adjusted. For instance, with linear regression, the model must take the form of y = β0 + β1 x1 + · · · + βn xn; you can only adjust the βi values. If the "true" model does not take this form, we probably won't be able to estimate it well because the model we chose fundamentally does not conform to it.

Nonparametric models, on the other hand, offer greater freedom of fit. As an example, a histogram can be considered a nonparametric representation of a probability density - it "lets the data speak for itself", so to speak (you may hear nonparametric models described in this way). The density that forms in the histogram is determined directly by the data. You don't make any assumptions about what the distribution is beforehand - e.g. you don't have to say, "I think this might be a normal distribution", and then try to force the normal probability density function onto the data.

"Nonparametric" does not actually mean there are no parameters; it is better described as not having a fixed set of parameters.

16.4.4 Why use a Bayesian nonparametric approach?

1. Model selection
• e.g. clustering - you have to specify the number of clusters. Too many and you overfit, too few and you underfit.
• with a Bayesian approach you are not doing any optimizing (such as finding a maximum likelihood), you are just computing a posterior distribution. So there is no "fitting" happening, so you cannot overfit.
• If you have a large model or one which grows with the amount of data, you can avoid underfitting too.
• (of course, you can still specify an incorrect model and get poor performance)

2. Useful properties of Bayesian nonparametric models
• Exchangeability - you can permute your data without affecting learning (i.e. the order of your data doesn't matter)
• Can model Zipf, Heaps, and other power laws
• Flexible ways of building complex models from simpler parts

Nonparametric models still make modeling assumptions; they are just less constrained than most parametric models. There are also semiparametric models, which are nonparametric in some ways and parametric in others.

16.5 The Dirichlet Process

The Dirichlet process is "the cornerstone of Bayesian nonparametrics". It is a stochastic process - a model over an infinite collection of random variables. There are a few ways to think about Dirichlet processes:

• the infinite limit of a Gibbs sampler for finite mixture models
• the Chinese Restaurant Process
• the stick-breaking construction

16.5.1 Dirichlet Distribution

The Dirichlet distribution is a probability distribution over all possible multinomial distributions. For example, say we have some data which we want to classify into three classes A, B, C.
Maybe the data has 0.25 probability of being in class A, 0.5 probability of being in B, and 0.25 of being in C. Or maybe it has 0.1 probability of being in class A, then 0.6 and 0.3 for B and C respectively. Or it could be another distribution - we don't know. The Dirichlet distribution is the probability distribution representing these possible multinomial distributions across our classes.

The Dirichlet distribution is formalized as:

P(p|α) = (Γ(∑_{k=0}^{K−1} α_k) / ∏_{k=0}^{K−1} Γ(α_k)) ∏_{k=0}^{K−1} p_k^{α_k − 1}

where:
• p = a multinomial distribution
• α = the parameters of the Dirichlet (a K-dimensional vector)
• K = the number of categories

Note that the term Γ(∑_{k=0}^{K−1} α_k) / ∏_{k=0}^{K−1} Γ(α_k) is just a normalizing constant so that we get a distribution. So if you're just comparing ratios of these distributions you can ignore it.

You begin with some prior which can be derived from other data or from domain knowledge or intuition. As more data comes in, we update the Dirichlet (i.e. with Bayesian updates):

P(p|data) = P(data|p)P(p|α) / P(data)

This can be done simply by updating the column in α which corresponds to a new data point, e.g. if we have three classes and α = [2, 4, 1] and we encounter a new data point which belongs to the class corresponding to α_1, we just add one to that column in α, so it becomes [2, 5, 1].

Entropy

Also known as information content, energy, log likelihood, or −ln(p). It can be thought of as the amount of "surprise" for an event. If an event is totally certain, it has zero entropy. A coin flip has some entropy since there are only two equally-probable possibilities. If you have a pair of dice, there is some entropy for rolling a 6 (because there are multiple combinations which can lead to 6) but much higher entropy for rolling a 12 (because there is only one combination which leads to a 12).

We can look at the entropy of the Dirichlet function:

E(p|α) = −ln(∏_{k=0}^{K−1} p_k^{α_k − 1}) = ∑_{k=0}^{K−1} (α_k − 1)(−ln(p_k))

We'll break out the entropy of a given multinomial distribution p into its own term: e_k = −ln(p_k)

Interpreting α

We can take the α vector and normalize it. The normalized α vector is the expected value of the Dirichlet, that is, it is its mean. The sum of the unnormalized α vector is the weight of the distribution, which can be thought of as its precision. In a normal distribution, the precision is 1/variance; a higher precision means a narrower normal distribution, which means that values are likely to be near the mean. A lower precision means a wider distribution in which points are less likely to be near the mean. So a Dirichlet with a higher weight means that the multinomial distribution is more likely to be close to the expected value.

Dirichlet distributions can be visualized on a simplex, which is a generalization of a triangle to arbitrary dimensions (e.g. in 2D it is the 2-simplex, a triangle; in 3D it is the 3-simplex, a pyramid; etc.). Some examples are below with their corresponding α vectors.

16.5.2 Finite Mixture Models

This is a continuation of the model-based clustering approach mentioned earlier. We want to learn, via inference, values for π, z_i, and θ_k*. We can use a form of MCMC sampling - Gibbs sampling. (to do: this is incomplete)

16.5.3 Chinese Restaurant Process

Partitions

Given a set S, a partition ϱ is a disjoint family of non-empty subsets (clusters) of S whose union is S.
So a partition is some configuration of clusters which encompasses the members of S. E.g. CHAPTER 16. BAYESIAN LEARNING 471 16.5. THE DIRICHLET PROCESS 472 Some Dirichlet simplexes with their α vectors S = {A, B, C, D, E, F } ϱ = {{A, D}, {B, C, E}, {F }} The set of all partitions of S is denoted PS . Random partitions are random variables taking value in PS . The Chinese Restaurant Process (CRP) The CRP is an example of random partitions and involves a sequence of customers coming into a restaurant. Each customer decides whether or not to sit at a new (empty) table or join a table with other customers. The customers are sociable so prefer to join tables with more customers, but there is still some probability that they will sit at a new table: P (sit at new table) = P (sit at table c) = α α+ ∑ c∈ϱ nc nc α+ ∑ c∈ϱ nc Where nc is the number of customers at a table c and α is a parameter. Here the customers correspond to members of the set S, and tables are the clusters in a partition ϱ of S. 472 CHAPTER 16. BAYESIAN LEARNING 473 16.6. INFINITE MIXTURE MODELS AND THE DIRICHLET PROCESS This process has a rich-get-richer property, in that large clusters are more likely to attract more customers, thus growing larger, and so on. If you multiply all the conditional probabilities together, the overall probability of the partition ϱ, called the exchangeable partition probability function (EPPF), is: P (ϱ|α) = α|ϱ| Γ(α) ∏ Γ(|c|) Γ(n + α) c∈ϱ This probability ends up not depending on the sequence in which customers arrive - so this is an exchangeable random partition. The α parameter affects the number of clusters in the partition - the larger the α, the more clusters we expect to see. Model-based Clustering with the Chinese Restaurant Process Given a dataset S, we want to partition it into clusters of similar items. Each cluster c ∈ ϱ is described by a model F (θc∗ ), for example a Gaussian, parameterized by θc∗ . We model each item in each cluster as drawn from that cluster’s model. We are going to use a Bayesian approach, so we introduce a prior over ϱ and θc∗ and the compute posteriors over both. We use a CRP mixture model; that is we use a Chinese Restaurant Process for the prior over ϱ and an independent and identically distributed (iid) prior H over the cluster parameters θc∗ . So the CRP mixture model in more detail: • ϱ ∼ CRP (α) • θc∗ |ϱ ∼ H for c ∈ ϱ • xi |θc∗ , ϱ ∼ F (θc∗ ) for c ∈ ϱ with i ∈ c 16.6 Infinite Mixture Models and the Dirichlet Process (this is basically a paraphrasing of this post by Edwin Chen.) Many clustering methods require the specification of a fixed number of clusters. However, in realworld problems there may be an infinite number of possible clusters - in the case of food there may be Italian or Chinese or fast-food or vegetarian food and so on. Nonparametric Bayesian methods allow parameters to change with the data; e.g. as we get more data we can let the number of clusters grow. Say we have some data, where each data point is some vector. We can view our data from a generative perspective: we can assume that the true clusters in the data are each defined by some model with some parameters, such as Gaussians with µi and σi parameters. CHAPTER 16. BAYESIAN LEARNING 473 16.6. INFINITE MIXTURE MODELS AND THE DIRICHLET PROCESS 474 We further assume that these parameters themselves come from a distribution G0 Then we assume the data is generated by selecting a cluster, then taking a sample from that cluster. Ok, how then do we assign the data points to groups? 
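One answer, described above, is the CRP. Before it is formalized below, here is a minimal simulation of the seating process using the probabilities given earlier (the function and its defaults are only illustrative); note how a larger α tends to yield more tables, i.e. more clusters.

```python
import numpy as np

def crp(n_customers, alpha, rng=np.random.default_rng(0)):
    """Simulate table assignments under a Chinese Restaurant Process."""
    tables = []        # tables[c] = number of customers currently at table c
    assignments = []
    for _ in range(n_customers):
        total = alpha + sum(tables)
        # probability of each existing table, plus a new table last
        probs = [n_c / total for n_c in tables] + [alpha / total]
        choice = rng.choice(len(probs), p=probs)
        if choice == len(tables):
            tables.append(1)       # sit at a new table
        else:
            tables[choice] += 1    # join an existing table (rich get richer)
        assignments.append(choice)
    return assignments, tables

assignments, tables = crp(100, alpha=2.0)
print("number of clusters:", len(tables), "sizes:", tables)
```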
16.6.1 Chinese Restaurant Process

(see explanation above)

(As a side note, the Indian Buffet Process is an extension of the CRP in which customers can sample food from multiple tables, that is, they can belong to multiple clusters.)

More formally:

• Generate table assignments g1, . . . , gn ∼ CRP(α), that is, according to a Chinese Restaurant Process. gi is the table assigned to datapoint i.
• Generate table parameters ϕ1, . . . , ϕm ∼ G0 according to the base distribution G0, where ϕk is the parameter for the kth distinct group.
• Given the table assignments and table parameters, generate each datapoint pi ∼ F(ϕgi) from a distribution F with the specified table parameters. For example, F could be a Gaussian and each ϕk might be a vector specifying a mean and standard deviation.

16.6.2 Polya Urn Model

Basically the same as the Chinese Restaurant Process, except that while the CRP specifies a distribution over partitions (see above), the Polya Urn model does that and also assigns parameters to each group.

Say we have an urn containing αG0(x) balls of some color x for each possible value of x. G0 is our base distribution and G0(x) is the probability of sampling x from G0. Then we iteratively pick a ball at random from the urn, place it back, and also place an additional new ball of the same color as the one we drew. As α increases (that is, we draw more new ball colors from the base distribution, which is the same as placing more weight on our prior), the colors in the urn tend towards the base distribution.

More formally:

• Generate colors ϕ1, . . . , ϕn ∼ Polya(G0, α), that is, according to a Polya Urn Model. ϕi is the color of the ith ball.
• Given the ball colors, generate each datapoint pi ∼ F(ϕi) (where F is used in the same way as in the Chinese Restaurant Process above).

16.6.3 Stick-Breaking Process

The stick-breaking process is also very similar to the CRP and the Polya Urn model.

We start with a "stick" of length one, then generate a random variable β1 ∼ Beta(1, α). Since we're drawing from the Beta distribution, β1 will be a real number between 0 and 1 with expected value 1/(1 + α). Then break off the stick at β1. We define w1 to be the length of the piece on the left. Then we take the remaining piece on the right and generate β2 ∼ Beta(1, α), break off that piece at β2, set w2 to be the length of the piece we break off, and so on.

Here α again functions as a dispersion parameter; when it is low there are fewer, denser clusters, when it is high, there are more clusters.

More formally:

• Generate group probabilities (stick lengths) w1, . . . , w∞ ∼ Stick(α), that is, according to a Stick-Breaking process.
• Generate group parameters ϕ1, . . . , ϕ∞ ∼ G0, where ϕk is the parameter for the kth distinct group.
• Generate group assignments g1, . . . , gn ∼ Multinomial(w1, . . . , w∞) for each datapoint.
• Given group assignments and group parameters, generate each datapoint pi ∼ F(ϕgi) (where F is used in the same way as in the Chinese Restaurant Process above).

16.6.4 Dirichlet Process

The CRP, Polya Urn Model, and Stick-Breaking Process are all connected to the Dirichlet Process.

Suppose we have a Dirichlet process DP(G0, α) where G0 is the base distribution and α is the dispersion parameter. Say we want to sample xi ∼ G, where G is a distribution sampled from our Dirichlet Process, G ∼ DP(G0, α).
We could generate these xi values by taking a Polya Urn Model with color distribution G0 and dispersion α - then xi could be the color of the i th ball in the urn. Or we could generate these xi by assigning customers to tables via a CRP with dispersion α. Then all the customers for a table is given the same value (e.g. color) sampled from G0 . xi is the value/color given to the i th customer; here xi can be thought of as the parameters for table i . Or we could generate weights wk via a Stick-Breaking Process with dispersion α. Then we give each weight wk a value/color vk sampled from G0 . We assign xi to vk with probability wk . More formally: • Generate a distribution G ∼ DP (G0 , α) from a Dirichlet process with base distribution G0 and a dispersion parameter α. • Generate group-level parameters xi ∼ G where xi is the group parameter for the i th datapoint. Note that xi is not the same as ϕi ; xi is the parameter associated to the group that the i th data point belongs to whereas ϕk is the parameter of the kth distinct group. • Given group-level parameters xi , generate each datapoint pi ∼ F (xi ) (where we are using F is a way like in the Chinese Restaurant Process above). CHAPTER 16. BAYESIAN LEARNING 475 16.7. MODEL SELECTION 476 16.7 Model selection 16.7.1 Model fitting vs Model selection Model fitting is just about fitting a particular model to data, e.g. minimizing error against it. Say we use high-degree polynomial as our model (i.e. use more than one predictor variable). The resulting fit model might not actually be appropriate for the data - it may overfit it, for instance, or be overly complex. Now way we fit a straight line (i.e. use just one predictor variable). We might find that the straight line is a better model for the data. The process of choosing a between these models is called model selection. So we need some way of quantifying the quality of models in order to compare them. A naive approach is to use the likelihood (the product of the probabilities of each datapoint), or more commonly, the log-likelihood (the sum of the log probabilities of each datapoint) and then select the model with the greatest likelihood (this is the maximum likelihood approach). This method is problematic, however, because more complicated (higher-degree) polynomial models will always have a higher likelihood, though they are not necessarily better in the sense that we mean (they overfit the data). More complex model, greater data likelihood source 16.7.2 Model fitting Say you have datapoints x1 , . . . , xn and errors for those datapoints e1 , . . . , en . Say there is some true value for x, we’ll call it xtrue , that we want to learn. A frequentist approach assumes this true value is fixed and that the data is random. So in this case, we consider the distribution P (xi , ei |xtrue ) and want to identify a point estimate - that is, a single value - for xtrue . This distribution tells us the probability of a point xi with its error ei . For instance, if we assume that x is normally distributed: 476 CHAPTER 16. BAYESIAN LEARNING 477 16.7. MODEL SELECTION 1 −(xi − xtrue )2 exp( ) P (xi , ei |xtrue ) = √ 2ei2 2πei2 Then we can consider the likelihood of the data overall by taking the product of the probabilities of each individual datapoint: L(X, E) = n ∏ P (xi , ei |xtrue ) i=1 Though typically we work with the log likelihood to avoid underflow errors: log L(X, E) = n 1∑ (xi − xtrue )2 (log(2πei2 ) + ) 2 i=1 ei2 A common frequentist approach to fitting a model is to use maximum likelihood. 
That is, find an estimate for xtrue which maximizes this log likelihood: argmax log L xtrue Equivalently, we could minimize the loss (e.g. the squared error). For simple cases, we can compute the maximum likelihood estimate analytically, by solving d log L dxtrue =0 When all the errors ei are equal, this ends up reducing to: xtrue = n 1∑ xi n i=1 That is, the mean of the datapoints. For more complex situations, we instead use numerical optimization (i.e. we approximate the estimate). The Bayesian approach instead involves looking at P (xtrue |xi , ei ), that is, we look at a probability distribution for the unknown value based on fixed data. We aren’t looking for a point estimate (a single value) any more, but rather describe xtrue as a probability distribution. If we do want a point estimate (often you have to have a concrete value to work with), we can take the expected value from the distribution. P (xtrue |xi , ei ) is computed: P (xtrue |xi , ei ) = P (xi , ei |xtrue )P (xtrue ) P (xi , ei ) Which is to say, it is the posterior distribution. For simple cases, the posterior can be computed analytically, but more often you will need Markov Chain Monte Carlo to approximate it. CHAPTER 16. BAYESIAN LEARNING 477 16.7. MODEL SELECTION 16.7.3 478 Model Selection Just as model fitting differs between frequentist and Bayesian approaches, so does model selection. Frequentists compare model likelihood, e.g., for two models M1 , M2 , they would compare P (D|M1 ), P (D|M2 ). Bayesians compare the model posterior, e.g. P (M1 |D), P (M2 |D). The parameters are left out in both cases here since we aren’t concerned with how good the fit of the model is, but rather, how appropriate the model itself is as a “type” of model. We can use Bayes theorem to turn the posterior into something we can compute: P (M | D) = P (D | M) P (M) P (D) Using conditional probability, we know that P (D | M) can be computed as the integral over the parameter space of the likelihood: P (D | M) = ∫ Ω P (D | θ, M)P (θ | M)dθ Computing P (D) - the probability of seeing your data at all - is really hard, impossible even. But we can avoid dealing with it by comparing P (M1 | D) and P (M2 | D) as an odds ratio: O21 ≡ P (M2 | D) P (D | M2 ) P (M2 ) = P (M1 | D) P (D | M1 ) P (M1 ) 2) We still have to deal with PP (M (M1 ) , which is known as the prior odds ratio (because P (M1 ), P (M2 ) are priors). This ratio is assumed to equal 1 if there’s no reason to believe or no prior evidence that one model will do better than the other. | M2 ) The remaining ratio PP (D (D | M1 ) is known as the Bayes factor and is the most important part here. The integrals needed to compute the Bayes factor can be approximated using MCMC. 16.7.4 Model averaging We aren’t required to choose just one model - rather, with Bayesian model averaging we can combine as many as we’d like. The basic approach is to define a prior over our models, compute a posterior over the models given the data, and then combine the outputs of the models as a weighted average, using models’ posterior probabilities as weights. 478 CHAPTER 16. BAYESIAN LEARNING 479 16.8. REFERENCES 16.8 References • Review of fundamentals, IFT725. Hugo Larochelle. 2012. • Gibbs Sampling for the Uninitiated. Philip Resnik, Eric Hardisty. June 2010. • CS188: Artificial Intelligence. Dan Klein, Pieter Abbeel. University of California, Berkeley (edX). • Bayesian Machine Learning. Roger Grosse. • Kernel Density Estimation and Kernel Regression. Justin Esarey. • Deep Learning. 
Yoshua Bengio, Ian Goodfellow, Aaron Courville. • Frequentism and Bayesianism V: Model Selection. Jake Vanderplas. • Bayesian Nonparameterics 1. Machine Learning Summer School 2013. Max Planck Institute for Intelligent Systems, Tübingen. Yee Whye Teh. • Lecture 15: Learning probabilistic models. Roger Grosse, Nitish Srivastava. • Infinite Mixture Models with Nonparametric Bayes and the Dirichlet Process. Edwin Chen. CHAPTER 16. BAYESIAN LEARNING 479 16.8. REFERENCES 480 480 CHAPTER 16. BAYESIAN LEARNING 481 17 NLP 17.1 Problems Some (higher-level) problems that fall under NLP include: • • • • • machine translation (structured) information extraction summarization natural language interfaces speech recognition At a lower-level, these include the following problems: • • • • • part-of-speech tagging parsing word-sense disambiguation named entity recognition etc 17.2 Challenges Ambiguity is one of the greatest challenges to NLP: For example: Fed raises interest rates, where “raises” is the verb, and “Fed” is the noun phrase Fed raises interest rates, where “interest” is the verb, and “Fed raises” is the noun phrase CHAPTER 17. NLP 481 17.3. TERMINOLOGY 482 This ambiguity occurs at many levels: • the acoustic level: e.g. mixing up similar-sounding words • the syntactic level: e.g. multiple plausible grammatical parsings of a sentence • the semantic level: e.g. some words can mean multiple things (“bank” as in a river or a financial institution); this is called word sense ambiguity • the discourse (multi-clause) level: e.g. unclear what a pronoun is referring to Other challenges include: • non-standard english: for instance, text shorthand, phrases such as “SOOO PROUD” as opposed to “so proud”, or hashtags, etc • segmentation issues: [the] [New] [York-New] [Haven] [Railroad] vs. [the] [New York]-[New Haven] [Railroad] • idioms (e.g. “get cold feet”, doesn’t literally mean what it says) • neologisms (e.g. “unfriend”, “retweet”, “bromance”) • world knowledge (e.g. “Mary and Sue are sisters” vs “Mary and Sue are mothers.”) • tricky entity names: “Where is A Bug’s Life playing”, or “a mutation on the for gene” The typical approach is to codify knowledge about language & knowledge about the world and find some way to combine them to build probabilistic models. 17.3 Terminology • synset: a synset is a set of synonyms that represent a single sense of a word. • wordform: the full inflected surface form: e.g. “cat” and “cats” are different wordforms. • lemma: the same stem, part of speech, rough word sense; e.g. “cat” and “cats” are the same lemma. – One lemma can have many meanings. For example: a bank can hold investments… agriculture on the east bank… • sense: a discrete representation of an aspect of a word’s meaning. The usages of bank in the previous example have a different sense. • homonyms: words that share form but have unrelated, distinct meanings (such as “bank”). – Homographs: bank/bank, bat/bat – Homophones: write/right, piece/peace • polysemy: – A polysemous word has related meanings, for example: * “the bank was built in 1875 (”bank” = a building belonging to a financial institution)” 482 CHAPTER 17. NLP 483 17.4. DATA PREPARATION * “I withdrew money from the bank (”bank” = a financial institution)” – Systematic polysemy, or metonymy, is when the meanings have a systematic relationship. – For example, “school”, “university”, “hospital” - all can mean the institution or the building, so the systematic relationship here is building <=> organization. 
– Another example is author <=> works of author, e.g. “Jane Austen wrote Emma” and “I love Jane Austen”. • synonyms: different words that have the same propositional meaning in some or all contexts. However, there may be no examples of perfect synonymy since even if propositional meaning is identical, they may vary in notions of politeness or other usages and so on. – For example, “water” and “H2O” - each are more appropriate in different contexts. – As another example, “big” and “large” - sometimes they can be swapped, sometimes they cannot: That’s a big plane. How large is that plane? (Acceptable) Miss Nelson became kind of a big sister to Benjamin. Miss Nelson became kind of a large sister to Benjamin (Not as acceptable) The latter works less because “big” has multiple senses, one of which does not correspond to “large”. • antonyms: -senses which are opposite with respect to one feature of meaning, but otherwise are similar, such as dark/light, short/fast, etc. • hyponym: one sense is a hyponym of another if the first sense is more specific (i.e. denotes a subclass of the other). – car is a hyponym of vehicle – mango is a hyponym of fruit • hypernym/superordinate: – vehical is a hypernym of car – fruit is a hypernym of mango • token: an instance of that type in running text; N = number of tokens, i.e. counting every word in the sentence, regardless of uniqueness. • type: an element of the vocabulary; V = vocabulary = set of types (|V | = the size of the vocabulary), i.e. counting every unique word in the sentence. 17.4 Data preparation 17.4.1 Sentence segmentation “!”, “?” are pretty reliable indicators that we’ve reached the end of a sentence. Periods can mean the end of the sentence or an abbreviation (e.g. Inc. or Dr.) or numbers (e.g. 4.3). CHAPTER 17. NLP 483 17.4. DATA PREPARATION 17.4.2 484 Tokenization Tokenization is the process of breaking up text into discrete units for analysis - this is typically into words or phrases. The best approach for tokenization varies widely depending on the particular problem and language. German, for example, has many long compound words which you may want to split up. Chinese has no spaces (no easy way for word segmentation), Japanese has no spaces and multiple alphabets. 17.4.3 Normalization Once you have your tokens you need to determine how to normalize them. For example, “USA” and “U.S.A.” could be collapsed into a single token. But about “Windows”, “window”, and “windows”? Some common approaches include: • case folding - reducing all letters to lower case (but sometimes case may be informative) • lemmatization - reduce inflections or variant forms to base form. • stemming = reducing terms of their stems; a crude chopping of affixes; a simplified version of lemmatization. The Porter stemmer is the most common English stemmer. 17.4.4 Term Frequency-Inverse Document Frequency (tf-idf) Weighting Using straight word counts may not be the best approach in many cases. Rare terms are typically more informative than frequent terms, so we want to bias our numerical representations of tokens to give rarer words higher weights. We do this via inverse document frequency weighting (idf): i dft = log( N ) dft For a term t which appears in df documents (dft = document frequency for t). log is used here to “dampen” the effect of idf. 
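As a small illustration of these weights (a sketch over a made-up three-document corpus, combined with the sublinear term-frequency weighting described next):

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]
tokenized = [d.split() for d in docs]
N = len(tokenized)

# document frequency: the number of documents containing each term
df = Counter(t for doc in tokenized for t in set(doc))

# idf_t = log(N / df_t)  -- rarer terms get larger weights
idf = {t: math.log(N / df_t) for t, df_t in df.items()}

# tf-idf weight of a term in one document: (1 + log tf) * idf
def tfidf(doc_tokens):
    tf = Counter(doc_tokens)
    return {t: (1 + math.log(c)) * idf[t] for t, c in tf.items()}

print(sorted(idf.items(), key=lambda kv: -kv[1])[:3])  # most informative terms
print(tfidf(tokenized[0]))
```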
This can be combined with t’s term frequency tfd for a particular document d to produce tf-idf weighting, which is the best known weighting scheme for text information retrieval: wt,d = (1 + log tft,d ) × log( 17.4.5 n ) dft The Vector Space Model (VSM) This representation of text data - that is, some kind of numerical feature for each word, such as the tf-idf weight and frequency, defines a |V |-dimensional vector space (where V is the vocabulary size). 484 CHAPTER 17. NLP 485 • • • • 17.5. MEASURING SIMILARITY BETWEEN TEXT terms are the axes of space documents are points (vectors) in this space this space is very high-dimensional when dealing with large vocabularies these vectors are very sparse - most entries are zero 17.4.6 Normalizing vectors This is a different kind of normalization than the previously mentioned one, which was about normalizing the language. Here, we are normalizing vectors in a more mathematical sense. Vectors can be length-normalized by dividing each of its components by its length. We can use the L2 norm, which makes it a unit vector (“unit” means it is of length 1): ||⃗ x ||2 = √∑ xi2 i This means that if we have, for example, a document and copy of that document with every word doubled, length normalization causes each to have identical vectors (without normalization, the copy would have been twice as long). 17.5 Measuring similarity between text 17.5.1 Minimum edit distance The minimum edit distance between two strings is the minimum number of editing operations (insertion/deletion/substitution) needed to transform one into the other. Each editing operation has a cost of 1, although in Levenshtein minimum edit distance substitutions cost 2 because they are composed of a deletion and an insertion. 17.5.2 Jaccard coefficient The Jaccard coefficient is a commonly-used measure of overlap for two sets A and B. jaccar d(A, B) = |A ∩ B| |A ∪ B| A set has a Jaccard coefficient of 1 against itself: jaccar d(A, A) = 1. If A and B have no overlapping elements, jaccar d(A, B) = 0. The Jaccard coefficient does not consider term frequency, just set membership. CHAPTER 17. NLP 485 17.5. MEASURING SIMILARITY BETWEEN TEXT 17.5.3 486 Euclidean Distance Using the vector space model above, the similarity between two documents can be measured by the euclidean distance between their two vectors. However, euclidean distance can be problematic since longer vectors have greater distance. For instance, there could be one document vector, a, and another document vector b which is just a scalar multiple of the first document. Intuitively they may be more similar since they lie along the same line. But by euclidean distance, c is closer to a. Euclidean distances 17.5.4 Cosine similarity In cases like the euclidean distance example above, using angles between vectors can be a better metric for similarity. For length-normalized vectors, cosine similarity is just their dot product: ⃗ = q⃗ · d⃗ = cos(⃗ q , d) |V | ∑ qi di i=1 Where q and d are length-normalized vectors and qi is the tf-idf weight of term i in document q and di is the tf-idf weight of term i in document d. 486 CHAPTER 17. NLP 487 17.6. (PROBABILISTIC) LANGUAGE MODELS 17.6 (Probabilistic) Language Models The approach of probabilistic language models involves generating some probabilistic understanding of language - what is likely or unlikely. For example, given sentence A and sentence B, we want to be able to say whether or not sentence A is more probable sentence than sentence B. We have some finite vocabulary V . 
There is an infinite set of strings (“sentences”) that can be produced from V , notated V † (these strings have zero or more words from V , ending with the STOP symbol). These sentences may make sense, or they may not (e.g. they might be grammatically incorrect). Say we have a training sample of N example sentences in English. We want to learn a probability distribution p over the possible set of sentences V † ; that is, p is a function that satisfies: ∑ p(x) = 1, p(x) ≥ 0for allx ∈ V † x∈V † The goal is for likely English sentences (i.e. “correct” sentences) to be more probable than nonsensical sentences. These probabilistic models have applications in many areas: • Machine translation: P (high winds tonight) > P (large winds tonight). • Spelling correction: P (about fifteen minutes from) > P (about fifteen minuets from). • Speech recognition: P (I saw a van) > P (eyes awe of an). So generally you are asking: what is the probability of this given sequence of words? 17.6.1 A naive method For any sentence x1 , . . . , xn , we notate the count of that sentence in the training corpus as c(x1 , . . . , xn ). Then we might simply say that: p(x1 , . . . , xn ) = c(x1 , . . . , xn ) N However, this method assigns 0 probability to sentences that are not in the training corpus, thus leaving many plausible sentences unaccounted for. 17.6.2 A less naive method You could use the chain rule here: CHAPTER 17. NLP 487 17.6. (PROBABILISTIC) LANGUAGE MODELS 488 P (the water is so transparent) = P (the) × P (water|the) × P (is|the water) ×P (so|the water is) × P (transparent|the water is so) Formally, the above would be expressed: P (w1 w2 . . . wn ) = ∏ P (wi |w1 w2 . . . wi−1 ) i Note that probabilities are usually done in log space to avoid underflow, which occurs if you’re multiplying many small probabilities together, and because then you can just add the probabilities, which is faster than multiplying: p1 × p2 × p3 = log p1 + log p2 + log p3 To make estimating these probabilities manageable, we use the Markov assumption and assume that a given word’s conditional probability only depends on the immediately preceding k words, not the entire preceding sequence (that is, that any random variable depends only on the previous random variable, and is conditionally independent of all the random variables before that): P (X1 = x1 ) n ∏ P (Xi = xi |X1 = x1 , . . . , Xi−1 = xi−1 ) = P (X1 = x1 ) i=2 n ∏ P (Xi = xi |Xi−1 = xi−1 ) i=2 That is, for any i ∈ 2 . . . n, for any x1 , . . . , xi : P (Xi = xi |X1 = x1 , . . . , Xi−1 = xi−1 ) = P (Xi = xi |Xi−1 = xi−1 ) In particular, this is the first-order Markov assumption; if it seems appropriate, we could instead use the second-order Markov assumption, where we instead assume that any random variable depends only on the previous two random variables: P (X1 = x1 , X2 = x2 , . . . , Xn = xn ) = P (X1 = x1 )P (X2 = x2 |X1 = x1 ) n ∏ P (Xi = xi |Xi−2 = xi−2 , Xi−1 = xi−1 i=3 Though this is usually condensed to: n ∏ P (Xi = xi |Xi−2 = xi−2 , Xi−1 = xi−1 ) i=1 488 CHAPTER 17. NLP 489 17.6. (PROBABILISTIC) LANGUAGE MODELS This can be extended to the third-order Markov assumption and so on. In the context of language models, we define x−1 , x0 as the special “start” symbol, ∗, indicating the start of a sentence. We also remove the assumption that n is fixed and instead consider it as a random variable. We can / V. 
just define Xn = STOP, where STOP is a special symbol, STOP ∈ 17.6.3 n-gram Models The unigram model treats each word as if they have an independent probability: ∏ P (w1 w2 . . . wn ) ≈ P (wi ) i The bigram model conditions on the previous word: P (w1 w2 . . . wi−1 ) ≈ ∏ P (wi |wi−1 ) i We estimate bigram probabilities using the maximum likelihood estimate (MLE): PMLE (wi |wi−1 ) = count(wi−1 , wi ) count(wi−1 ) Which is just the count of word i occuring after word i − 1 over all of the occurences of word i − 1. This can be extended to trigrams, 4-grams, 5-grams, etc. Though language has long-distance dependencies, i.e. the probability of a word can depend on another word much earlier in the sentence, n-grams work well in practice. Trigram Models With a trigram model, we have a parameter q(w |u, v ) for each trigram (sequence of three words) u, v , w such that w ∈ V ∪ {STOP} and u, v ∈ V ∪ {∗}. For any sentence x1 , . . . , xn , where xi ∈ V for i = 1 . . . (n − 1) and xn = STOP, the probability of the sentence under the trigram language model is: p(x1 , . . . , xn ) = ∏ i = 1n q(xi |xi−2 , xi−1 ) With x−1 , x0 as the special “start” symbol, ∗. (This is just a second-order Markov process) So then, how do we estimate the q(wi |wi−2 , wi−1 ) parameters? CHAPTER 17. NLP 489 17.6. (PROBABILISTIC) LANGUAGE MODELS 490 We could use the maximum likelihood estimate: Count(wi−2 , wi−1 , wi ) Count(wi−2 , wi−1 ) qML (wi |wi−2 , wi−1 ) = However, this still has the problem of assigning 0 probability to trigrams that were not encountered in the training corpus. There are also still many, many parameters to learn: if we have a vocabulary size N = |V |, then we have N 3 parameters in the model. Dealing with zeros Zeroes occur if some n-gram occurs in the testing data which didn’t occur in the training set. Say we had the following training set: … denied the reports … denied the claims … denied the request And the following test set: … denied the offer Here P (offer|denied the) = 0 since the model has not encountered that term. We can get around this using Laplace smoothing, also known as add-one smoothing): simply pretend that we saw each word once more than we actually did (i.e. add one to all counts). With add-one smoothing, our MLE becomes: PAdd−1 (wi |wi−1 ) = count(wi−1 , wi ) + 1 count(wi−1 ) + V Note that this smoothing can be very blunt and may drastically change your counts. Interpolation Above we defined the trigram maximum-likelihood estimate. We can do the same for bigram and unigram estimates: Count(wi−1 , wi ) Count(wi−1 ) Count(wi ) qML (wi ) = Count() qML (wi |wi−1 ) = 490 CHAPTER 17. NLP 491 17.6. (PROBABILISTIC) LANGUAGE MODELS These various estimates demonstrate the bias-variance trade-off - the trigram maximum-likelihood converges to a better estimate but requires a lot more data to do so; the unigram maximum-likelihood estimate converges to a worse estimate but does so with a lot less data. With linear interpolation, we try to combine the strengths and weaknesses of each of these estimates: q(wi |wi−2 , wi−1 ) = λ1 qML (wi |wi−2 , wi−1 ) + λ2 qML (wi |wi−1 ) + λ3 qML (wi ) Where λ1 + λ2 + λ3 = 1, λi ≥ 0∀i . That is, we compute a weighted average of the estimates. For a vocabulary V ′ = V ∪ {STOP}, ∑ w ∈V ′ q(w |u, v ) defines a distribution, since it sums to 1. How do we estimate the λ values? We can take out some our training data as validation data (say ~5%). 
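Stepping back to the smoothed estimates defined above, here is a minimal add-one-smoothed bigram model over a toy corpus (the sentences below are made up for illustration); note how a bigram never seen in training still receives nonzero probability.

```python
from collections import Counter

corpus = [
    "the agency denied the reports".split(),
    "the agency denied the claims".split(),
    "the agency denied the request".split(),
]

unigrams, bigrams = Counter(), Counter()
vocab = set()
for sent in corpus:
    tokens = ["*"] + sent + ["STOP"]
    vocab.update(tokens)
    unigrams.update(tokens[:-1])                 # counts of conditioning histories
    bigrams.update(zip(tokens[:-1], tokens[1:])) # counts of (w_{i-1}, w_i) pairs

V = len(vocab)

def p_add1(w, prev):
    """Add-one smoothed bigram probability P(w | prev)."""
    return (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)

print(p_add1("offer", "the"))   # > 0 even though "the offer" never occurred
print(p_add1("agency", "the"))  # seen bigram, higher probability
```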
We train the maximumlikelihood estimates on the training data, then we define c ′ (w1 , w2 , w3 ) as the count of a trigram in the validation set. Then we define: L(λ1 , λ2 , λ3 ) = ∑ c ′ (w1 , w2 , w3 ) log q(w3 |w1 , w2 ) w1 ,w2 ,w3 And choose λ1 , λ2 , λ3 to maximize L (this ends up being the same as choosing λ1 , λ2 , λ3 to minimize the perplexity). In practice, however, the λ values are allowed to vary. We define a function Π that partitions histories, e.g. 1 if Count(wi−1 , wi−2 ) = 0 2 if 1 ≤ Count(wi−1 , wi−2 ) ≤ 2 Π(wi−2 , wi−1 ) = 3 if 3 ≤ Count(wi−1 , wi−2 ) ≤ 5 4 otherwise These partitions are usually chosen by hand. Then we vary the λ values based on the partition: Π(wi−2 ,wi−1 ) q(wi |wi−2 , wi−1 ) = λ1 Π(wi−2 ,wi−1 ) Where lambda1 CHAPTER 17. NLP Π(wi−2 ,wi−1 ) qML (wi |wi−2 , wi−1 )+λ2 Π(wi−2 ,wi−1 ) + lambda2 Π(wi−2 ,wi−1 ) qML (wi |wi−1 )+λ3 qML (wi ) + lambda3 Π(wi−2 , wi−1 ) = 1 and each are ≥ 0. 491 17.6. (PROBABILISTIC) LANGUAGE MODELS 492 Discounting methods Generally, these maximum likelihood estimates can be high, so we can define “discounted” counts, e.g. Count ∗ (x) = Count(x) − 0.5 (the value to discount by can be determined on a validation set, like the λ values from before). As a result of these discounted counts, we will have some probability mass left over, which is defined as: α(wi−1 ) = 1 − ∑ Count ∗ (wi−1 , w ) w Count(wi−1 ) We can assign this leftover probability mass to words we have not yet seen. We can use a Katz Back-Off model. First we will consider the bigram model. We define two sets: A(wi−1 ) = {w : Count(wi−1 , w ) > 0} B(wi−1 ) = {w : Count(wi−1 , w ) = 0} Then the bigram model: i−1 ,wi ) Count∗(w Count(w ) i−1 qBO (wi |wi−1 ) = α(wi−1 ) ∑ if wi ∈ A(wi−1 ) qML (wi ) qML (w ) w ∈B(wi−1 ) if wi ∈ B(wi−1 ) Where α(wi−1 ) = 1 − Count ∗ (wi−1 , w ) Count(wi−1 ) w ∈A(wi−1 ) ∑ Basically, this assigns the leftover probability mass to bigrams that were not previously encountered. The Katz Back-Off model can be extended to trigrams as well: A(wi−2 , wi−1 ) = {w : Count(wi−2 , wi−1 , w ) > 0} B(wi−2 , wi−1 ) = {w : Count(wi−2 , wi−1 , w ) = 0} i−2 ,wi−1 ,wi ) Count∗(w Count(w ,w ) if wi ∈ A(wi−2 , wi−1 ) i−2 i−1 qBO (wi |wi−2 , wi−1 ) = α(wi−2 , wi−1 ) ∑ qBO (wi |wi−1 ) q (w |wi−1 ) w ∈B(w ,w ) BO i−2 α(wi−2 , wi−1 ) = 1 − 492 if wi ∈ B(wi−2 , wi−1 ) i−1 Count ∗ (wi−2 , wi−1 , w ) Count(wi−2 , wi−1 ) w ∈A(wi−2 ,wi−1 ) ∑ CHAPTER 17. NLP 493 17.6.4 17.6. (PROBABILISTIC) LANGUAGE MODELS Log-Linear Models When it comes to language models, the trigram model may be insufficient. There may be more information than just the previous two words that we want to take into account - for instance, the author of a paper, whether or not a particular word occurs in an earlier context, the part of speech of the preceding word, etc. We may want to do something similar when it comes to tagging, e.g. condition on that a previous word is a particular word, or that it has a particular ending (“ing”, “e”, etc), and so on. We can use log-linear models to capture this extra information (encoded as numerical features, e.g. 1 if the preceding word is “foo”, and 0 otherwise.). With log-linear models, we frame the problem as such: We have some input domain X and a finite label set Y . We want to produce a conditional probability p(y |x) for any x, y where x ∈ X, y ∈ Y . For example, in language modeling, x would be a “history” of words, i.e. w1 , w2 , . . . , wi−1 and y is an “outcome” wi (i.e. the predicted following word). 
We represent our features as vectors (applying indicator functions and so on where necessary). We’ll denote a feature vector for an input/output pair (x, y ) as f (x, y ). We also have a parameter vector equal in length to our feature vectors (e.g. if we have m features, then the parameter vector v ∈ Rm ). We can compute a “score” for a pair (x, y ) as just the dot product of these two: v · f (x, y ) which we can turn into the desired conditional probability p(x|y ): p(y |x; v ) = ∑ e v ·f (x,y ) v ·f (x,y ′ ) y ′ ∈Y e Read as “the probability of y given x under the parameters v ”. This can be re-written as: log p(y |x; v ) = v · f (x, y ) − log ∑ ′ e v ·f (x,y ) y ′ ∈Y This is why such models are called “log-linear”: the v ·f (x, y ) term is the linear term and we calculate ∑ ′ a log probability (and then there is the normalization term log y ′ ∈Y e v ·f (x,y ) ). So how do we estimate the parameters v ? We assume we have training examples (x (i) , y (i) ) for i = 1, . . . , n and that each (x (i) , y (i) ) ∈ X × Y . We can use maximum-likelihood estimates to estimate v , i.e. vML = argmax L(v ) L(v ) = CHAPTER 17. NLP v ∈Rm n ∑ n ∑ i=1 i=1 log p(y (i) |x (i) ; v ) = v · f (x (i) , y (i) ) − n ∑ i=1 log ∑ e v ·f (x (i) ,y ′ ) y ′ ∈Y 493 17.6. (PROBABILISTIC) LANGUAGE MODELS 494 i.e. L(v ) is the log-likelihood of the data under the parameters v , and it is concave so we can optimize it fairly easily with gradient ascent. We can add regularization to improve generalization. 17.6.5 History-based models The models that have been presented so far are called history-based models, in the following sense: • • • • • We break structures down into a derivation (a sequence of decisions) Each decision has an associated conditional probability The probability of a structure is just the product of the decision probabilities that created it The parameter values are estimated using some variant of maximum-likelihood estimation When choose y s such that they maximize either a joint probability p(x, y ; θ) (e.g. in the case of HMMs or PCFGs) or a conditional probability p(y |x; θ) (in the case of log-linear models). 17.6.6 Global Linear Models GLMs extend log-linear models though they are different than history-based models (there are no “derivations” or probabilities for “decisions”). In GLMs, we have feature vectors for entire structures, i.e. “global features”. This allows us to incorporate features that are difficult to include in history-based models. GLMs have three components: • f (x, y ) ∈ Rd which maps a structure (x, y ) (e.g. a sentence and a parse tree) to a feature vector$ to a feature vector$ to a feature vector$ to a feature vector • GEN which is a function that maps an input x to a set of candidates GEN(x). For example, it could return the set of all possible English translations for a French sentence x. • v ∈ Rd is a parameter vector; it is learned from training data So the final output is a function F : X → Y , which ends up being: F (x) = argmax f (x, y ) · v y ∈GEN(x) 17.6.7 Evaluating language models: perplexity Perplexity is a measure of the quality of a language model. Assume we have a set of m test sentences, s1 , s2 , . . . , sm . We can compute the probability of these sentences under our learned model p: 494 CHAPTER 17. NLP 495 17.7. PARSING m ∏ p(si ) i=1 Though typically we look at log probability instead: m ∑ log p(si ) i=1 The perplexity is computed: perplexity = 2−l l= m 1 ∑ log p(si ) M i=1 Where M is the total number of words in the test data. Note that log is log2 . 
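To make the computation concrete, a minimal sketch (the per-sentence log2 probabilities are made up; in practice they come from the language model being evaluated):

```python
def perplexity(sentence_log2_probs, total_words):
    """sentence_log2_probs: log2 p(s_i) for each test sentence under the model;
    total_words: M, the total number of words in the test data."""
    l = sum(sentence_log2_probs) / total_words
    return 2 ** (-l)

# made-up log2 probabilities for three test sentences containing 12 words total
print(perplexity([-8.3, -11.0, -6.2], total_words=12))
```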
Lower perplexity is better (because a high log probability is better, which causes perplexity to be low). 17.7 Parsing The parsing problem takes some input sentence and outputs a parse tree which describes the syntactic structure of the sentence. The leaf nodes of the tree are the words themselves, which are each tagged with a part-of-speech. Then these are grouped into phrases, such as noun phrases (NP) and verb phrases (VP), up to sentences (S) (these are sometimes called constituents). These parse trees can describe grammatical relationships such as subject-verb, verb-object, and so on. [Example of a parse tree, from Tjo3ya] We can treat it as a supervised learning problem by using sentences annotated with parse trees (such data is usually called a “treebank”). 17.7.1 Context-free grammars (CFGs) A formalism for the parsing problem. A context-free grammar is a four-tuple G = (N, Σ, R, S) where: • N is a set of non-terminal symbols • Σ is a set of terminal symbols • R is a set of rules of the form X → Y1 Y2 . . . Yn for n ≥ 0, X ∈ N, Yi ∈ (N ∪ Σ) CHAPTER 17. NLP 495 17.7. PARSING 496 • S ∈ N is a distinguished start symbol An example CFG: • • • • N = {S, NP, VP, PP, DT, Vi, Vt, NN, IN} S=S Σ = {sleeps, saw, woman, telescope, the, with, in} R is the following set of rules: – – – – – – – – – – – – – – – S → NP VP VP → Vi VP → Vt NP VP → VP PP NP → DT NN NP → NP PP PP → IN NP Vi → sleeps Vt → saw NN → man NN → woman NN → telescope DT → the IN → with IN → in Note: - S = sentence - VP = verb phrase - NP = noun phrase - PP = prepositional phrase - DT = determiner - Vi = intransitive verb - Vt = transitive verb - NN = noun - IN = preposition We can derive sentences from this grammar. A left-most derivation is a sequence of strings s1 , . . . , sn where: • s1 = S, the start symbol • sn ∈ Σ∗ ; that is, sn consists only of terminal symbols • each si for i = 2, . . . , n is derived from si−1 by picking the left-most non-terminal X in si−1 and replacing it with some β where X → β is a rule in R. Using the example grammar, we could do: 1. 2. 3. 4. 5. 496 “S” expand “S” to “NP VP” expand “NP” (since it is the left-most symbol) to “D N”, yielding “D N VP” expand “D” (again, it is left-most) to “the”, yielding “the N VP” expand “N” (since the left-most symbol “the” is a terminal symbol) to “man”, yielding “the man VP” CHAPTER 17. NLP 497 17.7. PARSING 6. expand “VP” to “Vi” (since it is the last non-terminal symbol), yielding “the man Vi” 7. expand “Vi” to “sleeps”, yielding “the man sleeps” 8. the sentence consists only of terminal symbols, so we are done. Thus a CFG defines a set of possible derivations, which can be infinite. We say that a string s ∈ Σ∗ is in the language defined by the CFG if we can derive it from the CFG. A string in a CFG may have multiple derivations - this property is called “ambiguity”. For instance, “fruit flies like a banana” is ambiguous in that “fruit flies” may be a noun phrase or it may be a noun and a verb. 17.7.2 Probabilistic Context-Free Grammars (PCFGs) PCFGs are CFGs in which each rule is assigned a probability, which helps with the ambiguity problem. We can compute the probability of a particular derivation as the product of the probability of its rules. We notate the probability of a rule as q(α → β). Note that we have individual probability distributions ∑ ∑ for the left-side of each rule, e.g. q(VP → β) = 1, q(NP → β = 1, and so on. Another way of saying this is these distributions are conditioned on the left-side of the rule. 
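As a small illustration, here is a toy fragment of the example grammar with invented rule probabilities (each left-hand side's probabilities sum to 1), and the probability of a derivation computed as the product of its rule probabilities:

```python
# Rule probabilities are invented for illustration; for each left-hand side
# the probabilities sum to 1, as required.
q = {
    ("S",  ("NP", "VP")): 1.0,
    ("NP", ("DT", "NN")): 0.8,
    ("NP", ("NP", "PP")): 0.2,
    ("VP", ("Vi",)):      0.5,
    ("VP", ("Vt", "NP")): 0.3,
    ("VP", ("VP", "PP")): 0.2,
    ("DT", ("the",)):     1.0,
    ("NN", ("man",)):     0.5,
    ("NN", ("dog",)):     0.5,
    ("Vi", ("sleeps",)):  1.0,
}

def derivation_probability(rules):
    """Probability of a derivation = product of the probabilities of its rules."""
    p = 1.0
    for lhs, rhs in rules:
        p *= q[(lhs, rhs)]
    return p

# derivation of "the man sleeps"
tree = [("S", ("NP", "VP")), ("NP", ("DT", "NN")), ("DT", ("the",)),
        ("NN", ("man",)), ("VP", ("Vi",)), ("Vi", ("sleeps",))]
print(derivation_probability(tree))   # 1.0 * 0.8 * 1.0 * 0.5 * 0.5 * 1.0 = 0.2
```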
These probabilities can be learned from data as well, simply by counting all the rules in a treebank and using maximum likelihood estimates: qML (α → β) = Count(α → β) Count(α) Given a PCFG, a sentence s, and a set of trees which yield s as T(s), we want to compute argmaxt∈T(s) p(t). That is, given a sentence, what is the most likely parse tree to have produced this sentence? A challenge here is that |T(s)| may be very large, so brute-force search is not an option. We can use the CKY algorithm instead. First we will assume the CFG is in Chomsky normal form. A CFG is in Chomsky normal form if the rules in R take one of two forms: • X → Y1 Y2 for X, Y1 , Y2 ∈ N • X → Y for X ∈ N, Y ∈ Σ In practice, any PCFG can be converted to an equivalent PCFG in Chomsky normal form by combining multiple symbols into single symbols (e.g. you can convert VP → Vt NP PP by defining a new symbol Vt-NP → Vt NP and then redefining VP → Vt-NP PP). First, let’s consider the problem maxt∈T(s) p(t). Notation: • n = number of words in the sentence CHAPTER 17. NLP 497 17.7. PARSING 498 • wi = the i th word in the sentence We define a dynamic programming table π[i, j, X] which is the maximum probability of a constituent with non-terminal X spanning the words i , . . . , j inclusive. We set i, j ∈ 1, . . . , n and i ≤ j. We want to calculate maxt∈T(s)p(t) = π[1, n, S], i.e. the max probability for a parse tree spanning the first through the last word of the sentence with the S symbol. We will use a recursive definition of π. The base case is: for all i = 1, . . . , n for X ∈ N, π[i, i , X] = q(X → wi ). If X → wi is not in the grammar, then q(X → wi ) = 0. The recursive definition is: for all i = 1, . . . , (n − 1) and j = (i + 1), . . . , n and X ∈ N: π(i, j, X) = max X→Y Z∈R,s∈{i,...,(j−1)} q(X → Y Z)π(i, s, Y )π(s + 1, j, Z) s is called the “split point” because it determines where the word sequence from i to j (inclusive) is split. The full CKY algorithm: Initialization: For all i ∈ {i, . . . , n}, for all X ∈ N: q(X → xi ) π(i , i , X) = 0 ifX → xi ∈ R otherwise Then: • For l = 1, . . . , (n − 1) – For i = 1, . . . , (n − l) * Set j = i + 1 * For all X ∈ N, calculate: π(i , j, X) = max X→Y Z∈R,s∈{i,...,(j−1)} bp(i , j, X) = argmax X→Y Z∈R,s∈{i,...,(j−1)} q(X → Y Z)π(i, s, Y )π(s + 1, j, Z) q(X → Y Z)π(i, s, Y )π(s + 1, j, Z) This has the runtime O(n3 |N|3 ) because the l and i loops n times each, giving us n2 , then at the inner-most loop (for all X ∈ N) loops |N| times, then X → Y Z ∈ R has |N|2 values to search through because these are |N| choices for Y and |N| choices for Z. Then there are also n choices to search through for s. 498 CHAPTER 17. NLP 499 17.7. PARSING Weaknesses of PCFGs PCFGs (as described above) don’t perform very well; they have two main shortcomings: • Lack of sensitivity to lexical information – that is, attachment is completely independent of the words themselves • Lack of sensitivity to structural frequencies – for example, with the phrase “president of a company in Africa”, “in Africa” can be attached to either “president” or “company”. If we were to parse this phrase, we might come up with two trees described by exactly the same rule sets, the only difference is where the PP “in Africa” is attached to. Since they are exactly the same rule sets, they have the same probability, so the PCFG can’t distinguish the two. However, statistically, the “close attachment” structure (i.e. 
generally the PP would attach to the closer object, in this case, “company”) is more frequent, so it should be preferred. Lexicalized PCFGs Lexicalized PCFGs deal with the above weaknesses. For a non-terminal rule, we specify one its children as the “head” of the rule, which is essentially the most “important” part of the rule (e.g. for the rule VP → VtNP, the verb Vt is the most important semantic part and thus the head). We define another set of rules which identifies the heads of our grammar’s rules, e.g. “If the rule contains NN, NNS, or NNP, choose the rightmost NN, NNS, or NNP as the head”. Now when we construct the tree, we annotate each node with its headword (that is, the word that is in the place of the head of a rule). For instance, say we have the following tree: VP ��� Vt � ��� questioned ��� NP ��� DT � ��� the ��� NN ��� witness We annotate each node with its headword: VP(questioned) ��� Vt(questioned) � ��� questioned CHAPTER 17. NLP 499 17.7. PARSING 500 ��� NP(witness) ��� DT(the) � ��� the ��� NN(witness) ��� witness We can revise our Chomsky Normal Form for lexicalized PCFGs by defining the rules in R to have one of the following three forms: • X(h) →1 Y1 (h)Y2 (w ) for X, Y1 , Y2 ∈ N and h, w ∈ Σ • X(h) →2 Y1 (w )Y2 (h) for X, Y1 , Y2 ∈ N and h, w ∈ Σ • X(h) → h for X ∈ N, h ∈ Σ Note the subscripts on →1 , →2 which indicate which of the children is the head. Parsing lexicalized PCFGs That is, we consider rules with words, e.g. NN(dog) is a different rule than NN(cat). By doing so, we increase the number of possible rules to O(|Σ|2 |N|3 ), which is a lot. However, given a sentence w1 , w2 , . . . , wn , at most O(n2 |N|3 ) rules are applicable because we can disregard any rule that does not contain one of w1 , w2 , . . . , wn ; this makes parsing lexicalized PCFGs a bit easier (it can be done in O(N 5 |N|3 ) time rather than O(n3 |Σ|2 |N|3 ) time, which is the runtime if we consider all possible rules). Parameter estimatino in lexicalized PCFGs In a lexicalized PCFGs, our parameters take the form: q(S(saw) →2 NP(man)VP(saw)) We decompose this parameter into a product of two parameters: q(S →2 NP VP|S, saw)q(man|S →2 NP VP, saw) The first term describes: given S(saw), what is the probability that it expands →2 NP VP? The second term describes: given the rule S →2 NP VP and the headword saw, what is the probability that man is the headword of NP? Then we used smoothed estimation for the two parameter estimates (we’re using linear interpolation): q(S →2 NP VP|S, saw) = λ1 qML (S →2 NP VP|S, saw) + λ2 qML (S →2 NP VP|S) 500 CHAPTER 17. NLP 501 17.8. TEXT CLASSIFICATION Again, λ1 , λ2 ≥ 0, λ1 + λ2 = 1. To clarify: Count(S(saw) →2 NP VP) Count(S(saw)) Count(S →2 NP VP) qML (S →2 NP VP|S) = Count(S) qML (S →2 NP VP|S, saw) = Here is the linear interpolation for the second parameter: q(man|S →2 NP VP, saw) = λ3 qML (man|S →2 NP VP, saw)+λ4 qML (man|S →2 NP VP)+λ5 qML (man|NP) Again, λ3 , λ4 , λ5 ≥ 0, λ3 + λ4 + λ5 = 1. To clarify, qML (man|NP) describes: given NP, what is the probability that its headword is man? This presentation of PCFGs do not deal with the close attachment issue as described earlier, though there are modified forms which do. 17.8 Text Classification The general text classification problem is given an input document d and a fixed set of classes C = {c1 , c2 , . . . , cj } output a predicted class c ∈ C. 17.8.1 Naive Bayes This supervised approach to classification is based on Bayes’ rule. 
It relies on a very simple representation of the document called “bag of words”, which is ignorant of the sequence or order of word occurrence (and other things), and only pays attention to their counts/frequency. So you can represent the problem with Bayes’ rule: P (c|d) = P (d|c)P (c) P (d) And the particular problem at hand is finding the class which maximizes P (c|d), that is: CMAP = argmaxc∈C P (c|d) = argmaxc∈C P (d|c)P (c) Where CMAP is the maximum a posteriori class. CHAPTER 17. NLP 501 17.8. TEXT CLASSIFICATION 502 Using our bag of words assumption, we represent a document as features x1 , . . . xn without concern for their order: CMAP = argmaxc∈C P (x1 , x2 , . . . , xn |c)P (c) We additionally assume conditional independence, i.e. that the presence of one word doesn’t have any impact on the probability of any other word’s occurrence: P (x1 , x2 , . . . , xn |c) = P (x1 |c) · P (x2 |c) · · · · · P (xn |c) And thus we have the multinomial naive bayes classifier: CNB = argmaxc∈C P (cj ) ∏ P (x|c) x∈X To calculate the prior probabilities, we use the maximum likelihood estimates approach: P (cj ) = doccount(C = cj ) Ndoc That is, the prior probability for a given class is the count of documents in that class over the total number of documents. Then, for words: count(wi , cj ) w ∈V count(w , cj ) P (wi |cj ) = ∑ That is, the count of a word in documents of a given class, over the total count of words in that class. To get around the problem of zero probabilities (for words encountered in test input but not in training, which would cause a probability of a class to be zero since the probability of a class is the joint probability of the words encountered), you can use Laplace smoothing (see above): count(wi , cj ) + 1 P (wi |cj ) = ∑ ( w ∈V count(w , cj )) + |V | Note that to avoid underflow (from multiplying lots of small probabilities), you may want to work with log probabilities (see above). In practice, even with all these assumptions, Naive Bayes can be quite good: • Very fast, low storage requirements 502 CHAPTER 17. NLP 503 • • • • 17.8. TEXT CLASSIFICATION Robust to irrelevant features (they tend to cancel each other out) Very good in domains with many equally important features Optimal if independence assumptions hold A good, dependable baseline for text classification 17.8.2 Evaluating text classification The possible outcomes are: • • • • true positive: correctly identifying something as true false positive: incorrectly identifying something as true true negative: correctly identifying something as false false negative: incorrectly identifying something as false The accuracy of classification is calculated as: accuracy = tp + tn tp + f p + f n + f n Though as a metric it isn’t very useful if you are dealing with situations where the correct class is sparse and most words you encounter are not in the correct class: Say you’re looking for a word that only occurs 0.01% of the time. you have a classifier you run on 100,000 docs and the word appears in 10 docs (so 10 docs are correct, 99,990 are not correct). but you can have that classifier classify all docs as not correct and get an amazing accuracy of 99,990/100,000 = 99.99% but the classifier didn’t actually do anything! So other metrics are needed. 
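Before turning to those metrics, here is a minimal sketch of the multinomial Naive Bayes training and prediction described above, with add-one smoothing and log probabilities to avoid underflow (the documents and labels are invented):

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (list_of_words, class_label) pairs."""
    class_counts = Counter(c for _, c in docs)
    word_counts = defaultdict(Counter)          # per-class word counts
    vocab = set()
    for words, c in docs:
        word_counts[c].update(words)
        vocab.update(words)
    priors = {c: n / len(docs) for c, n in class_counts.items()}
    return priors, word_counts, vocab

def predict_nb(words, priors, word_counts, vocab):
    best_class, best_score = None, float("-inf")
    for c in priors:
        total = sum(word_counts[c].values())
        score = math.log(priors[c])             # log prior
        for w in words:
            # add-one smoothed log likelihood
            score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_class, best_score = c, score
    return best_class

docs = [("good great fun".split(), "pos"),
        ("great plot good cast".split(), "pos"),
        ("bad boring awful".split(), "neg"),
        ("awful acting bad plot".split(), "neg")]
model = train_nb(docs)
print(predict_nb("good fun plot".split(), *model))   # 'pos'
```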
Precision measures the percent of selected items that are correct: precision = tp tp + f p Recall measures the percent of correct items that are selected: recall = tp tp = f n Typically, there is a trade off between recall and precision - the improvement of one comes at the sacrifice of the other. The F measure combines both precision and recall into a single metric: CHAPTER 17. NLP 503 17.9. TAGGING 504 F = (β 2 + 1)P R 1 = β2P + R α P1 + (1 − α) R1 Where α is a weighting value so you can assign more importance to either precision or recall. People usually use the balanced F1 measure, where β = 1 (that is, α = 1/2): F = 2P R P +R 17.9 Tagging A class of NLP problems in which we want to assign a tag to each word in an input sentence. • Part-of-speech tagging: Given an input sentence, output a POS tag for each word. Like in many NLP problems, ambiguity makes this a difficult task. • Named entity recognition: Given an input sentence, identify the named entities in the sentence (e.g. a company, or location, or person, etc) and what type the entity is (other words are tagged as non-entities). Entities can span multiple words, so there will often be “start” and “continue” tags (e.g. for “Wall Street”, “Wall” is tagged as “start company”, and “Street” is tagged as “continue company”). There are two types of constraints in tagging problems: • local: words with multiple meanings can have a bias (a “local preference”) towards one meaning (i.e. one meaning is more likely than the others) • contextual: certain meanings of a word are more likely in certain contexts These constraints can sometimes conflict. 17.9.1 Generative models One approach to tagging problems (and supervised learning in general) is to use a conditional model (often called a discriminative model), i.e. to learn the distribution p(y |x) and select argmaxy p(y |x) as the label. Alternatively, we can use a generative model which instead learns the distribution p(x, y ). We often have p(x, y ) = p(y )p(x|y ), where p(y ) is the prior and p(x|y ) is the conditional generative model. This is generative because we can use this to generate new sentences by sampling the distribution given the words we have so far. We can apply Bayes’ Rule as well to derive the conditional distribution as well: 504 CHAPTER 17. NLP 505 17.9. TAGGING p(y |x) = Where p(x) = ∑ y p(y )p(x|y ) p(x) p(y )p(x|y ). Again, we can select argmaxy p(y |x) as the label, but we can apply Bayes’ Rule to equivalently get )p(x|y ) argmaxy p(yp(x) . But note that p(x) does not vary with y (i.e. it is constant), so it does not affect the argmax, and we can just drop it to get argmaxy p(x)p(x|y ). 17.9.2 Hidden Markov Models (HMM) An example of a generative model. We have an input sentence x = x1 , x2 , . . . , xn where xi is the i th word in the sentence. We also have a tag sequence y = y1 , y2 , . . . , yn where yi is the tag for the i th word in the sentence. We can use a HMM to define the joint distribution p(x1 , x2 , . . . , xn , y1 , y2 , . . . , yn ). Then the most likely tag sequence for x is argmaxy1 ,...,yn p(x1 , x2 , . . . , xn , y1 , y2 , . . . , yn ). Trigram HMMs For any sentence x1 , . . . , xn where xi ∈ V for i = 1, . . . , n and any tag sequence y1 , . . . , yn+1 where yi ∈ S for i = 1, . . . , n and yn+1 = STOP (where S is the set of possible tags, e.g. DT, NN, VB, P, ADV, etc), the joint probability of the sentence and tag sequence is: p(x1 , . . . , xn , y1 , . . . 
, yn+1 ) = n+1 ∏ i=1 q(yi |yi−2 , yi−1 ) n ∏ e(xi |yi ) i=1 Again we assume that x0 = x−1 = ∗. The parameters for this model are: • q(s|u, v ) for any s ∈ S ∪ {STOP}, u, v ∈ S ∪ {∗} • e(x|s) for any s ∈ S, x ∈ V , sometimes called “emission parameters” The first product is the (second-order) Markov chain, quite similar to the trigram Markov chain used before for language modeling, and the e(xi |yi ) terms of the second product are what we have observed. Combined, these produce a hidden Markov model (the Markov chain is “hidden”, since we don’t observe the tag sequences, we only observe the xi s). CHAPTER 17. NLP 505 17.9. TAGGING 506 Parameter estimation in HMMs For the q(yi |yi−2 , yi−1 ) parameters, we can again use a linear interpolation with maximum likelihood estimates approach as before with the trigram language model. For the emission parameters, we can also use a maximum likelihood estimate: e(x|y ) = Count(y , x) Count(y ) However, we again have the issue that e(x|y ) = 0 for all y if we have never seen x in the training data. This will cause the entire joint probability p(x1 , . . . , xn , y1 , . . . , yn+1 ) to become 0. How do we deal with low-frequency words then? We can split the vocabulary into two sets: • frequent words: occurring ≥ t times in the training data, where t is some threshold (e.g. t = 5) • low-frequency words: all other words, including those not seen in the training data Then map low-frequency words into a small, finite set depending on textual features, such as prefixes, suffixes, etc. For example, we may map all all-caps words (e.g. IBM, MTA, etc) to a word class “allCaps”, and we may map all four-digit numbers (e.g. 1988, 2010, etc) to a word class “fourDigitNum”, or all first words of sentences to a word class “firstWord”, and so on. 17.9.3 The Viterbi algorithm We want to compute argmaxy1 ,...,yn p(x1 , x2 , . . . , xn , y1 , y2 , . . . , yn ), but we don’t want to do so via brute-force search. The search space is far too large, growing exponentially with n (the search space’s size is |S|n ). A more efficient way of computing this is to use the Viterbi algorithm: Define Sk for k = −1, . . . , n to be the set of possible tags at position k: S−1 = S0 = {∗} Sk = S∀k ∈ {1, . . . , n} Then we define: r (y−1 , y0 , y1 , . . . , yk ) = k ∏ i=1 q(yi |yi−2 , yi−1 ) k ∏ e(xi |yi ) i=1 This computes the probability from our HMM for a given sequence of tags, y−1 , y0 , y1 , . . . , yk , but only up to the kth position. 506 CHAPTER 17. NLP 507 17.10. NAMED ENTITY RECOGNITION (NER) We define a dynamic programming table: π(k, u, v ) as the maximum probability of a tag sequence ending in tags u, v at position k, i.e: π(k, u, v ) = max (y−1 ,y0 ,y1 ,...,yk ):yk−1 =u,yk =v r (y−1 , y0 , y1 , . . . , yk ) To clarify: k ∈ {1, . . . , n}, u ∈ Sk−1 , v ∈ Sk . For example: say we have the sentence “The man saw the dog with the telescope”, which we re-write as “START START The Man saw the dog with the telescope”. We’ll set Sk = {D, N, V, P } for k ≥ 1 and S−1 = S0 = {∗}. If we want to compute π(7, P, D), then k = 7 so then fix the 7th term with the D tag and the k − 1 term with the P tag. Then we consider all possible tag sequences (ending with P, D) up to the 7th term (e.g. ∗, D, N, V, P, P, P, D and so on) and get the probability of the most likely sequence. We can re-define the above recursively. The base case is π(0, ∗, ∗) = 1 since we always have the two START tokens tagged as ∗ at the beginning. Then, for any k ∈ {1, . . . 
, n} for any u ∈ Sk−1 and v ∈ Sk : π(k, u, v ) = max (π(k − 1, w , u)q(v |w , u)e(xk |v )) w ∈Sk−2 The Viterbi algorithm is just the application of this recursive definition while keeping backpointers to the tag sequences with max probability: • For k = 1, . . . , n – For u ∈ Sk−1 , v ∈ Sk * π(k, u, v ) = maxw ∈Sk−2 (π(k − 1, w , u)q(v |w , u)e(xk |v )) * bp(k, u, v ) = argmaxw ∈Sk−2 (π(k − 1, w , u)q(v |w , u)e(xk |v )) • Set (yn−1 , yn ) = argmax(u,v ) (π(n, u, v )q(STOP|u, v )) • For k = (n − 2), . . . , 1, yk = bp(k + 2, yk1 , yk+2 ) • Return the tag sequence y1 , . . . , yn It has the runtime O(n|S|3 ) because of the loop over k value (for k = 1, . . . , n, so this happens n times), then its inner loops over S twice (for u ∈ Sk−1 and for v ∈ Sk ), with each loop searching over |S|. 17.10 Named Entity Recognition (NER) Named entity recognition is the extraction of entities - people, places, organizations, etc - from a text. CHAPTER 17. NLP 507 17.11. RELATION EXTRACTION 508 Many systems use a combination of statistical techniques, linguistic parsing, and gazetteers to maximize detection recall & precision. Distant supervision and unsupervised techniques can also help with training, limiting the amount of gold-standard data necessary to build a statistical model. Boundary errors are common in NER: First Bank of Chicago announced earnings… Here, the extractor extracted “Bank of Chicago” when the correct entity is the “First Bank of Chicago”. A general NER approach is to use supervised learning: 1. 2. 3. 4. Collect a set of training documents Label each entity with its entity type or O for “other”. Design feature extractors Train a sequence classifier to predict the labels from the data. 17.11 Relation Extraction International Business Machines Corporation (IBM or the company) was incorporate in the State of New York on June 16, 1911, as the Computing-Tabulating-Recording Co. (C-T-R)… From such a text you could extract the following relation triples: Founder-year(IBM,1911) Founding-location(IBM,New York) These relations may be represented as resource description framework (RDF) triples in the form of subject predicate object. Golden Gate Park location San Francisco 17.11.1 Ontological Relations • IS-A describes a subsumption between classes, called a hypernum: Giraffe IS-A ruminant IS-A ungulate IS-A mammal IS-A vertebrate IS-A animal… • instance-of relation between individual and class San Francisco instance-of city There may be many domain-specific ontological relations as well, such as founded (between a PERSON and an ORGANIZATION), cures (between a DRUG and a DISEASE), etc. 508 CHAPTER 17. NLP 509 17.11. RELATION EXTRACTION 17.11.2 Methods Relation extractors can be built using: • handwritten patterns • supervised machine learning • semi-supervised and unsupervised – bootstrapping (using seeds) – distance supervision – unsupervised learning from the web Handwritten patterns • Advantages: – can take advantage of domain expertise – human patterns tend to be high-precision • Disadvantages: – human patterns are often low-recall – hard to capture all possible patterns Supervised • Advantages: – can get high accuracy if… * there’s enough hand-labeled training data * if the test is similar enough to training • Disadvantages: – labeling a large training set is expensive – don’t generalize well You could use classifiers: find all pairs of named entities, then use a classifier to determine if the two are related or not. 
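Returning for a moment to the Viterbi algorithm of 17.9.3, the recursion and backpointers can be sketched compactly. Here q and e are dictionaries of transition and emission probabilities estimated as described earlier; the toy parameters below are invented for illustration.

```python
def viterbi(x, S, q, e):
    """Trigram HMM Viterbi sketch. x: list of words; S: list of tags.
    q[(v, w, u)] = q(v | w, u); e[(word, tag)] = e(word | tag).
    Missing entries are treated as probability 0."""
    n = len(x)
    def tags(k):
        return ["*"] if k <= 0 else S
    pi = {(0, "*", "*"): 1.0}
    bp = {}
    for k in range(1, n + 1):
        for u in tags(k - 1):
            for v in tags(k):
                scores = {w: pi.get((k - 1, w, u), 0.0)
                             * q.get((v, w, u), 0.0)
                             * e.get((x[k - 1], v), 0.0)
                          for w in tags(k - 2)}
                best = max(scores, key=scores.get)
                pi[(k, u, v)], bp[(k, u, v)] = scores[best], best
    # choose the final two tags, including the transition to STOP
    u, v = max(((a, b) for a in tags(n - 1) for b in tags(n)),
               key=lambda p: pi.get((n,) + p, 0.0) * q.get(("STOP",) + p, 0.0))
    y = {n - 1: u, n: v}
    for k in range(n - 2, 0, -1):            # follow backpointers
        y[k] = bp[(k + 2, y[k + 1], y[k + 2])]
    return [y[k] for k in range(1, n + 1)]

# invented toy parameters; in practice q comes from smoothed counts and e from
# the emission estimates described earlier
S = ["D", "N", "V"]
q = {("D", "*", "*"): 1.0, ("N", "*", "D"): 1.0,
     ("V", "D", "N"): 1.0, ("STOP", "N", "V"): 1.0}
e = {("the", "D"): 0.5, ("dog", "N"): 0.4, ("barks", "V"): 0.3}
print(viterbi("the dog barks".split(), S, q, e))   # ['D', 'N', 'V']
```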
Unsupervised If you have no training set and either only a few seed tuples or a few high-precision patterns, you can bootstrap and use the seeds to accumulate more data. The general approach is: 1. Gather a set of seed pairs that have a relation R CHAPTER 17. NLP 509 17.12. SENTIMENT ANALYSIS 510 2. Iterate: 1. 2. 3. 4. Find sentences with these pairs Look at the context between or around the pair Generalize the context to create patterns Use these patterns to find more pairs For example, say we have the seed tuple < Mark Twain, Elmira >. We could use Google or some other set of documents to search based on this tuple. We might find: • “Mark Twain is buried in Elmira, NY” • “The grave of Mark Twain is in Elmira” • “Elmira is Mark Twain’s final resting place” which gives us the patterns: • “X is buried in Y” • “The grave of X is in Y” • “Y is X’s final resting place” Then we can use these patterns to search and find more tuples, then use those tuples to find more patterns, etc. Two algorithms for this bootstrapping is the Dipre algorithm and the Snowball algorithm, which is a version of Dipre which requires the strings be named entities rather than any string. Another semi-supervised algorithm is distance supervision, which mixes bootstrapping and supervised learning. Instead of a few seeds, you use a large database to extract a large number of seed examples and go from there: 1. 2. 3. 4. 5. For each relation R For each tuple in a big database Find sentences in a large corpus with both entities of the tuple Extract frequent contextual features/patterns Train a supervised classifier using the extracted patterns 17.12 Sentiment Analysis In general, sentiment analysis involves trying to figure out if a sentence/doc/etc is positive/favorable or negative/unfavorable; i.e. detecting attitudes in a text. The attitude may be • a simple weighted polarity (positive, negative, neutral), which is more common 510 CHAPTER 17. NLP 511 17.12. SENTIMENT ANALYSIS • from a set of types (like, love, hate, value, desire, etc) When using multinomial Naive Bayes for sentiment analysis, it’s often better to use binarized multinomial Naive Bayes under the assumption that word occurrence matters more than word frequency: seeing “fantastic” five times may not tell us much more than seeing it once. So in this version, you would cap word frequencies at one. An alternate approach is to use log(f r eq(w )) instead of 1 for the count. However, sometimes raw word counts don’t work well either. In the case of IMDB ratings, the word “bad” appears in more 10-star reviews than it does in 2-star reviews! Instead, you’d calculate the likelihood of that word occurring in an n-star review: P (w |c) = ∑ f (w , c) w ∈C f (w , c) And then you’d used the scaled likelihood to make these likelihoods comparable between words: P (w |c) P (w ) 17.12.1 Sentiment Lexicons Certain words have specific sentiment; there are a variety of sentiment lexicons which specify those relationships. 17.12.2 Challenges Negation “I didn’t like this movie” vs “I really like this movie.” One way to handle negation is to prefix every word following a negation word with NOT_, e.g. “I didn’t NOT_like NOT_this NOT_movie”. “Thwarted Expectations” problem For example, a film review which talks about how great a film should be, but fails to live up to those expectations: This film should be brilliant. It sounds like a great plot, the actors are first grade, and the supporting cast is good as well, and Stallone is attempting to deliver a good performance. 
However, it can’t hold up. CHAPTER 17. NLP 511 17.13. SUMMARIZATION 17.13 512 Summarization Generally, sumamrization is about producing an abridged version of a text without or with minimal loss of important information. There are a few ways to categorize summarization problems. • Single-document vs multi-document summarization: summarizing a single document, yielding an abstract or outline or headline, or producing a gist of the content of multiple documents? • Generic vs query-focused summarization: give a general summary of the document, or a summary tailored to a particular user query? • Extractive vs abstractive: create a summary from sentences pulled from the document, or generate new text for the summary? Here, extractive summarization will be the focus (abstractive summarization is really hard). The baseline used in summarization, which often works surprisingly well, is just to take the first sentence of a document. 17.13.1 The general approach Summarization usually uses this process: 1. Content Selection: choose what sentences to use from the document. • You may weight salient words based on tf-idf, its presence in the query (if there is one), or based on topic signature. – For the latter, you can use log-likelihood ratio (LLR): 1 if − 2 log λ(wi ) > 10 w ei ght(wi ) = 0 otherwise • Weight a sentence (or a part of a sentence, i.e. a window) by the weights of its words: w eight(s) = 1 ∑ w ei ght(w ) |S| w ∈S • You can combine LLR with maximal marginal relevance (MMR), which is a greedy algorithm which selects sentences by their similarity to the query and by their dissimilarity (novelty) to already-selected sentences to avoid redundancy. 2. Information Ordering: choose the order for the sentences in the summary. • If you are summarizing documents with some chronological order to them, such as the news, then it makes sense to order sentences chronologically (if you are, for example, summarizing a set of news articles). 512 CHAPTER 17. NLP 513 17.14. MACHINE TRANSLATION • You can also use topical ordering, and order sentences by the order of topics in the source documents. • You can also use coherence: – Choose orderings that make neighboring sentences (cosine) similar. – Choose orderings in which neighboring sentences discuss the same entity. 3. Sentence Realization: clean up the sentences so that the summary is coherent or remove unnecessary content. You could remove: • appositives: “Rajam[, an artist living in Philadelphia], found inspiration in the back of city magazines.” • attribution clauses: “Sources said Wednesday” • initial adverbials: “For example”, “At this point” 17.14 17.14.1 Machine Translation Challenges in machine translation • lexical ambiguity (e.g. “bank” as financial institution, or as in a “river bank”) • differing word orders (e.g. English is subject-verb-object and Japanese is subject-object-verb) • syntactic structure can vary across languages (e.g. “The bottle floated into the cave” when translated into Spanish has the literal meaning “the bottle entered the cave floating”; the verb “floated” becomes an adverb “floating” modifying “entered”) • syntactic ambiguity (e.g. “John hit the dog with the stick” can have two different translations depending on whether “with the stick” attaches to “John” or to “hit the dog”) • pronoun resolution (e.g. “The computer outputs the data; it is stored in ASCII” - what is “it” referring to?) 
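Looping back to the content selection step of 17.13.1, here is a minimal sketch of weighting sentences by the average weight of their words and greedily selecting the top-scoring ones. The word weights are invented; in practice they would come from LLR topic signatures or tf-idf.

```python
def sentence_weight(sentence, word_weight):
    """weight(s) = (1/|S|) * sum of the weights of the words in s."""
    words = [w.strip(".,").lower() for w in sentence.split()]
    return sum(word_weight.get(w, 0.0) for w in words) / len(words)

def select_sentences(sentences, word_weight, k=2):
    # greedy content selection: take the k highest-weighted sentences,
    # then restore the original document order
    ranked = sorted(sentences, key=lambda s: sentence_weight(s, word_weight),
                    reverse=True)
    chosen = set(ranked[:k])
    return [s for s in sentences if s in chosen]

# invented word weights standing in for topic-signature or tf-idf weights
weights = {"earnings": 1.0, "profit": 1.0, "quarter": 1.0}
doc = ["The company reported record earnings this quarter.",
       "Its offices are located downtown.",
       "Profit rose sharply compared to last year."]
print(select_sentences(doc, weights))
```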
17.14.2 Classical machine translation methods Early machine translation methods used direct machine translation, which involved translating wordby-word by using a set of rules for translating particular words. Once the words are translated, reordering rules are applied. But such rule-based systems quickly become unwieldy and fail to encompass the variety of ways words can be used in languages. There are also transfer-based approaches, which have three phases: 1. Analysis: analyze the source language sentence (e.g. a syntactic analysis to generate a parse tree) 2. Transfer: convert the source-language parse tree to a target-language parse tree based on a set of rules 3. Generation: convert the target-language parse tree to an output sentence CHAPTER 17. NLP 513 17.14. MACHINE TRANSLATION 514 Another approach is interlingua-based translation, which involves two phases: 1. Analysis: analyze the source language sentence into a language-independent representation of its meaning 2. Generation: convert the meaning representation into an output sentence 17.14.3 Statistical machine translation methods If we have parallel corpora (parallel meaning that they “line up”) for the source and target languages, we can use these as training sets for translation (that is, used a supervised learning approach rather than a rule-based one). The Noisy Channel Model The noisy channel model has two components: • p(e), the language model (trained from just the target corpus, could be, for example, a trigram model) • p(f |e), the translation model Where e is a target language sentence (e.g. English) and f is a source language sentence (e.g. French). We want to generate a model p(e|f ) which estimates the conditional probability of a target sentence e given the source sentence f . So we have the following, using Bayes’ Rule: p(e|f ) = p(e, f ) p(e)p(f |e) =∑ p(f ) e p(e)p(f |e) argmax p(e|f ) = argmax p(e)p(f |e) e e IBM translation models IBM Model 1 We want to model p(f |e), where e is the source language sentence with l words, and f is the target language sentence with m words. We say that an alignment a identifies which source word each target word originated from; that is, a = {a1 , . . . , am } where each aj ∈ {0, . . . , l}, and if aj = 0 then it does not align to any word. There are (l + 1)m possible alignments. Then we define models for p(a|e, m) (the distribution of possible alignments) and p(f |a, e, m), giving: 514 CHAPTER 17. NLP 515 17.14. MACHINE TRANSLATION p(f , a|e, m) = p(a|e, m)p(f |a, e, m) p(f |e, m) = ∑ p(a|e, m)p(f |a, e, m) a∈A Where A is the set of all possible alignments. We can also use the model p(f , a|e, m) to get the distribution of alignments given two sentences: p(f , a|e, m) a∈A p(f , a|e, m) p(a|f , e, m) = ∑ Which we can then use to compute the most likely alignment for a sentence pair f , e: a∗ = argmax p(a|f , e, m) a When we start, we assume that all alignments a are equally likely: p(a|e, m) = 1 (l + 1)m Which is a big simplification but provides a starting point. We want to estimate p(f |a, e, m), which is: p(f |a, e, m) = m ∏ t(fj |eaj ) j=1 Where t(fj |eaj ) is the probability of the source word eaj being aligned with fj . These are the parameters we are interested in learning. So the general generative process is as follows: 1 1. Pick an alignment a with probability (l+1) m 2. Pick the target language words with probability: p(f |a, e, m) = m ∏ t(fj |eaj ) j=1 Then we get our final model: p(f , a|e, m) = p(a|e, m)p(f |a, e, m) = CHAPTER 17. 
NLP m ∏ 1 t(fj |eaj ) (l + 1)m j=1 515 17.14. MACHINE TRANSLATION 516 IBM Model 2 An extension of IBM Model 1; it introduces alignment (also called distortion) parameters q(i |j, l, m), which is the probability that the jth target word is connected to the i th source word. That is, we no longer assume alignments have uniform probability. We define: m ∏ p(a|e, m) = q(aj |j, l, m) j=1 where a = {a1 , . . . , am }. This now gives us the following as our final model: p(f , a|e, m) = m ∏ q(aj |j, l, m)t(fj |eaj ) i=1 In overview, the generative process for IBM model 2 is: 1. Pick an alignment a = {a1 , a2 , . . . , am } with probability: m ∏ q(aj |j, l, m) j=1 2. Pick the target language words with probability: p(f , a|e, m) = m ∏ t(fj |eaj ) j=1 Which is equivalent to the final model described above. Then we can use this model to get the most likely alignment for any sentence pair: Given a sentence pair e1 , e2 , . . . , el and f1 , f2 , . . . , fm : aj = argmax q(a|j, l, m)t(fj |ea ) a∈{0,...,l} For j = 1, . . . , m. 516 CHAPTER 17. NLP 517 17.14. MACHINE TRANSLATION Estimating the q and t parameters We need to estimate our q(i |j, l, m) and t(f |e) parameters. We have a parallel corpus of sentence pairs, a single example of which is notated (e (k) , f (k) ) for k = 1, . . . , n. Our training examples do not have alignments annotated (if we did, we could just use maximum ) Count(j|i,l,m) likelihood estimates, e.g. tML (f |e) = Count(e,f Count(e) and qML (j|i, l, m) = Count(i,l,m) ). We can use the Expectation Maximization algorithm to estimate these parameters. We initialize our q and t parameters to random values. Then we iteratively do the following until convergence: 1. Compute “counts” (called expected counts) based on the data and our current parameter estimates 2. Re-estimate the parameters with these counts The amount we increment counts by is: q(j|i , lk , mk )t(fi (k) |ej(k) ) δ(k, i , j) = ∑l k j=0 q(j|i, lk , mk )t(fi (k) |ej(k) ) The algorithm for updating counts c is: • For k = 1, . . . , n • For i = 1, . . . , mk , for j = 0, . . . , lk – – – – c(ej(k) , fi (k) )+ = δ(k, i , j) c(ej(k) )+ = δ(k, i, j) c(j|i , l, m)+ = δ(k, i , j) c(i, l, m)+ = δ(k, i , j) Then recalculate the parameters: c(e, f ) c(e) c(j|i, l, m) q(j|i, l, m) = c(i , l, m) t(f |e) = How does this method work? First we define the log-likelihood function as a function of our t and q parameters: L(t, q) = n ∑ k=1 log p(f (k) |e (k) ) = n ∑ k=1 log ∑ p(f (k) a|e (k) ) a Which quantifies how well our current parameter estimates fit the data. CHAPTER 17. NLP 517 17.14. MACHINE TRANSLATION 518 So the maximum likelihood estimates are just: argmax L(t, q) t,q Though the EM algorithm will converge only to a local maximum of the log-likelihood function. 17.14.4 Phrase-Based Translation Phrase-based models must extract a phrase-based (PB) lexicon, which consists of pairs of matching phrases (consisting of one or more words), one from the source language, from the target language. This phrase lexicon can be learned from alignments. However, alignments are many-to-one; that is, multiple words in the target language can map to a single word in the source language, but the reverse cannot happen. A workaround is to learn alignments in both ways (i.e. from source to target and from target to source), then look at the intersections of these alignments as (a starting point) the phrase lexicon. This phrase lexicon can be expanded (“grown”) through some heuristics (not covered here). 
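Stepping back briefly to the IBM models, the EM procedure above can be illustrated with a stripped-down Model 1 that keeps alignments uniform and only re-estimates the t(f|e) parameters. The two-sentence parallel corpus is invented, and NULL plays the role of the empty alignment aj = 0.

```python
from collections import defaultdict

# Tiny invented parallel corpus of (target e, source f) sentence pairs.
corpus = [("the dog".split(), "le chien".split()),
          ("the cat".split(), "le chat".split())]

e_vocab = {w for e, _ in corpus for w in e} | {"NULL"}
f_vocab = {w for _, f in corpus for w in f}

# initialize t(f|e) uniformly
t = {(f, e): 1.0 / len(f_vocab) for f in f_vocab for e in e_vocab}

for _ in range(10):                              # EM iterations
    count = defaultdict(float)                   # expected counts c(e, f)
    total = defaultdict(float)                   # expected counts c(e)
    for e_sent, f_sent in corpus:
        e_sent = ["NULL"] + e_sent
        for f in f_sent:
            norm = sum(t[(f, e)] for e in e_sent)
            for e in e_sent:
                delta = t[(f, e)] / norm         # posterior that f aligns to e
                count[(e, f)] += delta
                total[e] += delta
    # re-estimate t(f|e) = c(e, f) / c(e)
    t = {(f, e): count[(e, f)] / total[e] if total[e] else 0.0
         for f in f_vocab for e in e_vocab}

print(round(t[("chien", "dog")], 3), round(t[("le", "the")], 3))
```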
This phrase lexicon can be noisy, so we want to apply some heuristics to clean it up. In particular, we want phrase pairs that are consistent. A phrase pair (e, f ) is consistent if: 1. There is at least one word in e aligned to a word in f 2. There are no words in f aligned to words outside e 3. There are no words in e aligned to words outside f We discard any phrase pairs that are not consistent. We can use these phrases to estimate the parameter t(f |e) easily: t(f |e) = Count(f , e) Count(e) We give each phrase pair (f , e) a score g(f , e). For example: g(f , e) = log( Count(f , e) ) Count(e) A phrase-based model consists of: • a phrase-based lexicon, with a way of computing a score for each phrase pair • a trigram language model with parameters q(w |u, v ) • a distortion parameter η, which is typically negative 518 CHAPTER 17. NLP 519 17.14. MACHINE TRANSLATION Given an input (source language) sentence x1 , . . . , xn , a phrase is a tuple (s, t, e) which indicates that the subsequence xs , . . . , xt can be translated to the string e in the target language using a phrase pair in the lexicon. We denote P as the set of all phrases for a sentence. For any phrase p, s(p), t(p), e(p) correspond to its components in the tuple. g(p) is the score for the phrase. A derivation y is a finite sequence of phrases p1 , p2 , . . . , pL where each phrase is in P . The underlying translation defined by y is denoted e(y ) (that is, e(y ) just represents the combined string of y ’s phrases). For an input sentence x = x1 , . . . , xn , we refer to the set of valid derivations for x as Y (x). It is a set of all finite length sequences of phrases p1 , p2 , . . . , pL such that: • Each phrase pk , k ∈ {1, . . . , L} is a member of the set of phrases P • Each word in x is translated exactly once • For all k ∈ {1, . . . , (L − 1)}, |t(pk ) + 1 − s(pk+1 )| ≤ d where d ≥ 0 is a parameter of the model. We must also have |1 − s(p1 ) ≤ d. d is the distortion limit which constrains how far phrases can move (a typical value is d = 4). Empirically, this results in better translations, and it also reduces the search space of possible translations. Y (x) is exponential in size (it grows exponentially with sentence length), so it gets quite large. Now we want to score these derivations and select the highest-scoring one as the translation, i.e. argmax f (y ) y ∈Y (x) Where f (y ) is the scoring function. It typically involves a product of a language model and a translation model. In particular, we have the scoring function: f (y ) = h(e(y )) + L ∑ k=1 g(pk ) + L−1 ∑ η|t(pk ) + 1 − s(pk+1 ) k=0 Where e(y ) is the sequence of words in the translation, h(e(y )) is the score of the sequence of words under the language model (e.g. a trigram language model), g(pk ) is the score for the phrase pk , and the last summation is the distortion score, which penalizes distortions (so that we favor smaller distortions). We also define t(p0 ) = 0. Because Y (x) is exponential in size, we want to avoid a brute-force method for identifying the highest-scoring derivation. In fact, it is an NP-Hard problem, so we must apply a heuristic method in particular, using beam search. CHAPTER 17. NLP 519 17.14. 
MACHINE TRANSLATION 520 For this algorithm, called the Decoding Algorithm, we keep a state as a tuple (e1 , e2 , b, r, α) where e1 , e2 are target words, b is a bit-string of length n (that is, the same length of the input sentence) which indicates which words in the source sentence have been translated, r is the integer specifying the endpoint of the last phrase in the state, and α is a score for the state. The initial state is q0 = (∗, ∗, 0n , 0, 0), where 0n is a bit-string of length n with all zeros. We can represent the state of possible translations as a graph of these states, e.g. the source sentence has many initial possible translation states, which each also lead to many other possible states, etc. As mentioned earlier, this graph becomes far too large to brute-force search through. We define ph(q) as a function which returns the set of phrases that can follow state q. For a phrase p to be a member of ph(q), it must satisfy the following: • p must not overlap with the bit-string b, i.e. bi = 0 for i ∈ {s(p), . . . , t(p)}. This formalizes the fact that we don’t want to translate the same word twice. • The distortion limit must not be violated (i.e. |r + 1 − s(p)| ≤ d) We also define next(q, p) to be the state formed by combining the state q with the phrase p (i.e. it is a transition function for the state graph). Formally, we have a state q = (e1 , e2 , b, r, α) and a phrase p = (s, t, ϵ1 , . . . , ϵM ) where ϵi is a word in the phrase. The transition function next(q, p) yields the state $q’ = (e_1’, e_2’, b’, r’, alpha’), defined as follows: • • • • • Define ϵ−1 = e1 , ϵ0 = e2 Define e1′ = ϵM−1 , e2′ = ϵM Define bi′ = 1 for i ∈ {s, . . . , t}. Define bi′ = bi for i ∈ / {s, . . . , t}. Define r ′ = t Define: α′ = α + g(p) + M ∑ log q(ϵi |ϵi−2 , ϵi−1 ) + η|r + 1 − s| i=1 We also define a simple equality function, eq(q, q ′ ) which returns true or false if the two states are equal, ignoring scores (that is, if all their components are equal, without requiring that their scores are equal). The final decoding algorithm: • Inputs: • a sentence x1 , . . . , xn • a phrase-based model (L, h, d, η), where L is the lexicon, h is the language model, d is the distortion limit, and η is the distortion parameter. This model defines the functions ph(q) and next(q, p). 520 CHAPTER 17. NLP 521 17.15. WORD CLUSTERING • Initialization: set Q0 = {q0 }, Qi = ∅ for i = 1, . . . , n, where q0 is the initial state as defined earlier. Each Qi contains possible states in which i words are translated. • For i = 0, . . . , n − 1 • For each state q ∈ beam(Qi ), for each phrase p ∈ ph(q): – q ′ = next(q, p) – Add Add(Qi , q ′ , q, p) where i = len(q ′ ) • Return: highest scoring state in Qn . Backpointers can be used to find the underlying sequence of phrases. Add(Q, q ′ , q, p) is defined: • If there is some q ′′ ∈ Q such that eq(q ′′ , q) is true: • if α(q ′ ) > α(q ′′ ) – Q = {q ′ } ∪ Q {q ′′ } (remove the lower scoring state, add the higher scoring one) – set bp(q ′ ) = (q, p) • • • • else return Else Q = Q ∪ {q ′ } set bp(q ′ ) = (q, p) That is, if we already have an equivalent state, keep the higher scoring of the two, and we keep a backpointer of how we got there. beam(Q) is defined: First define α∗ = argmaxq∈Q α(q). We define β ≥ 0 to be the beam-width parameter. 
Then beam(Q) = {q ∈ Q : α(q) ≥ α ∗ −β} 17.15 Word Clustering The Brown clustering algorithm is an unsupervised method which take as input some large quantity of sentences, and from that, learns useful representations of words, outputting a hierarchical word clustering (e.g. weekdays and weekends might be clustered together, months may be clustered together, family relations may be clustered, etc). The general intuition is that similar words appear in similar contexts - that is, they have similar distributions of words to their immediate left and right. We have a set of all words seen in the corpus V = {w1 , w2 , . . . , wT }. Say C : V → {1, 2, . . . , k} is a partition of the vocabulary into k classes (that is, C maps each word to a class label). The model is as follows, where C(w0 ) is a special start state: p(w1 , w2 , . . . , wT ) = n ∏ e(wi |C(wi ))q(C(wi )|C(wi−1 )) i=1 CHAPTER 17. NLP 521 17.15. WORD CLUSTERING 522 Which can be restated: log p(w1 , w2 , . . . , wT ) = n ∑ log e(wi |C(wi ))q(C(wi )|C(wi−1 )) i=1 So we want to learn the parameters e(v |c) for every v ∈ V, c ∈ {1, . . . , k} and q(c ′ |c) for every c ′ , c ∈ {1, . . . , k}. We first need to measure the quality of a partition C: Quality(C) = = n ∑ log e(wi |C(wi ))q(C(wi )|C(wi−1 )) i=1 k ∑ k ∑ p(c, c ′ ) log c=1 c ′ =1 p(c, c ′ ) +G p(c)p(c ′ ) Where G is some constant. This basically computes the likelihood of this corpus under C. Here: n(c, c ′ ) n(c) , p(c) = ∑ ′ c,c ′ n(c, c ) c n(c) p(c, c ′ ) = ∑ Where n(c) is the number of times class c occurs in the corpus, n(c, c ′ ) is the number of times c ′ is seen following c, under the function C. The basic algorithm for Brown clustering is as follows: • • • • Start with |V | clusters (each word gets its own cluster, but by the end we will find k clusters) We run |V | − k merge steps: At each merge step, we pick two clusters ci , cj and merge them into a single cluster We greedily pick merges such that Quality(C) for the clustering C after the merge step is maximized at each stage This approach is inefficient: O(|V |5 ) though it can be improved to O(|V |3 ), which is still quite slow. There is a better way based on this approach: • • • • • We specify a parameter m, e.g. m = 1000 We take the top m most frequent words and puts each into its own cluster, c1 , c2 , . . . , cm . For i = (m + 1) . . . |V | Create a new cluster cm+1 for the i th most frequent word. We now have m + 1 clusters Choose two clusters from c1 , . . . , cm+1 to be merged, picking the merge that gives a max value for Quality(C) (now we just have m clusters again) • Carry out (m − 1) final merges to create a full hierarchy. This has the run time of O(|V |m2 + n), where n is the corpus length. 522 CHAPTER 17. NLP 523 17.16 17.16. NEURAL NETWORKS AND NLP Neural Networks and NLP Typically when words are represented as vectors, it is as a one-hot representation, that is, a vector of length |V | where V is the vocabulary, with all elements 0 except for the one corresponding to the particular word being represented (that is, it is a sparse representation). This can be quite unwieldy as it has dimensionality of |V |, which is typically quite large. We can instead use neural networks to learn dense representations of words (“word embeddings”) of a fixed dimension (the particular dimensionality is specified as a hyperparameter, there is not as of this time a theoretical understanding of how to choose this value) and can capture other properties of words (such as analogies). 
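A small sketch of the two representations (the vocabulary, embedding dimension, and values are arbitrary; a real embedding matrix would be learned):

```python
import numpy as np

vocab = ["the", "dog", "cat", "sleeps"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    # sparse representation: a |V|-dimensional vector with a single 1
    v = np.zeros(len(vocab))
    v[word_to_idx[word]] = 1.0
    return v

# dense representation: a lookup table theta with one row per word,
# here 3-dimensional and randomly initialized (it would be learned in practice)
rng = np.random.default_rng(0)
theta = rng.normal(size=(len(vocab), 3))

def embed(word):
    return theta[word_to_idx[word]]

print(one_hot("dog"))   # [0. 1. 0. 0.]
print(embed("dog"))     # a dense 3-dimensional vector
```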
Representing a sentence can be accomplished by concatenating the embeddings of its words, but this can be problematic in that typically fixed-size vectors are required, and sentences are variable in their word length. A way around this is to use the continuous bag of words (CBOW) representation, in which, like the traditional bag-of-words representation, we throw out word order information and combine the embeddings by summing or averaging them, e.g. given a set of word embeddings v1 , . . . , vk : CBOW(v1 , . . . , vk ) = k 1∑ vi k i=1 An extension of this method is the weighted CBOW (WCBOW) which is just a weighted average of the embeddings. How are these word embeddings learned? Typically, it is by training a neural network (specifically for learning the embeddings) on an auxiliary task. For instance, context prediction is a common embedding training task, in which we try to predict a word given its surrounding context (under the assumption that words which appear in similar contexts are similar in other important ways). 17.16.1 Word Embeddings A word embedding W : words → Rn is a parameterized function that maps words to high-dimensional vectors (typically 200-500 dimensions). This function is typically a lookup table parameterized by a matrix θ, where each row represents a word. That is, the function is often Wθ (wn ) = θn . θ is initialized with random vectors for each word. So given a task involving words, we want to learn W so that we have good representations for each word. You can visualize a word embedding space using t-SNE (a technique for visualizing high-dimensional data): As you can see, words that are similar in meaning tend to be closer together. Intuitively this makes sense - if words have similar meaning, they are somewhat interchangeable, so we expect that their vectors be similar too. CHAPTER 17. NLP 523 17.16. NEURAL NETWORKS AND NLP 524 Visualizing a word embedding space with t-SNE (Turian et al (2010)) We’ll also see the vectors capture notions of analogy, for example “Paris” is to “France” as “Tokyo” is to “Japan”. These kinds of analogies can be represented as vector addition: “Paris” - “France” + “Japan” = “Tokyo”. The best part is the neural network is not explicitly told to learn representations with these properties - it is just a side effect. This is one of the remarkable properties of neural networks - they learn good ways of representing the data more or less on their own. And these representations can be portable. That is, maybe you learn W for one natural language task, but you may be able to re-use W for another natural language task (provided it’s using a similar vocabulary). This practice is sometimes called “pretraining” or “transfer learning” or “multi-task learning”. You can also map multiple words to a single representation, e.g. if you are doing a multilingual task. For example, the English and French words for “dog” could map to the same representation since they mean the same thing (in which case we could call this a “bilingual word embedding”). Here’s an example visualization of a Chinese and English bilingual word embedding: You can even go a step further and learn image and word representations together, so that vectors representing images of horses are close to the vector for the word “horse”. 
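A sketch of the analogy-by-vector-arithmetic idea, using cosine similarity over a tiny made-up embedding table (real embeddings would come from a trained model):

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def analogy(a, b, c, embeddings):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = embeddings[a] - embeddings[b] + embeddings[c]
    candidates = (w for w in embeddings if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(embeddings[w], target))

# tiny invented embedding table; real embeddings would be learned
emb = {w: np.array(v, dtype=float) for w, v in {
    "paris": [1.0, 1.0], "france": [1.0, 0.0],
    "tokyo": [0.0, 1.1], "japan": [0.0, 0.1],
}.items()}
print(analogy("paris", "france", "japan", emb))   # 'tokyo' in this toy example
```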
Two main techniques for learning word embeddings are: • CBOW: predicting the probability of context words given a word • Skip-gram: predicting the probability of a word given context words 17.16.2 CNNs for NLP CBOW representations lose word-ordering information, which can be important for some tasks (e.g. sentiment analysis). CNNs are useful in such situations because they avoid the need of going to, for instance, bigram methods. They can automatically learn important local structures (much as they do with image recognition). 524 CHAPTER 17. NLP 525 17.16. NEURAL NETWORKS AND NLP A Chinese and English word embedding (Socher et al (2013a)) CHAPTER 17. NLP 525 17.16. NEURAL NETWORKS AND NLP 17.16.3 526 References • A Primer on Neural Network Models for Natural Language Processing. Yoav Goldberg. October 5, 2015. • Natural Language Processing. Dan Jurafsky, Christopher Manning, Stanford (Coursera). • Natural Language Processing. Michael Collins. Columbia University (Coursera). 526 CHAPTER 17. NLP 527 18 Unsupervised Learning In unsupervised learning, our data does not have any labels. Unsupervised learning algorithms try to find some structure in the data. An example is a clustering algorithm. We don’t tell the algorithm in advance anything about the structure of the data; it discovers it on its own by figuring how to group them. Some other examples are dimensionality reduction, in which you try to reduce the dimensionality of the data representation, density estimation, in which you estimate the probability distribution of the data, p(x), and feature extraction, in which you try to learn meaningful features automatically. 18.1 k-Nearest Neighbors (kNN) A very simple nonparametric classification algorithm in which you take the k closest neighbors to a point (“closest” depends on the distance metric you choose) and each neighbor constitutes a “vote” for its label. Then you assign the point the label with the most votes. Because this is essentially predicting an input’s label based on similar instances, kNN is a case-based approach. The key with case-based approaches is how you define similarity - a common way is feature dot products: sim(x, x ′ ) = x · x ′ = ∑ xi xi′ i k can be chosen heuristically: generally you don’t want it to be so high that the votes become noisy (in the extreme, if you have n datapoints and set k = n, you will just choose the most common label in the dataset), and you want to chose it so that it is coprime with the number of classes (that is, they share no common divisors except for 1). This prevents ties. Alternatively, you can apply an optimization algorithm to choose k. CHAPTER 18. UNSUPERVISED LEARNING 527 18.2. CLUSTERING 528 Some distances that you can use include Euclidean distance, Manhattan distance (also known as the city block distance or the taxicab distance), Minkowski distance (a generalization of the Manhattan and Euclidean distances), and Mahalanobis distance. Minkowski-type distances assume that data is symmetric; that in all dimensions, distance is on the same scale. Mahalanobis distance, on the other hand, takes into account the standard deviation of each dimension. kNN can work quite quickly when implemented with something like a k-d tree. kNN and other case-based approaches are examples of nonparametric models. With nonparametric models, there is not a fixed set of parameters (which isn’t to say that there are no parameters, though the name “nonparametric” would have you think otherwise). 
18.2 Clustering

18.2.1 K-Means Clustering Algorithm

First, randomly initialize K points, called the cluster centroids. Then iterate:

• Cluster assignment step: go through each data point and assign it to the closest of the K centroids.
• Move centroid step: move each centroid to the average of its assigned points.

Closeness is computed by some distance metric, e.g. Euclidean.

More formally, there are two inputs:

• K - the number of clusters
• the training set x^(1), x^(2), …, x^(m)

where x^(i) ∈ R^n (we drop the x_0 = 1 convention).

Randomly initialize K cluster centroids μ_1, μ_2, …, μ_K ∈ R^n. Repeat:

• For i = 1 to m: c^(i) := index (from 1 to K) of the cluster centroid closest to x^(i). That is, c^(i) := argmin_k ||x^(i) − μ_k||.
• For k = 1 to K: μ_k := average (mean) of the points assigned to cluster k.

If you have an empty cluster, it is common to just eliminate it entirely.

We can notate the cluster centroid of the cluster to which example x^(i) has been assigned as μ_{c^(i)}. In K-means, the optimization objective is:

J(c^(1), …, c^(m), μ_1, …, μ_K) = (1/m) ∑_{i=1}^{m} ||x^(i) − μ_{c^(i)}||²

min_{c^(1), …, c^(m), μ_1, …, μ_K} J(c^(1), …, c^(m), μ_1, …, μ_K)

This cost function is sometimes called the distortion cost function or the distortion of the K-means algorithm. The algorithm outlined above is minimizing this cost: the cluster assignment step minimizes J with respect to c^(1), …, c^(m) (holding the centroids fixed), and the move centroid step minimizes J with respect to μ_1, …, μ_K (holding the assignments fixed).

Note that we randomly initialize the centroids, so different runs of K-means can lead to (very) different clusterings.

One question is: what's the best way to initialize the centroids to avoid local minima of the cost function? First of all, you should have K < m (i.e. fewer clusters than training examples). Then randomly pick K training examples and use these as your initialization points (i.e. set μ_1, …, μ_K to these K examples). Then, to better avoid local optima, just rerun K-means several times (e.g. 50-1000 times) with new random initializations. Keep track of the resulting cost for each run and pick the clustering that gave the lowest cost.

So, how do you choose a good value for K? Unfortunately, there is no good way of doing this automatically. The most common way is to just choose it manually by looking at the output. Even if you plot out the data and look at it, it is difficult - even among people - to come to a consensus on how many clusters there are.

One method that some use is the elbow method. In this approach, you vary K, run K-means, and compute the cost function for each value. If you plot K against the cost, there may be a clear "elbow" in the graph, and you pick the K at the elbow. However, most of the time there isn't a clear elbow, so the method is not very effective.

One drawback of K-means (which many other clustering algorithms share) is that every point gets a cluster assignment, which is to say K-means has no concept of "noise". Furthermore, K-means expects clusters to be globular, so it can't handle more exotic cluster shapes (such as moon-shaped clusters).

There are still many situations where K-means is quite useful, especially since it scales well to large datasets.
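A compact sketch of the algorithm above - random-example initialization, the assignment and move-centroid steps, plus multiple restarts keeping the lowest distortion. The parameter choices are illustrative:

import numpy as np

def kmeans(X, K, iters=100, restarts=10, seed=0):
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(restarts):
        # initialize centroids as K randomly chosen training examples
        mu = X[rng.choice(len(X), size=K, replace=False)].copy()
        for _ in range(iters):
            # cluster assignment step: c_i = argmin_k ||x_i - mu_k||
            c = np.argmin(np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2), axis=1)
            # move centroid step: mu_k = mean of points assigned to cluster k
            for k in range(K):
                if np.any(c == k):          # effectively drop empty clusters
                    mu[k] = X[c == k].mean(axis=0)
        # final assignments under the last centroids, and the distortion
        c = np.argmin(np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2), axis=1)
        J = np.mean(np.sum((X - mu[c]) ** 2, axis=1))   # J = (1/m) sum_i ||x_i - mu_{c_i}||^2
        if best is None or J < best[0]:
            best = (J, c, mu)
    return best  # (lowest distortion, assignments, centroids)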
18.2.2 Hierarchical Agglomerative Clustering

Hierarchical agglomerative clustering (HAC) is a bottom-up clustering process which is fairly simple:

1. Find the two closest data points or clusters and merge them into a cluster (removing the original points or clusters which formed the new cluster).
2. Repeat.

This results in a hierarchy (e.g. a tree structure) describing how the data can be grouped into clusters and clusters of clusters. This structure can be visualized as a dendrogram:

A dendrogram

Two things which must be specified for HAC are:

• the distance metric: Euclidean, cosine, etc.
• the merging approach - that is, how the distance between two clusters is measured:
  – complete linkage - use the distance between the two furthest points
  – average linkage - take the average distance over all pairs of points between the clusters
  – single linkage - take the distance between the two nearest points
  – (there are others as well)

Unlike K-means, HAC is deterministic (since there are no randomly-initialized centroids), but it can be unstable: changing a few points or the presence of some outliers can vastly change the result. Scaling of variables/features can also affect the clustering. HAC does not assume globular clusters, although it does not have a concept of noise.

18.2.3 Affinity Propagation

In affinity propagation, data points "vote" on their preferred "exemplar", which yields a set of exemplars as the initial cluster points. Then we just assign each point to the nearest exemplar.

Affinity propagation is one of the few clustering algorithms which supports non-metric dissimilarities (i.e. the dissimilarities do not need to be symmetric or obey the triangle inequality).

Like K-means, affinity propagation does not have a concept of noise and also assumes that clusters are globular. Unlike K-means, however, it is deterministic, and it does not scale very well (mostly because its support for non-metric dissimilarities precludes many of the optimizations that other algorithms can take advantage of).

18.2.4 Spectral Clustering

With spectral clustering, datapoints are clustered by affinity - that is, by nearby points - rather than by centroids (as with K-means). Using affinity instead of centroids, spectral clustering can identify clusters where K-means fails to.

In spectral clustering, an affinity matrix is produced which, for a set of n datapoints, is an n × n matrix of pairwise affinities, where affinity is given by some distance (or similarity) measure. Then, from this affinity matrix, PCA is used to extract the eigenvectors with the largest eigenvalues, and the data is projected onto the new space defined by PCA. The data will be more clearly separated in this new representation, such that conventional clustering methods (e.g. K-means) can be applied.

More formally: spectral clustering generates a graph of the datapoints, with edges weighted by the distances between the points. Then the Laplacian of the graph is produced: given the adjacency matrix A and the degree matrix D of a graph G of n vertices, the Laplacian matrix L (n × n) is simply L = D − A.

As a reminder:

• the adjacency matrix A is an n × n matrix where the element A_{i,j} is 1 if an edge exists between vertices i and j and 0 otherwise.
• the degree matrix D is an n × n diagonal matrix where the element D_{i,i} is the degree of vertex i.

Then the eigenvectors of the Laplacian are computed to find an embedding of the graph into Euclidean space.
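As a small sketch of this last step, here is how the Laplacian and the spectral embedding could be computed, assuming we are handed a 0/1 adjacency matrix for the data graph (how that graph is built from pairwise affinities is left aside):

import numpy as np

def spectral_embedding(A, dim=2):
    """A: symmetric 0/1 adjacency matrix of the data graph."""
    D = np.diag(A.sum(axis=1))        # degree matrix
    L = D - A                         # (unnormalized) graph Laplacian
    # eigh returns eigenvalues in ascending order; for the Laplacian it is the
    # eigenvectors with the *smallest* eigenvalues that carry the cluster structure
    # (the largest-eigenvalue convention above applies to the affinity matrix).
    eigvals, eigvecs = np.linalg.eigh(L)
    return eigvecs[:, 1:dim + 1]      # skip the trivial constant eigenvector

# toy graph: two triangles joined by a single edge
A = np.array([
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
])
print(spectral_embedding(A, dim=1).round(2))  # the two triangles separate along this coordinate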
Then some clustering algorithm (typically K-Means) is run on the data in this transformed space. Spectral clustering enhances clustering algorithms which assume globular clusters in that its space transformation of the data causes non-globular data to be globular in the transformed space. However, the graph transformation slows things down. 18.2.5 Mean Shift Clustering Mean shift clustering extends KDE one step further: the data points iteratively hill-climb to the peak of nearest KDE surface. As a parameter to the kernel density estimates, you need to specify a bandwidth - this will affect the KDEs and their peaks, and thus it will affect the clustering results. You do not, however, need to specify the number of clusters. CHAPTER 18. UNSUPERVISED LEARNING 531 18.2. CLUSTERING 532 Below are some examples of different bandwidth results (source). You also need to make the choice of what kernel to use. Two commonly used kernels are: • Flat kernel: 532 CHAPTER 18. UNSUPERVISED LEARNING 533 18.2. CLUSTERING 1 K(x) = if ||x|| ≤ 1 0 otherwise • Gaussian kernel Gaussian kernel Mean shift is slow (O(N 2 )). 18.2.6 Non-Negative Matrix Factorization (NMF) NMF is a particular matrix factorization in which each element of V is ≥ 0 (a non-negative constraint), and results in factor matrices W and H such that each of their elements are also ≥ 0. Non-negative matrix factorization (By Qwertyus, CC BY-SA 3.0, via Wikimedia Commons) CHAPTER 18. UNSUPERVISED LEARNING 533 18.2. CLUSTERING 534 Each column vi in V can be calculated from W and H like so (where hi is a column in H): vi = W hi NMF can be used for clustering; it has a consequence of naturally clustering the columns of V . It is also useful for reducing (i.e. compressing) the dimensionality of a dataset, in particular, it reduces it into a linear combination of bases. If you add an orthogonality constraint, i.e. HH T = I, if the value at Hkj > 0, then the jth column of V , that is, vj , belongs to the cluster k. Matrix factorization Let V be an m × n matrix of rank r . Then there is an m × r matrix W and an r × n matrix H such that V = W H. So we can factorize (or decompose) V into W and H. This matrix factorization can be seen as a form of compression (for low rank matrices, at least) - if we were to store V on its own, we have to store m × n elements, but if we store W and H separately, we only need to store m × r + r × n elements, which will be smaller than m × n for low rank matrices. Note that this kind of factorization can’t be solved analytically, so it is usually approximated numerically (there are a variety of algorithms for doing so). 18.2.7 DBSCAN DBSCAN transforms the space according to density, then identifies for dense regions as clusters by using single linkage clustering. Sparse points are considered noise - not all points are forced to have cluster assignment. DBSCAN handles non-globular clusters well, provided they have consistent density - it has some trouble with variable density clusters (they may be split up into multiple clusters). 18.2.8 HDBSCAN HDBSCAN is an improvement upon DBSCAN which can handle variable density clusters, while preserving the scalability of DBSCAN. DBSCAN’s epsilon parameter is replaced with a “min cluster size” parameter. HDBSCAN uses single-linkage clustering, and a concern with single-linkage clustering is that some errant point between two clusters may accidentally act as a bridge between them, such that they are identified as a single cluster. 
HDBSCAN avoids this by first transforming the space in such a way that sparse points (these potentially troublesome noise points) are pushed further away. To do this, we first define a distance called the core distance, corek (x), which is point x’s distance from its kth nearest neighbor. 534 CHAPTER 18. UNSUPERVISED LEARNING 535 18.2. CLUSTERING Then we define a new distance metric based on these core distances, called mutual reachability distance. The mutual reachability distance dmreach−k between points a and b is the furthest of the following points: corek (a), corek (b), d(a, b), where d(a, b) is the regular distance metric between a and b. More formally: dmreach−k (a, b) = max(corek (a), corek (b), d(a, b)) For example, if k = 5: Then we can pick another point: And another point: Say we want to compute the mutual reachability distance between the blue b and green g points. First we can compute d(b, g): Which is larger than corek (b), but both are smaller than corek (g). So the mutual reachability distance between b and g is corek(g): On the other hand, the mutual reachability distance between the red and green points is equal to d(r, g) because that is larger than either of their core distances. We build a distance matrix out of these mutual reachability distances; this is the transformed space. We can use this distance matrix to represent a graph of the points. We want to construct a minimum spanning tree out of this graph. CHAPTER 18. UNSUPERVISED LEARNING 535 18.2. CLUSTERING 536 536 CHAPTER 18. UNSUPERVISED LEARNING 537 CHAPTER 18. UNSUPERVISED LEARNING 18.2. CLUSTERING 537 18.2. CLUSTERING 538 As a reminder, a spanning tree of a graph is any subgraph which contains all vertices and is a tree (a tree is a graph where vertices are connected by only one path; i.e. it is a connected graph - all vertices are connected - but there are no cycles). The weight of a tree is the sum of its edges’ weights. A minimum spanning tree is a spanning tree with the least (or equal to least) weight. The minimum spanning tree of this graph can be constructed using Prim’s algorithm. From this spanning tree, we then want to create the cluster hierarchy. This can be accomplished by sorting edges from closest to furthest and iterating over them, creating a merged cluster for each edge. (A note from the original post which I don’t understand yet: “The only difficult part here is to identify the two clusters each edge will join together, but this is easy enough via a union-find data structure.”) Given this hierarchy, we want a set of flat clusters. DBSCAN asks you to specify the number of clusters, but HDBSCAN can independently discover them. It does require, however, that you specify a minimum cluster size. In the produced hierarchy, it is often the case that a cluster splits into one large subcluster and a few independent points. Other times, the cluster splits into two good-sized clusters. The minimum cluster size makes explicit what a “good-sized” cluster is. If a cluster splits into clusters which are at or above the minimum cluster size, we consider them to be separate clusters. Otherwise, we don’t split the cluster (we treat the other points as having “fallen out of” the parent cluster) and just keep the parent cluster intact. However, we keep track of which points have “fallen out” and at what distance that happened. This way we know at which distance cutoffs the cluster “sheds” points. We also keep track at what distances a cluster split into its children clusters. 
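As a side illustration before moving on to how the hierarchy is cleaned up, here is a small sketch of the core distance and mutual reachability distance defined above (Euclidean distance and k = 5 are assumptions):

import numpy as np

def core_distance(X, i, k=5):
    # distance from point i to its k-th nearest neighbor
    dists = np.sort(np.linalg.norm(X - X[i], axis=1))
    return dists[k]  # dists[0] is the point's distance to itself

def mutual_reachability(X, i, j, k=5):
    # d_mreach-k(a, b) = max(core_k(a), core_k(b), d(a, b))
    d = np.linalg.norm(X[i] - X[j])
    return max(core_distance(X, i, k), core_distance(X, j, k), d)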
Using this approach, we “clean up” the hierarchy. We use the distances at which a cluster breaks up into subclusters to measure the persistence of a 1 cluster. Formally, we think in terms of λ = distance . We define for each cluster a λbirth , which is the distance at which this cluster’s parent split to yield this cluster, and a λdeath , which is the distance at which this cluster itself split into subclusters (if it does eventually split into subclusters). Then, for each point p within a cluster, we define λp to be when that point “fell out” of the cluster, which is either somewhere in between λbirth , λdeath , or, if the point does not fall out of the cluster, it is just λdeath (that is, it falls out when the cluster itself splits). The stability of a cluster is simply: ∑ (λp − λbirth ) p∈cluster Then we start with all the leaf nodes and select them as clusters. We move up the tree and sum the stabilities of each cluster’s child clusters. Then: 538 CHAPTER 18. UNSUPERVISED LEARNING 539 18.3. REFERENCES • If the sum of cluster’s child stabilities greater than its own stability, then we set its stability to be the sum of its child stabilities. • If the sum of a cluster’s child stabilities is less than its own stability, then we select the cluster and unselect its descendants. When we reach the root node, return the selected clusters. Points not in any of the selected clusters are considered noise. As a bonus: each λp in the selected clusters can be treated as membership strength to the cluster if we normalize them. 18.2.9 CURE (Clustering Using Representatives) If you are dealing with more data than can fit into memory, you may have issues clustering it. A flexible clustering algorithm (there are no restrictions about the shape of the clusters it can find) which can handle massive datasets is CURE. CURE uses Euclidean distance and generates a set of k representative points for each clusters. It uses these points to represent clusters, therefore avoiding the need to store every datapoint in memory. CURE works in two passes. For the first pass, a random sample of points from the dataset are chosen. The more samples the better, so ideally you choose as many samples as can fit into memory. Then you apply a conventional clustering algorithm, such as hierarchical clustering, to this sample. This creates an initial set of clusters to work with. For each of these generated clusters, we pick k representative points, such that these points are as dispersed as possible within the cluster. For example, say k = 4. For each cluster, pick a point at random, then pick the furthest point from that point (within the same cluster), then pick the furthest point (within the same cluster) from those two points, and repeat one more time to get the fourth representative point. Then copy each representative point and move that copy some fixed fraction (e.g. 0.2) closer to the cluster’s centroid. These copied points are called “synthetic points” (we use them so we don’t actually move the datapoints themselves). These synthetic points are the representatives we end up using for each cluster. For the second pass, we then iterate over each point p in the entire dataset. We assign p to its closest cluster, which is the cluster that has the closest representative point to p. 18.3 References • How HDBSCAN Works. Leland McInnes. • Thoughtful Machine Learning. Matthew Kirk. 2015. • Comparing Python Clustering Algorithms. Leland McInnes. CHAPTER 18. UNSUPERVISED LEARNING 539 18.3. 
REFERENCES 540 • CS188: Artificial Intelligence. Dan Klein, Pieter Abbeel. University of California, Berkeley (edX). • Mean Shift Clustering. Matt Nedrich. • Mining Massive Datasets (Coursera & Stanford, 2014). Jure Leskovec, Anand Rajaraman, Jeff Ullman. • Example of matrix factorization.. MH1200. • Non-negative matrix factorization. Wikipedia. 540 CHAPTER 18. UNSUPERVISED LEARNING 541 19 In Practice 19.1 Machine Learning System Design Before you start building your machine learning system, you should: • Be explicit about the problem. – Start with a very specific and well-defined question: what do you want to predict, and what do you have to predict it with? • Brainstorm some possible strategies. – What features might be useful? – Do you need to collect more data? • Try and find good input data – Randomly split data into: * training sample * testing sample * if enough data, a validation sample too • Use features of or features built from the data that may help with prediction Then to start: • Start with a simple algorithm which can be implemented quickly. – Apply a machine learning algorithm – Estimate the parameters for the algorithm on your training data • Test the simple algorithm on your validation data, evaluate the results CHAPTER 19. IN PRACTICE 541 19.2. MACHINE LEARNING DIAGNOSTICS 542 • Plot learning curves to decide where things need work: – Do you need more data? – Do you need more features? – And so on. • Error analysis: manually examine the examples in the validation set that your algorithm made errors on. Try to identify patterns in these errors. Are there categories of examples that the model is failing on in particular? Are there any other features that might help? If you have an idea for a feature which may help, it’s best to just test it out. This process is much easier if you have a single metric for your model’s performance. 19.2 Machine learning diagnostics In machine learning, a diagnostic is: A test that you can run to gain insight [about] what is/isn’t working with a learning algorithm, and gain guidance as to how best to improve its performance. They take time to implement but can save you a lot of time by preventing you from going down fruitless paths. 19.2.1 Learning curves To generate a learning curve, you deliberately shrink the size of your training set and see how the training and validation errors change as you increase the training set size. This way you can see how your model improves (or doesn’t, if something unexpected is happening) with more training data. With smaller training sets, we expect the training error will be low because it will be easier to fit to less data. So as training set size grows, the average training set error is expected to grow. Conversely, we expect the average validation error to decrease as the training set size increases. If it seems like the training and validation error curves are flattening out at a high error as training set size increases, then you have a high bias problem. The curves flattening out indicates that getting more training data will not (by itself) help much. On the other hand, high variance problems are indicated by a large gap between the training and validation error curves as training set size increases. You would also see a low training error. In this case, the curves are converging and more training data would help. 542 CHAPTER 19. IN PRACTICE 543 19.2. MACHINE LEARNING DIAGNOSTICS Training figures (from Xiu-Shen Wei) CHAPTER 19. IN PRACTICE 543 19.3. 
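A sketch of generating a learning curve as described above. The training routine is left as a placeholder callback, since any model with a train/evaluate step would do:

import numpy as np

def learning_curve(train_and_eval, X_train, y_train, X_val, y_val, sizes):
    """train_and_eval(X, y, X_eval, y_eval) -> error; a placeholder for whatever
    model training routine you are diagnosing."""
    train_errors, val_errors = [], []
    for m in sizes:
        Xs, ys = X_train[:m], y_train[:m]        # deliberately shrink the training set
        train_errors.append(train_and_eval(Xs, ys, Xs, ys))      # error on the data it was fit to
        val_errors.append(train_and_eval(Xs, ys, X_val, y_val))  # error on held-out data
    return train_errors, val_errors

# Reading the curves: both flattening out at a high error suggests high bias
# (more data alone won't help); a persistent gap with low training error
# suggests high variance (more data should help).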
LARGE SCALE MACHINE LEARNING 19.2.2 544 Important training figures 19.3 Large Scale Machine Learning 19.3.1 Map Reduce You can distribute the workload across computers to reduce training time. For example, say you’re running batch gradient descent with b = 400. θj := θj − α 4 1 ∑ 00i=1 (hθ (x (i) ) − y (i) )xj(i) 400 You can divide up (map) your batch so that different machines calculate the error of a subset (e.g. with 4 machines, each machine takes 100 examples) and then those results are combined (reduced/summed) back on a single machine. So the summation term becomes distributed. Map Reduce can be applied wherever your learning algorithm can be expressed as a summation over your training set. Map Reduce also works across multiple cores on a single computer. 19.4 Online (live/streaming) machine learning 19.4.1 Distribution Drift Say you train a model on some historical data and then deploy your model in a production setting where it is working with live data. It is possible that the distribution of the live data starts to drift from the distribution your model learned. This change may be due to factors in the real world that influence the data. Ideally you will be able to detect this drift as it happens, so you know whether or not your model needs adjusting. A simple way to do it is to continually evaluate the model by computing some validation metric on the live data. If the distribution is stable, then this validation metric should remain stable; if the distribution drifts, the model starts to become a poor fit for the new incoming data, and the validation metric will worsen. 19.5 References • Review of fundamentals, IFT725. Hugo Larochelle. 2012. • Exploratory Data Analysis Course Notes. Xing Su. 544 CHAPTER 19. IN PRACTICE 545 19.5. REFERENCES • Mining Massive Datasets (Coursera & Stanford, 2014). Jure Leskovec, Anand Rajaraman, Jeff Ullman. • Machine Learning. 2014. Andrew Ng. Stanford University/Coursera. • CS188: Artificial Intelligence. Dan Klein, Pieter Abbeel. University of California, Berkeley (edX). • Evaluating Machine Learning Models. Alice Zheng. 2015. • Computational Statistics II (code). Chris Fonnesbeck. SciPy 2015. • Intro to Artificial Intelligence. CS271. Peter Norvig, Sebastian Thrun. Udacity. • MIT 6.034 (Fall 2010): Artificial Intelligence. Patrick H. Winston. MIT. • Deep Learning. Yoshua Bengio, Ian Goodfellow, Aaron Courville. • CS231n Convolutional Neural Networks for Visual Recognition, Module 1: Neural Networks Part 2: Setting up the Data and the Loss. Andrej Karpathy. • POLS 509: Hierarchical Linear Models. Justin Esarey. • Bayesian Inference with Tears. Kevin Knight, September 2009. • Learning to learn, or the advent of augmented data scientists. Simon Benhamou. • Practical Bayesian Optimization of Machine Learning Algorithms. Jasper Snoek, Hugo Larochelle, Ryan P. Adams. • What is the expectation maximization algorithm?. Chuong B Do & Serafim Batzoglou. • Gibbs Sampling for the Uninitiated. Philip Resnik, Eric Hardisty. June 2010. • Maximum Likelihood Estimation. Penn State Eberly College of Science. • Data Science Specialization. Johns Hopkins (Coursera). 2015. • Practical Machine Learning. Johns Hopkins (Coursera). 2015. • Elements of Statistical Learning. 10th Edition. Trevor Hastie, Robert Tibshirani, Jerome Friedman. • CS231n Convolutional Neural Networks for Visual Recognition, Linear Classification. Andrej Karpathy. • Must Know Tips/Tricks in Deep Neural Networks. Xiu-Shen Wei. CHAPTER 19. IN PRACTICE 545 19.5. REFERENCES 546 546 CHAPTER 19. 
IN PRACTICE 547 Part III Artificial Intelligence 547 549 19.6. STATE-SPACE AND SITUATION-SPACE REPRESENTATIONS # Search 19.6 State-space and situation-space representations In artificial intelligence, problems are often represented using the state-space representation (sometimes called a state-transition system), in which the possible states of the problem and the operations that move between them are represented as a graph or a tree: • • • • Nodes are (abstracted) world configurations (states) Arcs represent successors (action results) A goal test is a set of goal nodes (which may just include a single goal) Each state occurs only once as a node More formally, we consider a problem to have a set of possible starting states S, a set of operators F which can be applied to the states, and a set of goal states G. A solution to a problem formalized in this way, called a procedure, consists of a starting state s ∈ S and a sequence of operators that define a path from s to a state in G. Typically a problem is represented as a tuple of these values, (S, F, G). The distinction between state-space and situation-space is as follows: if the relevant parts of the problem are fully-specified (fully-known), then we work with states and operators, and have a statespace problem. If there is missing information (i.e., the problem is partially-specified), then we work with situations and actions (note that operators are often referred to actions in state-space as well), and we have a situation-space problem. Most of what is said for state-space problems is applicable to situation-space problems. For now we will focus on state-space problems. This state-space model can be applied to itself, in such that a given problem can be decomposed into subproblems (also known as subgoals); the relationships between the problem and its subproblems (and their subproblems’ subproblems, etc) are also represented as a graph. Successor relationships can be grouped by AND or OR arcs which group edges together. A problem node with subproblems linked by AND edges must have all of the grouped subproblems resolved; a problem with subproblems linked by OR edges must have only one of the subproblems resolved. Using this graph, you can identify a path of subproblems which can be used to solve the primary problem. This process is known as problem reduction. We take this state-space representation as the basis for a search problem. 19.6.1 Search problems (planning) A search problem consists of: • an initial state • a set of possible actions/applicability conditions 549 19.6. STATE-SPACE AND SITUATION-SPACE REPRESENTATIONS 550 • a successor function: from a state to a set of (action, state) • the successor function plus the initial state is the state space (which is a directed graph as described before) • a path (i.e. a solution) • a goal (a goal state or a goal test function) • a path cost function (for optimality, generally it is the sum of the step costs) To clarify some terminology: • if node A leads to node B, then node A is a parent of B and B is a successor or child of A – if the edge connecting A to B is due to an operator q, we say that “B is a successor to A under the operator q”. • if a node has no successors, it is a terminal • if there is a path between node A and node C such that node A is a parent of a parent … of a parent of C, then A is an ancestor of C and C is a descendant of A. – if the graph is cyclical, e.g. there is a path from A through C back to A, then A is both an ancestor and a descendant of C. 
Practically, we may use a data structure for nodes that encapsulate the following information for the node: • • • • • state - a state in the state space parent node - the immediate predecessor in the search tree (only the root node has no parent) action - the action that, when performed in the parent node’s state, leads to this node’s state path cost - the path cost leading to this node depth - the depth of this node in the search tree In the context of artificial intelligence, a path through state-space is called a plan - search is fundamental to planning (other aspects of planning are covered in more detail later). 19.6.2 Problem formulation Problem formulation can itself be a problem, as it typically is with real-world problems. We have to consider how granular/abstract we want to be and what actions and states to include. To make this a bit easier, we typically make the following assumptions about the environment: • • • • finite and discrete fully observable deterministic static (no events) And other assumptions are typically included as well: 550 551 • • • • 19.7. SEARCH ALGORITHMS restricted goals sequential plans (no parallel activity in plans) implicit time (activities do not have a duration) offline planning (the state transition system is not changing while we plan) 19.6.3 Trees In practice, we rarely build the full state-space graph in memory (because it is often way too big). Rather, we work with trees. Trees have a few constraints: • only one node does not have a parent: the root node. • every other node in the tree is a descendant of the root node • every other node has only one parent An additional term relevant to trees is depth, which is the number of ancestors a node has. The root node is the current state and branches out into possible future states (i.e. the children are successors). Given a tree with branching factor b and maximum depth m, there are O(bm ) nodes in the tree. These trees can get quite big, so often we can’t build the full tree either (it would be infinite if there are circular paths in the state space graph). Thus we only build sections that we are immediately concerned with. To build out parts of the tree we are interested in, we take a node and apply a successor function (sometimes called a generator function) to expand the node, which gives us all of that node’s successors (children). There is also often a lot of repetition in search trees, which some search algorithm enhancements take advantage of. 19.7 Search algorithms We apply algorithms to this tree representation in order to identify paths (ideally the optimal path) from the root node (start state) to a goal node (a goal state, of which there may be many). Most search algorithms share some common components: • a fringe (sometimes called a frontier) of unexplored nodes is maintained • some process for deciding which nodes to expand The general tree search algorithm is as follows: • initialize the fringe with a search node for the initial state 551 19.8. UNINFORMED SEARCH • • • • • 552 iteratively: if the fringe is empty, return a failure otherwise, select a node from the fringe based on the current search strategy if this node’s state passes the goal test (or is the goal state), return the path to this node otherwise, expand the fringe with this node’s children (successors) Most search algorithms are based on this general structure, varying in how they choose which node to expand from the fringe. 
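A sketch of this general tree search, with the node bookkeeping described above and the fringe strategy left as a parameter; the problem interface (a successors function and a goal test) is an assumption for illustration:

from collections import namedtuple

Node = namedtuple("Node", ["state", "parent", "action", "path_cost", "depth"])

def tree_search(start, successors, is_goal, choose):
    """successors(state) -> iterable of (action, next_state, step_cost);
    choose(fringe) -> index of the node to expand next (the search strategy)."""
    fringe = [Node(start, None, None, 0.0, 0)]
    while fringe:
        node = fringe.pop(choose(fringe))          # the strategy decides which node to expand
        if is_goal(node.state):
            path = []                              # walk parent pointers to recover the path
            while node is not None:
                path.append(node.state)
                node = node.parent
            return list(reversed(path))
        for action, nxt, cost in successors(node.state):
            fringe.append(Node(nxt, node, action, node.path_cost + cost, node.depth + 1))
    return None  # fringe exhausted: failure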
When considering search algorithms, we care about: • completeness - is it guaranteed to find a solution if one exists? • optimal - is it guaranteed to find the optimal solution, if one exists? • size complexity - how much space does the algorithm need? Basically, how big can the fringe get? • time complexity - how does runtime change with input size? Basically, how many nodes get expanded? These search algorithms discussed here are unidirectional, since they only expand in one direction (from the start state down, or, in some cases, from the terminal nodes up). However, there are also bidirectional search procedures which start from both the start state and from the goal state. They can be difficult to use, however. 19.8 Uninformed search Uninformed search algorithms, sometimes called blind search algorithms, vary in how they decide which node to expand. Consider the following search space, where S is our starting point and G is our goal: Example search space 19.8.1 Exhaustive (“British Museum”) search Exhaustively search all paths (without revisiting any previously visited points) - it doesn’t really matter how you decide which node to expand because they will all be expanded. 552 553 19.8. UNINFORMED SEARCH “British Museum” search 553 19.8. UNINFORMED SEARCH 19.8.2 • • • • 554 Depth-First Search (DFS) time complexity: expands O(bm ) nodes (if m is finite) size complexity: the fringe takes O(bm) space complete if m is not infinite (i.e. if there are no cycles) optimal: no, it finds the “leftmost” solution Go down the left branch of the tree (by convention) until you can’t go any further. If that is not your target, then backtrack - go up to the closest branching node and take the other leftmost path. Backtracking is a technique that appears in almost every search algorithm, where we try extending a path, and if the extension fails or is otherwise unsatisfactory, we take a step back and try a different successor. Repeat until you reach your target. Depth-first search It stops just on the first complete path, which may not be the optimal path. Another way to think about depth-first search is with a queue (LIFO) which holds your candidate paths as you construct them. Your starting “path” includes just the starting point: [(S)] Then on each iteration, you take the left-most path (which is always the first in the queue) and check if it reaches your goal. If it does not, you extend it to build new paths, and replace it with those new paths. [(SA), (SB)] On this next iteration, you again take the left-most path. It still does not reach your goal, so you extend it. And so on: [(SABC), (SAD), (SB)] [(SABCE), (SAD), (SB)] 554 555 19.8. UNINFORMED SEARCH You can no longer extend the left-most path, so just remove it from the queue. [(SAD), (SB)] Then keep going. 19.8.3 • • • • Breadth-First Search (BFS) time complexity: expands O(bs ) nodes, where s is the depth of the shallowest solution size complexity: the fringe takes O(bs ) space complete: yes optimal: yes, if all costs are 1, otherwise, a deeper path could have a cheaper cost Build out the tree level-by-level until you reach your target. Breadth-first search In the queue representation, the only thing that is different from depth-first is that instead of placing new paths at the front of the queue, you place them at the back. Another way of putting this is that instead of a LIFO data structure for its fringe (as is used with DFS), BFS uses a FIFO data structure for its fringe. 
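In terms of the tree-search sketch above, DFS and BFS differ only in which end of the fringe they take nodes from (LIFO vs. FIFO):

# With the tree_search sketch earlier, the strategy is just which fringe index to pop:
dfs_choice = lambda fringe: len(fringe) - 1   # LIFO: expand the most recently added node
bfs_choice = lambda fringe: 0                 # FIFO: expand the oldest node on the fringe

# e.g. path = tree_search(start, successors, is_goal, bfs_choice)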
19.8.4 Uniform Cost Search We can make breadth-first search sensitive to path cost with uniform cost search (also known as Dijkstra’s algorithm), in which we simply prioritize paths by their cost g(n) (that is, the distance from the root node to n) rather than by their depth. • time complexity: If we say the solution costs C∗ and arcs cost at least ϵ, then the “effective C∗ depth” is roughly /epsilon , so the time complexity is O(bC∗/ϵ ) • size complexity: the fringe takes O(bC∗/ϵ ) space • complete: yes if the best solution has finite cost and minimum arc cost is positive • optimal: yes 555 19.9. SEARCH ENHANCEMENTS 19.8.5 556 Branch & Bound On each iteration, extend the shortest cumulative path. Once you reach your goal, extend every other extendible path to check that its length ends up being longer than your current path to the goal. The fringe is kept sorted so that the shortest path is first. Branch and bound search This approach can be quite exhaustive, but it can be improved by using extended list filtering. 19.8.6 Iterative deepening DFS The general idea is to combine depth-first search’s space advantage with breadth-first search’s time/shallow-solution advantages. • • • • Run depth-first search with depth limit 1 If no solution: Run depth-first search with depth limit 2 If no solution: – Run depth-first search with depth limit 3 (etc) 19.9 Search enhancements 19.9.1 Extended list filtering Extended list filtering involves maintaining a list of visited nodes and only expanding nodes in the fringe if they have not already been expanded - it would be redundant to search again from that node. For example, branch and bound search can be combined with extended list filtering to make it less exhaustive. 556 557 19.10. INFORMED (HEURISTIC) SEARCH Branch and bound search w/ extended list 19.10 Informed (heuristic) search Informed search algorithms improve on uninformed search by incorporating heuristics which tell us whether or not we’re getting closer to the goal. With heuristics, we can search less of the search space. In particular, we want admissible heuristics, which is simply a heuristic that never overestimates the distance to the goal. Formally, we can define the admissible heuristic as: H(x, G) ≤ D(x, G) That is, a node is admissible if the estimated distance H(x, G) between and node x and the goal G is less than or equal to the actual distance D(x, G) between the node and the goal. Note that sometimes inadmissible heuristics (i.e. those that sometimes overestimate the distance to the goal) can still be useful. The specific heuristic function is chosen depending on the particular problem (i.e. we estimate the distance to the goal state differently in different problems, for instance, with a travel route, we might estimate the cost with linear distance to the target city). The typical trade-off with heuristics is between simplicity/efficiency and accuracy. The question of finding good heuristics, and doing so automatically, has been a big topic in AI planning recently. 19.10.1 Greedy best-first search Best-first search algorithms are those that select the next node from the fringe by argminn f (n), where f (n) is some evaluation function. With greedy best-first search, the fringe is kept sorted by heuristic distance to the goal; that is, f (n) = h(n). This often ends up with a suboptimal path, however. 557 19.10. 
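Uniform cost search and greedy best-first search can both be seen as keeping the fringe in a priority queue ordered by some f(n). A sketch, reusing the successors/goal-test conventions assumed earlier:

import heapq
import itertools

def best_first_search(start, successors, is_goal, f):
    """f(g, state) -> priority, where g is the path cost so far.
    f = g gives uniform cost search; f = h(state) gives greedy best-first search."""
    counter = itertools.count()  # tie-breaker so heapq never has to compare states
    fringe = [(f(0.0, start), next(counter), 0.0, start, [start])]
    while fringe:
        _, _, g, state, path = heapq.heappop(fringe)   # lowest f(n) first
        if is_goal(state):
            return path
        for action, nxt, cost in successors(state):
            g2 = g + cost
            heapq.heappush(fringe, (f(g2, nxt), next(counter), g2, nxt, path + [nxt]))
    return None

# f = lambda g, s: g        -> uniform cost search (Dijkstra's algorithm)
# f = lambda g, s: h(s)     -> greedy best-first search (h = heuristic distance to goal)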
INFORMED (HEURISTIC) SEARCH 19.10.2 558 Beam Search Beam search is essentially breadth-first search, but we set a beam width w which is the limit to the number of paths you will consider at any level. This is typically a low number like 2, but can be iteratively expanded (similar to iterative deepening) if necessary. Beam search The fringe is the same as in breadth-first search, but we keep only the w best paths as determined by the heuristic distance. Beam search is not complete, unless the iterative approach is used. 19.10.3 A* Search A* is an extension of branch & bound search which includes (admissible) heuristic distances in its sorting. We define g(n) as the known distance from the root to the node n (this is what we sort the fringe by in branch & bound search). We additionally define h(n) as the admissible heuristic distance from the node n to a goal node. With A* search, we simply sort the fringe by g(n) + h(n). That is, A* search is a best-first search algorithm where f (n) = g(n) + h(n). A* search is optimal if h(n) is admissible; that is, it never overestimates the distance to the goal. It is complete as well; i.e. if a solution exists, A* will find it. A* is also optimally efficient (with respect to the number of expanded nodes) for a given heuristic function. That is, no other optimal algorithm is guaranteed to expand fewer nodes than A*. Uniform cost search is a special case of A* where h(n) = 0, i.e. f (n) = g(n). The downside of A* vs greedy best-first search is that it can be slower since it explores the space more thoroughly - it has worst case time and space complexity of O(B l ), where b is the branching factor (the number of successors per node on average) and l is the length of the path we’re looking for. Typically we are dealing with the worst case; the fringe usually grows exponentially. Sometimes the time complexity is permissible, but the space complexity is problematic because there may simply not be enough memory for some problems. 558 559 19.11. LOCAL SEARCH There is a variation of A* called iterative deepening A* (IDA*) which uses significantly less memory. 19.10.4 Iterative Deepening A* (IDA*) Iterative deepening A* is an extension of A* which uses an iterative approach, searching up to a distance g(x) and increasing that distance until a solution is found. 19.11 Local search Local search algorithms do not maintain a fringe; that is, we don’t keep track of unexplored alternatives. Rather, we continuously try to improve a single option until we can’t improve it anymore. Instead of extending a plan, the successor function in local search takes an existing plan and just modifies a part of it. Local search is generally much faster and more memory efficient, but because it does not keep track of unexplored alternatives, it is incomplete and suboptimal. 19.11.1 Hill-Climbing A basic method in local search is hill climbing - we choose a starting point, move to the best neighboring state (i.e. closest as determined by the heuristic), and repeat until there are no better positions to move to - we’ve reached the top of hill. As mentioned, this is incomplete and suboptimal, as it can end up in local maxima. Hill-climbing search The difference between hill climbing and greedy search is that with greedy search, the entire fringe is sorted by heuristic distance to the goal. With hill climbing, we only sort the children of the currently expanded node, choosing the one closest to the goal. 559 19.12. 
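A minimal hill-climbing sketch: from the current state, move to the best neighbor (the one with the lowest heuristic distance to the goal) and stop when no neighbor improves. The neighbors and heuristic functions are assumed to be problem-specific:

def hill_climb(start, neighbors, h):
    """neighbors(state) -> iterable of states; h(state) -> heuristic distance to the goal
    (lower is better). Returns a local optimum, not necessarily the global one."""
    current = start
    while True:
        candidates = list(neighbors(current))
        if not candidates:
            return current
        best = min(candidates, key=h)
        if h(best) >= h(current):    # no neighbor improves: top of the (local) hill
            return current
        current = best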
GRAPH SEARCH 19.11.2 560 Other local search algorithms You can also use simulated annealing (detailed elsewhere) to try to escape local maxima - and this helps, and has a theoretical guarantee that it will converge to the optimal state given infinite time, but of course, this is not a practical guarantee for real-world applications. So simulated annealing in practice can do better but still can end up in local optima. You can also use genetic algorithms (detailed elsewhere). 19.12 Graph search Up until now we have considered search algorithms in the context of trees. With search trees, we often end up with states repeated throughout the tree, which will have redundant subtrees, and thus end up doing (potentially a lot of) redundant computation. Instead, we can consider the search space as a graph. Graph search algorithms are typically just slight modifications of tree search algorithms. One main modification is the introduction of a list of explored (or expanded) nodes, so that we only expand states which have not already expanded. Completeness is not affected by graph search, but it is not optimal. We may close off a branch because we have already expanded that state elsewhere, but it’s possible that the shortest path still goes through that state. Graph search algorithms (such as the graph search version of A*) can be made optimal through an additional constraint to admissible heuristics: consistency. 19.12.1 Consistent heuristics The main idea of consistency is that the estimated heuristic costs should be less than or equal to the actual costs for each arc between any two nodes, not just between any node and the goal state: |H(x, G) − H(y , G)| ≤ D(x, y ) That is, the absolute value of the difference between the estimated distance between a node x and the goal and the estimated distance between a node y and the goal is less than or equal to the distance between the nodes x and y . Consistency enforces this for any two nodes, which includes the goal node, so consistency implies admissibility. 19.13 Adversarial search (games) Adversarial search is essentially search for games involving two or more players. 560 561 19.13. ADVERSARIAL SEARCH (GAMES) There are many kinds of games - here we primarily consider games that are: • • • • deterministic (sometimes called “non-chance”) two-player turn-based zero-sum: agents have opposite utilities (one’s gain is another’s loss), also known as a “pure competition” or “strictly competitive” game. • perfect information: every player has full knowledge of what state the game is in and what actions are possible Note that while we are only considering zero-sum games, in general games agents have independent utilities, so there is opportunity for cooperation, indifference, competition, and so on. One way of formulating of games (there are many) is as a tree: • • • • • • states S, starting with s0 players P = {1, . . . , n}, usually taking turns actions A (may depend on player/state) transition function (analogous to a successor function), S × A → S terminal test (analogous to a goal test): S → {t, f } terminal utility function (computes how much an end/terminal state is worth to each player): S × P → R. For example, we may assign a utility of 100 for terminal states where we win, and -100 for terminal states where we lose. We want our adversarial search algorithm to return a strategy (a policy) which is essentially a function which returns an action given a state. 
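A recursive sketch of this propagation. Here the game tree is assumed to be given explicitly as nested lists whose leaves are terminal utilities from our perspective, with max (our turn) and min (opponent's turn) levels alternating:

def minimax(node, our_turn=True):
    # leaves are terminal utilities; internal nodes are lists of children
    if not isinstance(node, list):
        return node
    values = [minimax(child, not our_turn) for child in node]
    return max(values) if our_turn else min(values)

# a depth-2 tree: we move first (max), then the opponent (min)
tree = [[3, 12, 8], [2, 4, 6], [14, 5, 2]]
print(minimax(tree))  # -> 3: the opponent minimizes within each branch, we pick the best branch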
That is, a policy tells us what action take in a state - this is constrasted to a plan, which details a step-by-step procedure from start to finish. This is because we can’t plan on opponents acting in a particular way, so we need a strategy to respond to their actions. The solution then, for an adversarial search algorithm for a player is a policy S → A. 19.13.1 Minimax In minimax, we start at the bottom of the tree (where we have utilities computed terminal nodes), moving upwards. We propagate the terminal utilities through the graph up to the root node, propagating the utility at each depth that satisfies a particular criteria. For nodes at depths that correspond to the opponent’s turns, we assume that the opponent chooses their best move (that is, we assume they are a perfect adversary), which means we propagate the minimum utility for us. For nodes at depths that correspond to our turn, we want to choose our best move; that is, we propagate the maximum utility for us. The propagated utility is known as the backed-up evaluation. 561 19.13. ADVERSARIAL SEARCH (GAMES) 562 Minimax At the end, this gives us a utility for the root node, which gives us a value for the current state. Minimax is just like exhaustive depth-first search, so its time complexity is O(bm ) and space complexity is O(bm). Minimax is optimal against a perfect adversarial player (that is, an opponent that always takes their best action), but it is not otherwise. Depth-limited minimax Most interesting games have game trees far too deep to expand all the way to the terminal nodes. Instead, we can use depth-limited search to only go down a few levels. However, since we don’t reach the terminal nodes, their values never propagate up the tree. How will we compute the utility of any given move? We can introduce an evaluation function which computes a utility for non-terminal positions, i.e. it estimates the value of an action. For instance, with chess, you could just take the different of the number of your units vs the number of the opponent’s units. Generally moves that lower your opponent’s units is better, but not always. Iterative deepening minimax Often in games there are some time constraints - for instance, the computer opponent should respond within a reasonable amount of time. Iterative deepening can be applied to minimax, running for a set amount of time, and return the best policy found thus far. This type of algorithm is called an anytime algorithm because it has an answer ready at anytime. Generalizing minimax If the game is not zero-sum or has multiple players, we can generalize minimax as such: • terminal nodes have utility tuples • node values are also utility tuples 562 563 19.13. ADVERSARIAL SEARCH (GAMES) • each player maximizes their own component This can model cooperation and competition dynamically. 19.13.2 Alpha-Beta We can further improve minimax by pruning the game tree; i.e. removing branches we know won’t be worthwhile. This variation is known as alpha-beta search. Alpha-Beta Minimax Here we can look at branching and figure out a bound for describing its score. First we look at the left-most branch and see the value 2 in its left-most terminal node. Since we are looking for the min here, we know that the score for this branch node will be at most 2. If we then look at the other terminal node, we see that it is 7 and we know the branch node’s score is 2. At this point we can apply a similar logic to the next node up (where we are looking for the max). We know that it will be at least 2. 
So then we look at the next branch node and see that it will be at most 1. We don’t have to look at the very last terminal node because now we know that the max node can only be 2. So we have saved ourselves a little trouble. In larger trees this approach becomes very valuable, since you are effectively discounting entire branches and saving a lot of unnecessary computation. This allows you to compute deeper trees. Note that with alpha-beta, the minimax value computed for the root is always correct, but the values of intermediate nodes may be wrong, and as such, (naive) alpha-beta is not great for action selection. Good ordering of child nodes improves upon this. With a “perfect ordering”, time complexity drops to O(bm/2 ). Ordering Generally you want to generate game trees so that successors to each node are ordered left-to-right in descending order of their eventual backed-up evaluations (such an ordering is called the “correct” ordering). Naturally, it is quite difficult to generate this ordering before these evaluations have been computed. 563 19.14. NON-DETERMINISTIC SEARCH 564 Thus a plausible ordering must suffice. These are a few techniques for generating plausible orderings of nodes: • Generators first produce the most immediately desirable choices (though without regard to possible consequences further on) • Shallow search first generates some of the tree and then uses some static evaluation function and compute backed-up evaluations upwards to order the results. • Dynamic generation, in which alpha-beta is applied to identify plausible branches of the game tree, then branch is evaluated which can cause the ordering to change. 19.14 Non-deterministic search In many situations the outcomes of actions are uncertain. Another way of phrasing is that actions may be noisy. Like adversarial search, non-deterministic search solutions take the form of policies. 19.14.1 Expectimax search We can model uncertainty as a “dumb” adversary in a game. Whereas in minimax we assume a “smart” adversary, and thus consider worst-case outcomes (i.e. that the opponent plays their best move), with non-deterministic search, we instead consider average-case outcomes (i.e. expected utilities). This is called expectimax search. So instead of minimax’s min nodes, we have “chance” nodes, though we still keep max nodes. For a chance node, we compute its expected utility as the weighted (by probability) average of its children. Because we take the weighted average of children for a chance node’s utility, we cannot use alpha-beta pruning as we could with minimax. There could conceivably be an unexplored child which increases the expected utility enough to make that move ideal, so we have to explore all child nodes to be sure. Expectiminimax We can have games that involve adversaries and chance, in which case we would have both minimax layers and expectimax layers. This approach is called expectiminimax. 19.14.2 Monte Carlo Tree Search Say you are at some arbitrary position in your search tree (it could be the start or somewhere further along). You can treat the problem of what node to move to next as a multi-armed bandit problem and apply the Monte Carlo search technique. 564 565 19.14. NON-DETERMINISTIC SEARCH Multi-armed bandit Say you have multiple options with uncertain payouts. You want to maximize your overall payout, and it seems the most prudent strategy would be to identify the one option which consistently yields better payouts than the other options. 
However - how do you identify the best option, and do so quickly? This problem is known as the multi-armed bandit problem, and a common strategy is based on upper confidence bounds (UCB). To start, you randomly try the options and compute confidence intervals for each options’ payout: √ x̄i ± 2 ln(n) ni where: • x̄i is the mean payout for option i • ni is the number of times option i was chosen • n is the total number of trials You take the upper bound of these confidence intervals and continue to choose the option with the highest upper bound. As you use this option more, it’s confidence interval will narrow (since you have collected more data on it), and eventually another option’s confidence interval upper bound will be higher, at which point you switch to that option. Monte Carlo Tree Search At first, you have no statistical information about the child nodes to compute confidence intervals. So you randomly choose a child and run Monte Carlo simulations down that branch to see the outcomes. For each simulation run, you go along each node in the branch that was walked and increment its play count (i.e. number of trials) by 1, and if the outcome is a win, you increment its win count by 1 as well (this explanation assumes a game, but is generalizes to other cases). You repeat this until you have enough statistics for the direct child nodes of your current position to make a UCB choice as to where to move next. You will need to run less simulations over time because you accumulate these statistics for the search tree. First-Play Urgency (FPU) A variation of MCTS where fixed scores are assigned to unvisited nodes. 565 19.14. NON-DETERMINISTIC SEARCH 19.14.3 566 Markov Decision Processes (MDPs) MDPs are another way of modeling non-deterministic search. MDPs are essentially Markov models, but there’s a choice of action. In MDPs, there may be two types of rewards (which can be positive or negative): • terminal rewards (i.e. those that come at the end, these aren’t always present) • “living” rewards, which are given for each step (these are always present) For instance, you could imagine a maze arranged on a grid. The desired end of the maze has a positive terminal reward and a dead end of the maze has a negative terminal reward. Every nonterminal position in the maze also has a reward (“living” rewards) associated with it. Often these living rewards are negative so that each step is penalized, thus encouraging the agent to find the desired end in as few steps as possible. The agent doesn’t have complete knowledge of the maze so every action has an uncertain outcome. It can try to move north - sometimes it will successfully do so, but sometimes it will hit a wall and remain in its current position. Sometimes our agent may even move in the wrong direction (e.g. maybe a wheel gets messed up or something). This kind of scenario can be modeled as a Markov Decision Process, which includes: a set of states s ∈ S a set of actions a ∈ A a transition function T (s, a, s ′ ), sometimes called a state transition matrix gives the probability that a from s leads to s ′ , i.e. P (s ′ |s, a) also called the “model” or the “dynamics” a reward function R(s, a, s ′ ) (sometimes just R(s) or R(s ′ )), sometimes called a utility function, which associates a reward (or penalty) with each state • a discount γ • a start state • maybe a terminal state • • • • • • MDPs, as non-deterministic search problems, can be solved with expectimax search. 
MDPs are so named because we make the assumption that action outcomes depend only on the current state (i.e. the Markov assumption). The solution of an MDP is an optimal policy π∗ : S → A: • gives us an action to take for each state • an optimal policy maximizes expected utility if followed • an explicit policy defines a reflex agent 566 567 19.14. NON-DETERMINISTIC SEARCH In contrast, expectimax does not give us entire policies. Rather, it gives us an action for a single state only. It’s similar to a policy, but requires re-computing at each step. Sometimes this is fine because a problem may be too complicated to compute an entire policy anyways. The objective MDP is to maximize the expected sum of all future rewards, i.e. max(E[ ∞ ∑ Rt ]) t=0 Sometimes a discount factor γ ∈ [0, 1] is included, e.g. γ = 0.9, which decays future reward: max(E[ ∞ ∑ γ t Rt ]) t=0 Using this, we can define a value function V (s) for each state: V π (s) = E[ ∑ γ t Rt |s0 = s] t=0 That is, it is the expected sum of future discounted reward provided we start in state s with policy π. This can be computed empirically via simulations. In particular, we can use the value iteration algorithm. With value iteration, we recursively calculate the value function, starting from the goal states, to get the optimal value function, from which we can derive the optimal policy. More formally - we want to recursively estimate the value V (s) of a state s. We do this by estimating the value of possible successor states s ′ , discounting by γ, and incorporating the reward/cost of the state R(s ′ ), across possible actions from s. We take the maximum of these estimates. V (s) = max[γ ∑ a s′ P (s ′ |s, a)V (s ′ )] + R(s) This method is called back-up. In terminal states, we just set V (s) = R(s). We estimate these values over all our states - these estimates eventually converge. This function essentially defines the optimal policy - that is: π(s) = argmax a ∑ s′ P (s ′ |s, a)V (s ′ ) (since it’s maximization we can drop γ and R(s)) 567 19.14. NON-DETERMINISTIC SEARCH 568 Example: Grid World Note that the X square is a wall. Every movement has an uncertain outcome, e.g. if the agent moves to the east, it may only successfully do so with an 80% chance. For R(s) = −0.01: A B C D 0 → → → +1 1 ↑ X ← -1 2 ↑ ← ← ↓ At C1 the agent plays very conservatively and moves in the opposite direction of the negative terminal position because it can afford doing so many times until it accidentally randomly moves to another position. Similar reasoning is behind the policy at D2. For R(s) = −0.03: A B C D 0 → → → +1 1 ↑ X ↑ -1 2 ↑ ← ← ← With a stronger step penalty, the agent finds it better to take a risk and move upwards at C1, since it’s too expensive to play conservatively. Similar reasoning is behind the change in policy at D2. For R(s) = −2: A B C D 0 → → → +1 1 ↑ X → -1 2 → → → ↑ With such a large movement penalty, the agent decides it’s better to “commit suicide” by diving into the negative terminal node and end the game as soon as possible. 568 569 19.14. NON-DETERMINISTIC SEARCH q-states Each MDP state projects an expectimax-like search tree; that is, we build a search tree from the current state detailing what actions can be taken and the possible outcomes for each action. We can describe actions and states together as a q-state (s, a). When you’re in a state s and you take an action a, you end up in this q-state (i.e. 
you are committed to action a in state s) and the resolution of this q-state is described by the transition (s, a, s ′ ), described by the probability which is given by transition function T (s, a, s ′ ). There is also a reward associated with a transition, R(s, a, s ′ ), which may be positive or negative. Utility sequences How should we encode preferences for sequences of utilities? For example, should the agent prefer the reward sequence [0, 0, 1] or [1, 0, 0]? It’s reasonable to prefer rewards closer in time, e.g. to prefer [1, 0, 0] over [0, 0, 1]. We can model this by discounting, that is, decaying reward value exponentially. If a reward is worth 1 now, it is worth γ one step later, and worth γ 2 two steps later (γ is called the “discount” or “decay rate”). Stationary preferences are those which are invariant to the inclusion of another reward which delays the others in time, i.e.: [a1 , a2 , . . . ] ≻ [b1 , b2 , . . . ] ⇔ [r, a1 , a2 , . . . ] ≻ [r, b1 , b2 , . . . ] Nonstationary preferences are possible, e.g. if the delay of a reward changes its value relative to other rewards (maybe it takes a greater penalty for some reason). With stationary preferences, there are only two ways to define utilities: • Additive utility: U([r0 , r1 , r2 , . . . ]) = r0 + r1 + r2 + . . . • Discounted utility: U([r0 , r1 , r2 , . . . ]) = r0 + γr1 + γ 2 r2 + . . . Note that additive utility is just discounted utility where γ = 1. For now we will assume stationary preferences. If a game lasts forever, do we have infinite rewards? Infinite rewards makes it difficult to come up with a good policy. We can specify a finite horizon (like depth-limited search) and just consider only up to some fixed number of steps. This gives us nonstationary policies, since π depends on the time left. Alternatively, we can just use discounting, where 0 < γ < 1: U([r0 , . . . , r∞ ]) = ∞ ∑ t=0 γ t rt ≤ Rmax 1−γ 569 19.14. NON-DETERMINISTIC SEARCH 570 A smaller γ means a shorter-term focus (a smaller horizon). Another way is to use an absorbing state. That is, we guarantee that for every policy, a terminal state will eventually be reached. Usually we use discounting. Solving MDPs We say that the value (utility) of a state s is V ∗ (s), which is the expected utility of starting in s and acting optimally. This is equivalent to running expectimax from s. While a reward is for a state in a single time step, a value is the expected utility over all paths from that state. The value (utility) of a q-state (s, a) is Q∗ (s, a), called a Q-value, which is the expected utility starting out taking action a from state s and subsequently acting optimally. This is equivalent to running expectimax from the chance node that follows from s when taking action a. The optimal policy π ∗ (s) gives us the optimal action from a state s. So the main objective is to compute (expectimax) values for the states, since this gives us the expected utility (i.e. average sum of discounted rewards) under optimal action. More concretely, we can define value recursively: V ∗ (s) = max Q∗ (s, a) a Q∗ (s, a) = ∑ T (s, a, s ′ )[R(s, a, s ′ ) + γV ∗ (s ′ )] s′ These are the Bellman equations. They can be more compactly written as: V ∗ (s) = max a ∑ T (s, a, s ′ )[R(s, a, s ′ ) + γV ∗ (s ′ )] s′ Again, because these trees can go on infinitely (or may just be very deep), we want to limit how far we search (that is, how far we do this recursive computation). We can specify time-limited values, i.e. 
define Vk (s) to be the optimal value of s if the game ends in k more time steps. This is equivalent to depth-k expectimax from s. To clarify, k = 0 is the bottom of the tree, that is, k = 0 is the last time step (since there are 0 more steps to the end). We can use this with the value iteration algorithm to efficiently compute these Vk (s) values in our tree: • start with V0 (s) = 0 (i.e. with no time steps left, we have an expected reward sum of zero). Note that this is a zero vector over all states. 570 571 19.14. NON-DETERMINISTIC SEARCH • given a vector of Vk (s) values, do one ply of expectimax from each state: Vk+1 (s) = max a ∑ T (s, a, s ′ )[R(s, a, s ′ ) + γVk (s ′ )] s′ Note that since we are starting at the last time step k = 0 and moving up, when we compute Vk+1 (s) we have already computed Vk (s ′ ), so this saves us extra computation. Then we simply repeat until convergence. This converges if the discount is less than 1. With the value iteration algorithm, each iteration has complexity O(S 2 A). There’s no penalty for depth here, but the more states you have, the slower this gets. The approximations get refined towards optimal values the deeper you go into the tree. However, the policy may converge long before the values do - so while you may not have a close approximation of values, the policy/strategy they convey early on may already be optimal. Partially-Observable MDPs (POMDPs) Partially-observed MDPs are MDPs in which the states are not (fully) observed. They include observations O and an observation function P (o|s) (sometimes notated O(s, o); it gives a probability for an observation given a state). When we take an action, we get an observation which puts us in a new belief state (a distribution of possible states). Partially-observable environments may require information-gathering actions in addition to goaloriented actions. Such information-gathering actions may require detours from goals but may be worth it in the long run. See the section on reinforcement learning for more. With POMDPs the state space becomes very large because there are many (infinite) probability distributions over a set of states. As a result, you can’t really run value iteration on POMDPs, but you can use approximate Q-learning (see the section on reinforcement learning) or a truncated (limited lookahead) expectimax approach to approximate the value of actions. In general, however, POMDPs are very hard/expensive to solve. 19.14.4 Decision Networks Decision networks are a generalization of Bayes’ networks. Some nodes are random variables (these are essentially embedded Bayes’ networks), some nodes are action variables, in which a decision is made, and some nodes are utility functions, which computes a utility for its parent nodes. For instance, an action node could be “bring (or don’t bring) an umbrella”, and a random variable node could be “it is/isn’t raining”. These nodes may feed into a utility node which computes a utility based on the values of these nodes. For instance, if it is raining and we don’t bring an umbrella, we 571 19.14. NON-DETERMINISTIC SEARCH 572 will have a very low utility, compared to when it isn’t raining and we don’t bring an umbrella, for which we will have a high utility. We want to choose actions that maximize the expected utility given observed evidence. 
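Before moving on to how actions are selected in a decision network, here is a minimal sketch of the value iteration update described earlier in this section. It assumes the small dictionary-based MDP container sketched in the MDP subsection above (an illustrative layout, not the text's own code):

def value_iteration(mdp, epsilon=1e-6):
    # V_0(s) = 0 for every state (no time steps left means zero expected reward sum).
    V = {s: 0.0 for s in mdp.states}
    while True:
        delta = 0.0
        V_next = {}
        for s in mdp.states:
            if s in mdp.terminals:
                V_next[s] = 0.0
                continue
            # One ply of expectimax:
            # V_{k+1}(s) = max_a sum_{s'} T(s, a, s') [R(s, a, s') + gamma * V_k(s')]
            V_next[s] = max(
                sum(p * (mdp.R(s, a, s2) + mdp.gamma * V[s2])
                    for s2, p in mdp.T[(s, a)])
                for a in mdp.actions
            )
            delta = max(delta, abs(V_next[s] - V[s]))
        V = V_next
        if delta < epsilon:   # stop once values have (approximately) converged
            return V

The optimal policy can then be read off with one further step of expectimax over these values, as in the policy extraction formula discussed later.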
The general process for action selection is: • • • • • instantiate all evidence set action node(s) each possible way calculate the posterior for all parents of the utility node, given the evidence calculated the expected utility for each action choose the maximizing action (it will vary depending on the observed evidence) This is quite similar to expectimax/MDPs, except now we can incorporate evidence we observe. An example decision network. Rectangles are action nodes, ellipses are chance nodes, and diamonds are utility nodes. From Artificial Intelligence: Foundations of Computational Agents Value of information More evidence helps, but typically there is a cost to acquiring it. We can quantify the value of acquiring evidence as the value of information to determine whether or not it is more evidence is worth the cost. We can compute this with a decision network. The value of information is simply the expected gain in the maximum expected utility given the new evidence. For example, say someone hides 100 dollars behind one of two doors, and if we can correctly guess which door it is behind, we get the money. There is a 0.5 chance that the money is behind either door. In this scenario, we can use the following decision network: choose door → U money door → U 572 573 19.14. NON-DETERMINISTIC SEARCH Where choose door is the action variable, money door is the random variable, and U is the utility node. The utility function at U is as follows: choose door money door utility a a 100 a b 0 b a 0 b b 100 In this current scenario, our maximum expected utility is 50. That is, choosing either door a or b gives us 100 × 0.5 = 50 expected utility. How valuable is knowing which door the money is behind? We can consider that if we know which door the money is behind, our maximum expected utility becomes 100, so we can quantify the value of that information as 100 − 50 = 50, which is what we’d be willing to pay for that information. In this scenario, we get perfect information, because we observe the evidence “perfectly” (that is, our friend tells us the truth and there’s no chance that we misheard them). More formally, the value of perfect information of evidence E ′ , given existing evidence e (of which there might be none), is: VPI(E ′ |e) = ( ∑ e′ P (e ′ |e)MEU(e, e ′ )) − MEU(e) Properties of VPI: • nonnegative: ∀E ′ , e : VPI(E ′ |e) ≥ 0, i.e. is not possible for VPI to be negative (proof not shown) • nonadditive: VPI(Ej , Ek |e) ̸= VPI(Ej |e) + VPI(Ek |e) (e.g. consider observing the same evidence twice - no more information is added) • order-independent: VPI(Ej , Ek |e) = VPI(Ej |e) + VPI(Ek |e, Ej ) = VPI(Ek |e) + VPI(Ej |e, Ek ) Also: generally, if the parents of the utility node is conditionally independent of another node Z given the current evidence e, then VPI(Z|e) = 0. Evidence has to affect the utility node’s parents to actually affect the utility. What’s the value of imperfect information? Well, we just say that “imperfect” information is perfect information of a noisy version of the variable in question. For example, say we have a “light level” random variable that we observe through a sensor. Sensors always have some noise, so we add an additional random variable to the decision network (connected to the light level random variable) which corresponds to the sensor’s light level measurement. Thus 573 19.15. 
POLICIES 574 the sensor’s observations are “perfect” in the context of the sensor random variable, because they are exactly what the sensor observed, though technically they are noisy in the context of the light level random variable. 19.15 19.15.1 Policies Policy evaluation How do we evaluate policies? We can compute the values under a fixed policy. That is, we construct a tree based on the policy (it is a much simpler tree because for any given state, we only have one action - the action the policy says to take from that state), and then compute values from that tree. More specifically, we compute the value of applying a policy π from a state s: V π (s) = ∑ T (s, π(s), s ′ )[R(s, π(s), s ′ ) + γV π (s ′ )] s′ Again, since we only have one action to choose from, the maxa term has been removed. We can use an approach similar to value iteration to compute these values, i.e. V0π (s) = 0 π Vk+1 (s) = ∑ s′ T (s, π(s), s ′ )[R(s, π(s), s ′ ) + γVkπ (s ′ )] This approach is sometimes called simple value iteration since we’ve dropped maxa . This has complexity O(S 2 ) per iteration. 19.15.2 Policy extraction Policy extraction is the problem opposite to policy evaluation - that is, given values, how do we extract the policy which yields these values? Say we have optimal values V ∗ (s). We can extract the optimal policy π ∗ (s) like so: π ∗ (s) = argmax a ∑ T (s, a, s ′ )[R(s, a, s ′ ) + γV ∗ (s ′ )] s′ That is, we do one step of expectimax. What if we have optimal Q-values instead? With Q-values, it is trivial to extract the policy, since the hard work is already capture by the Q-value: 574 575 19.16. CONSTRAINT SATISFACTION PROBLEMS (CSPS) π ∗ (s) = argmax Q∗ (s, a) a 19.15.3 Policy iteration Value iteration is quite slow - O(S 2 A) per iteration. However, you may notice that the maximum value calculated for each state rarely changes. The result of this is that the policy often converges long before the values. Policy iteration is another way of solving MDPs (an alternative to value iteration) in which we start with a given policy and improve on it iteratively: • First, we evaluate the policy (calculate utilities for the given policy until the utilities converge). • Then we update the policy using one-step look-ahead (one-step expectimax) with the resulting converged utilities as the future (given) values (i.e. policy extraction). • Repeat until the policy converges. Policy iteration is optimal and, under some conditions, can converge must faster. More formally: Evaluation: iterate values until convergence: πi Vk+1 (s) = ∑ s′ T (s, πk (s), s ′ )[R(s, πk (s), s ′ ) + γVkπi (s ′ )] Improvement: compute the new policy with one-step lookahead: πi+1 (s) = argmax a ∑ T (s, a, s ′ )[R(s, a, s ′ ) + γV πi (s ′ )] s′ Policy iteration and value iteration are two ways of solving MDPs, and they are quite similar - they are just variations of Bellman updates that use one-step lookahead expectimax. 19.16 Constraint satisfaction problems (CSPs) Search as presented thusfar has been concerned with producing a plan or a policy describing how to act to achieve some goal state. However, there are search problems in which the aim is to identify the goal states themselves - such problems are called identification problems. In constraint satisfaction problems, we want to identify states which statisfy a set of constraints. We have a set of variables Xi , with values from a domain D (sometimes the domain varies according to i , e.g. X1 may have a different domain than X2 ). 
We assign each variable Xi a value from its corresponding domain; each unique assignment of these variables (which may be partial, i.e. some may be unassigned) is a state. We want to satisfy a set of constraints on what combinations of values are allowed on different subsets of variables. So we want to identify states which satisfy these constraints; that is, we want to identify variable assignments that satisfy the constraints.

Constraints can be specified using a formal language, e.g. encoding that A ≠ B or something like that.

We can represent constraints as a graph. In a binary CSP, each constraint relates at most two variables. We can construct a binary constraint graph in which the nodes are variables, and arcs show constraints. We don't need to specify what the constraints are. If we have constraints that are more than binary (that is, they relate more than just two variables), we can represent the constraints as square nodes in the graph and link them to the variables they relate (as opposed to representing constraints as the arcs themselves). General-purpose CSP algorithms use this graph structure for faster search.

19.16.1 Varieties of CSPs

Variables may be:

• discrete, with either finite domains or infinite domains (integers, strings, etc.)
• continuous

Constraints may be:

• unary (involve a single variable; this is essentially reducing a domain, e.g. A ≠ green)
• binary (involve a pair of variables)
• higher-order (involve three or more variables)

We may also have preferences, i.e. soft constraints. We can represent these as costs for each variable assignment. This gives us a constraint optimization problem.

19.16.2 Search formulation

We can formulate CSPs as search problems using search trees or search graphs (in the context of CSPs, they are called constraint graphs). States are defined by the values assigned so far (partial assignments). The initial state is the empty assignment, {}. Successor functions assign a value to an unassigned variable (one at a time). The goal test is to check if the current assignment is complete (all variables have values) and satisfies all constraints.

Breadth-first search does not work well here because all the solutions will be at the bottom of the search tree (all variables must have values assigned, and that happens only at the bottom). Depth-first search does a little better, but it is very naive - it can make a mistake early on in its path, but not realize it until reaching the end of a branch. The main shortcoming of these approaches is that we aren't checking constraints until it's far too late.

19.16.3 Backtracking search

Backtracking search is the basic uninformed search algorithm for solving CSPs. It is a simple augmentation of depth-first search. Rather than checking constraint satisfaction at the very end of a branch, we check constraints as we go, i.e. we only try values that do not conflict with previous assignments. This is called an incremental goal test. Furthermore, we only consider one variable at a time, in some order. Variable assignments are commutative (i.e. the order in which we assign them doesn't matter, e.g. A = 1 and then B = 2 leads to the same variable assignment as B = 2 then A = 1). So at one level, we consider assignments for A, at the next, for B, and so on. The moment we violate a constraint, we backtrack and try a different variable assignment.
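To make this concrete, here is a minimal sketch of a small binary CSP together with an incremental consistency check of the kind just described. The three-variable coloring-style problem and all names here are illustrative assumptions, not an example taken from the text:

# A tiny illustrative CSP: three variables with "not equal" constraints between them.
csp = {
    "variables": ["A", "B", "C"],
    "domains": {
        "A": ["red", "green", "blue"],
        "B": ["red", "green"],
        "C": ["green"],   # a unary constraint, expressed by shrinking the domain
    },
    # binary constraints as (X, Y, predicate) triples
    "constraints": [
        ("A", "B", lambda a, b: a != b),
        ("B", "C", lambda b, c: b != c),
        ("A", "C", lambda a, c: a != c),
    ],
}

def is_consistent(var, val, assignment, constraints):
    # incremental goal test: only check constraints involving already-assigned variables
    for x, y, ok in constraints:
        if x == var and y in assignment and not ok(val, assignment[y]):
            return False
        if y == var and x in assignment and not ok(assignment[x], val):
            return False
    return True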
Simple backtracking can be improved in a few ways:

• ordering:
  – we can be smarter about in what order we assign variables
  – we can be smarter about what we try for the next value for a variable
• filtering: we can detect failure earlier
• structure: we can exploit the problem structure

Backtracking pseudocode:

def backtracking(csp):
    def backtracking_recursive(assignment):
        if is_complete(assignment):
            return assignment
        var = select_unassigned_variable(csp.variables, assignment)
        for val in csp.order_domain_values(var, assignment):
            if is_consistent_with_constraints(val, assignment, csp.constraints):
                assignment[var] = val
                result = backtracking_recursive(assignment)
                if result is not None: # if not a failure
                    return result
                else: # otherwise, remove the assignment and try the next value
                    del assignment[var]
        return None # failure
    return backtracking_recursive({})

Filtering

Filtering looks ahead to eliminate incompatible variable assignments early on.

With forward checking, when we assign a new variable, we look ahead and eliminate values for other variables that we know will be incompatible with this new assignment. So when we reach that variable, we only have to check values we know will not violate a constraint (that is, we only have to consider a subset of the variable's domain). If we reach an empty domain for a variable, we know to back up.

With constraint propagation methods, we can check for failure ahead of time. One constraint propagation method is arc consistency (AC3). First, we must consider the consistency of an arc (here, in the context of binary constraints, but this can be extended to higher-order constraints). In the context of filtering, an arc X → Y is consistent if and only if for every x in the tail there is some y in the head which could be assigned without violating a constraint. An inconsistent arc can be made consistent by deleting values from its tail; that is, by deleting tail values which lead to constraint-violating head values. Note that since arcs are directional, a consistency relationship (edge) must be checked in both directions. We can re-frame forward checking as just enforcing consistency of arcs pointing to each new assignment.

A simple form of constraint propagation is to ensure all arcs in the CSP graph are consistent. Basically, we visit each arc, check if it's consistent, and if not, delete values from its tail until it is consistent. If we encounter an empty domain (that is, we've deleted all values from its tail), then we know we have failed. Note that if a value is deleted from the tail of a node, its incoming arcs must be re-checked. We combine this with backtracking search by applying this filtering after each new variable assignment. It's extra work at each step, but it should save us backtracking.

Arc consistency (AC3) pseudocode:

def AC3(csp):
    queue = csp.all_arcs()
    while queue:
        from_node, to_node = queue.pop()
        if remove_inconsistent_values(from_node, to_node):
            # values were removed from from_node's domain, so arcs into it must be re-checked
            for node in neighbors(from_node):
                queue.append((node, from_node))
    return csp

def remove_inconsistent_values(from_node, to_node):
    removed = False
    for x in list(domain[from_node]):
        # delete x if no value y in to_node's domain lets (x, y) satisfy the
        # constraint between from_node and to_node
        if not any(satisfies_constraint(from_node, x, to_node, y) for y in domain[to_node]):
            domain[from_node].remove(x)
            removed = True
    return removed

Arc consistency can be generalized to k-consistency:

• 1-consistency is node consistency, i.e. each node's domain has a value which satisfies its own unary constraints.
• 2-consistency is arc consistency: for each pair of nodes, any consistent assignment to one can be extended to the other (“extended” meaning from the tail to the head). • k-consistency: for each k nodes, any consistent assignment to k − 1 can be extended to the kth node. • 3-consistency is called path consistency Naturally, a higher k consistency is more expensive to compute. We can extend this further with strong k-consistency which means that all lower orders of consistency (i.e. k − 1 consistency, k − 2 consistency, etc) are also satisfied. With strong k-consistency, no backtracking is necessary - but in practice, it’s never practical to compute. Ordering One method for selecting the next variable to assign to is called minimum remaining values (MRV), in which we choose the variable with the fewest legal values left in its domain (hence this is sometimes called most constrained variable). We know this number if we are running forward checking. Essentially we decide to try the hardest variables first so if we fail, we fail early on and thus have to do less backtracking (for this reason, this is sometimes called fail-fast ordering). For choosing the next value to try, a common method is least constraining value. That is, we try the value that gives us the most options later on. We may have to re-run filtering to determine what the least constraining value is. Problem Structure Sometimes there are features of the problem structure that we can use to our advantage. For example, we may have independent subproblems (that is, we may have multiple connected components; i.e. isolated subgraphs), in which case we can divide-and-conquer. In practice, however, you almost never see independent subproblems. 579 19.16. CONSTRAINT SATISFACTION PROBLEMS (CSPS) 580 Tree-Structured CSPs Some CSPs have a tree structure (i.e. have no loops). Tree-structured CSPs can be solved in O(nd 2 ) time, much better than the O(d n ) for general CSPs. The algorithm for solving tree-structured CSPs is as follows: 1. For order in a tree-structured CSP, we first choose a root variable, then order variables such that parents precede children. 2. Backward pass: starting from the end moving backwards, we visit each arc once (the arc pointing from parent to child) and make it consistent. 3. Forward assignment: starting from the root and moving forward, we assign each variable so that it is consistent with its parent. This method has some nice properties: • after the backward pass, all root-to-leaf arcs are consistent • if root-to-leaf arcs are consistent, the forward assignment will not backtrack Unfortunately, in practice you don’t typically encounter tree-structured CSPs. Rather, we can improve an existing CSPs structure so that it is nearly tree-structured. Sometimes there are just a few variables which prevent the CSP from having a tree structure. With cutset conditioning, we assign values to these variables such that the rest of the graph is a tree. This, for example, turns binary constraints into unary constraints, e.g. if we have a constraint A ̸= B and we fix B = green, then we can rewrite that constraint as simply A ̸= green. Cutset conditioning with a cutset size c gives runtime O(d c (n − c)d 2 ), so it is fast for a small c. More specifically, the cutset conditioning algorithm: 1. choose a cutset (the variables to set values for) 2. instantiate the cutset in all possible ways (e.g. produce a graph for each possible combination of values for the cutset) 3. 
for each instantiation, compute the residual (tree-structured) CSP by removing the cutset constraints and replacing them with simpler constraints (e.g. replace binary constraints with unary constraints as demonstrated above) 4. solve the residual CSPs Unfortunately, finding the smallest cutset is an NP-hard problem. There are other methods for improving the CSP structure, such as tree decomposition. Tree decomposition involves creating “mega-variables” which represent subproblems of the original problem, such that the graph of these mega-variables has a tree structure. For each of these megavariables we consider valid combinations of assignments to its variables. These subproblems must overlap in the right way (the running intersection property) in order to ensure consistent solutions. 580 581 19.16.4 19.17. ONLINE EVOLUTION Iterative improvement algorithms for CSPs Rather than building solutions step-by-step, iterative algorithms start with an incorrect solution and try to fix it. Such algorithms are local search methods in that they work with “complete” states (that is, all variables are assigned, though constraints may be violated/unsatisfied), and there is no fringe. Then we have operators which reassign variable values. A very simple iterative algorithm: • while not solved • randomly select any conflicted variable • select a value which violates the fewest constraints (the min-conflicts heuristic), i.e. hill climb with h(n) = num. of violated constraints In practice, this min-conflicts approach tends to perform quickly for randomly-generated CSPs; that is, there are some particular CSPs which are very hard for it, but for the most part, it can perform in almost constant time for arbitrarily difficult randomly-generated CSPs. Though, again, unfortunately many real-world CSPs fall in this difficult domain. 19.17 Online Evolution Multi-action adversarial games (assuming turn-based) are tricky because they have enormous branching factors. The problem is no longer what the best single action is for a turn - now we need to find the best sequence of actions to take. An evolutionary algorithm can be applied to select these actions in a method called online evolution because the agent doesn’t not learn in advance (offline learning), rather, it learns the best moves while it plays. Online evolution evolves the actions in a single turn and uses an estimation of the state at the end of the turn as a fitness function. This is essentially a single iteration of rolling horizon evolution, a method that evolves a sequence of actions and evolves new action sequences as those actions are executed. In its application here, we have a horizon of just one turn. An individual (to be evolved) in this context is a candidate sequence of actions for the turn. A basic genetic algorithm can be applied. The fitness function can include rollouts, e.g. to a depth of one extra turn, to incorporate how an opponent might counter move, but it may not help performance. 19.18 References • Introduction to Monte Carlo Tree Search. Jeff Bradberry. • MIT 6.034 (Fall 2010): Artificial Intelligence. Patrick H. Winston. MIT. • Introduction to Artificial Intelligence (2nd ed). Philip C. Jackson, Jr. 1985. 581 19.18. REFERENCES 582 • Integrating Learning in a Multi-Scale Agent. Ben G. Weber. 2012. • Planning Algorithms. Steven M. LaValle. 2006. • Artificial Intelligence Planning. Dr. Gerhard Wickler, Prof. Austin Tate. The University of Edinburgh (Coursera). 2015. • Intro to Artificial Intelligence. CS271. 
Peter Norvig, Sebastian Thrun. Udacity. • Logical Foundations of Artificial Intelligence (1987) (Chapter 12: Planning) • CS188: Artificial Intelligence. Dan Klein, Pieter Abbeel. University of California, Berkeley (edX). • Artificial Intelligence: Foundations of Computational Agents. David Poole, Alan Mackworth. • Algorithmic Puzzles. Anany Levitin, Maria Levitin. 2011. • A way to deal with enormous branching factors. Julian Togelius. March 25, 2016. • Online Evolution for Multi-Action Adversarial Games. Niels Justesen, Tobias Mahlmann, Julian Togelius. 582 583 20 Planning Planning is tricky because: • environmental properties: – systems may be stochastic – there may be multiple agents in the system – there may be partial observability (that is, the state of the system may not be fully known) • agent properties: – some information may be unknown – plans are hierarchical (high level to low level parts) In planning we represent the world in belief states, in which multiple states are possible (because of incomplete information, we are not certain what the true state of the world is). Actions that are taken can either increase or decrease the possible states, in some cases down to one state. Sequences of actions can be defined as trees, where each branch is an action and each node is a state or is an observation of the world. Then we search this tree for a plan which will satisfy the goal. Broadly within planning, there are two kinds: • domain-specific planning, in which the representations and techniques are tailored for a particular problem (e.g. path and motion planning, perception planning, manipulation planning, communication planning) • domain-independent planning, in which generic representations and techniques are used The field of planning includes many areas of research and different techniques: • Domain modeling (HTN, SIPE) • Domain description (PDDL, NIST PSL) CHAPTER 20. PLANNING 583 20.1. AN EXAMPLE PLANNING PROBLEM • • • • • • • • • • • • • • • • • • • • • • • • 584 Domain analysis (TIMS) Search methods (Heuristics, A*) Graph planning algorithms (GraphPlan) Partial-order planning (Nonlin, UCPOP) Hierarchical planning (NOAH, Nonlin, O-Plan) Refinement planning (Kambhampati) Opportunistic search (OPM) Constraint satisfaction (CSP, OR, TMMS) Optimization method (NN, GA, ant colony optimization) Issue/flaw handling (O-Plan) Plan analysis (NOAH, Critics) Plan simulation (QinetiQ) Plan qualitative modeling (Excalibur) Plan repair (O-Plan) Re-planning (O-Plan) Plan monitoring (O-Plan, IPEM) Plan generalization (Macrops, EBL) Case-based planning (CHEF, PRODIGY) Plan learning (SOAR, PRODIGY) User interfaces (SIPE, O-Plan) Plan advice (SRI/Myers) Mixed-initiative plans (TRIPS/TRAINS) Planning web services (O-Plan, SHOP2) Plan sharing & comms (I-X, I-N-C-A) 20.1 An example planning problem Planning approaches are often presented on toy problems, which can be quite different from real-world problems. Namely, toy problems have a concise and exact description, but real-world problems seldom, if ever, have an agreed-upon or unambiguous description. They also have important consequences, whereas toy problems do not. But toy problems provide a standard way of comparing approaches. Some example toy problems: • • • • the the the the farmers, wolves, and the river sliding-block puzzle n-queens problem Dock-Worker Robots (DWR) domain (i.e. 
the container/block stacking problem) For the following notes on planning, we will use the Dock-Worker Robots problem as an toy problem: • we have some containers 584 CHAPTER 20. PLANNING 585 • • • • • 20.2. STATE-SPACE PLANNING VS PLAN-SPACE (PARTIAL-ORDER) PLANNING we have some locations, connected by paths containers can be stacked onto pallets (no limit to the height of these stacks) we have robots (vehicles) that can have a container loaded onto them each location can only have one robot at a time we have cranes which can pick up and stack containers (one at a time) The available actions are: • • • • • robot r from location l to some adjacent and unoccupied location l ′ take container c with empty crane k from the top of pile p, all located at the same location l put down container c held by crane k on top of pile p, all located at location l load container c held by crane k onto unloaded robot r , all located at location l unload container c with empty crane k from loaded robot r , all located at location l move 20.2 State-space planning vs plan-space (partialorder) planning There are two main approaches to planning: state-space planning and plan-space planning, sometimes called partial-order planning. • • • • • • • • • • • • • state-space planning finite search space explicit representation of intermediate states commits to specific action orderings causal structure only implicit search nodes relatively simple and successors are easy to compute not great at handling goal interactions plan-space planning infinite search space no explicit intermediate states choice of actions and order are independent (no commitment to a particular ordering) explicit representation of rationale search nodes are complex and successors are expensive to compute Nowadays with efficient heuristics, state-space planning is the more efficient way for finding solutions. 20.3 State-space planning With state-space planning, we generate plans by searching through state space. CHAPTER 20. PLANNING 585 20.3. STATE-SPACE PLANNING 20.3.1 586 Representing plans and systems We can use a state-transition system as a conceptual model for planning. Such a system is described by a 4-tuple Σ = (S, A, E, γ), where: • • • • S = {s1 , s2 , . . . } is a finite or recursively enumerable set of states A = {a1 , a2 , . . . } is a finite or recursively enumerable set of actions E = {e1 , e2 , . . . } is a finite or recursively enumerable set of events γ : S × (A ∪ E) → 2S is a state transition function (note 2S is the power set of all states, that is an element of the set is itself a set of world states) If a ∈ A and γ(s, a) ̸= ∅, then a is applicable in s. Applying a in s will take the system to s ′ ∈ γ(s, a). We can also represent such a state-transition system as a directed labelled graph G = (NG , EG ), where: • the nodes correspond to the states in S, i.e. NG = S • there is an arc from s ∈ NG to s ′ ∈ NG , i.e. s → s ′ ∈ EG , with label u ∈ (A ∪ E) if and only if s ′ ∈ γ(s, u). A plan is a structure that gives us appropriate actions to apply in order to achieve some objective when starting from a given state. 
The objective can be: • • • • a goal state sg or a set of goal states Sg to satisfy some conditions over the sequence of states to optimize a utility function attached to states a task to be performed A permutation of a solution (a plan) is a case in which some actions in the path to the solution can have their order changed without affecting the success of the path (that is, the permuted path still leads to the solution with the same cost). In this case, the actions are said to be independent. Generally we have a planner which generates a plan and the passes the plan to a controller which executes the actions in the plan. The execution of the action then changes the state of the system. The system, however, changes not only via the controller’s actions but also through external events. So the controller must observe the system using an observation function η : S → O and generate the appropriate action. Sometimes, however, there may be parts of the system which we cannot observe, so, given the observations that could be collected, there may be many possible states of the world - this is the belief state of the controller. The system as it actually is is often different from how it was described to the planner (as Σ, which, as an abstraction, loses some details). In dynamic planning, planning and execution are more closely 586 CHAPTER 20. PLANNING 587 20.3. STATE-SPACE PLANNING linked to compensate for this scenario (which is more the rule than the exception). That is the controller must supervise the plan, i.e. it must detect when observations differ from expected results. The controller can pass this information to the planner as an execution status, and then the planner can revise its plan to take into account the new state. 20.3.2 STRIPS (Stanford Research Institute Problem Solver) The STRIPS representation gives us an internal structure to our states, which up until now have been left as black boxes. It is based on first order predicate logic; that is, we have objects in our domain, represented by symbols and grouped according to type, and these objects are related (such relationships are known as predicates) to each other in some way. For example, in the Dock-Worker Robot domain, one type of object is robot and each robot would be represented with a unique symbol, e.g. robot1, robot2, etc. We must specify all of this in a syntax that the planner can understand. The most common syntax is PDDL (Planning Domain Definition Language). 
For example: (define (domain dock-worker-robot)) (:requirements :strips :typing) (:types location ;there are several connected locations pile ;is attached to a location, ;it holds a pallet and a stack of containers robot ;holds at most 1 container, ;only 1 robot per location crane ;belongs to a location to pickup containers container ) (:predicates (adjacent ?l1 ?l2 - location) ;location ?l1 is adjacent to ?l2 (attached ?p - pile ?l - location) ;pile ?p attached to location ?l (belong ?k - crane ?l - location) ;crane ?k belongs to location ?l (at ?r - robot ?l - location) ;robot ?r is at location ?l (occupied ?l - location) ;there is a robot at location ?l (loaded ?r - robot ?c - container) ;robot ?r is loaded with container ?c (unloaded ?r - robot) ;robot ?r is empty (holding ?k - crane ?c - container) ;crane ?k is holding a container ?c (empty ?k - crane) ;crane ?k is empty (in ?c - container ?p - pile) ;container ?c is within pile ?p (top ?c - container ?p - pile) ;container ?c on top of pile ?p (on ?c1 ?c2 - container) ;container ?c1 is on container ?c2 CHAPTER 20. PLANNING 587 20.3. STATE-SPACE PLANNING 588 ) ) Let L be a first-order language with finitely many predicate symbols, finitely many constant symbols, and no function symbols (e.g. as defined with PDDL above). A state in a STRIPS planning domain is a set of ground atoms of L. An atom is a predicate with an appropriate number of objects (e.g. those we defined above). An atom is ground if all its objects are real objects (rather than variables). • (ground) atom p holds in state s if and only if p ∈ s (this is the closed world assumption); i.e. it is “true” • s satisfies a set of (ground) literals (a literal is an atom that is either positive or negative, e.g. an atom or a negated atom) g (denoted s ⊨ g) if: • every positive literal in g is in s • every negative literal in g is not in s Say we have the symbols loc1, loc2, p1, p2, crane1, r1, c1, c2, c3, pallet. An example state for the DWR problem: state = { adjacent(loc1, loc2), adjacent(loc2, loc1), attached(p1, loc1), attached(p2, loc1), belong(crane1, loc1), occupied(loc2), empty(crane1), at(r1,loc2), unloaded(r1), in(c1,p1), in(c3,p1), on(c3,c1), on(c1,pallet), top(c3,p1), in(c2,p2), on(c2,pallet), top(c2,p2) } In STRIPS, a planning operator is a triple o = name(o), precond(o), effects(o), where: • the name of the operator name(o) is a syntactic expression of the form n(x1 , . . . , xk ) where n is a unique symbol and x1 , . . . , xk are all variables that appear in o (i.e. it is a function signature) • the preconditions precond(o) and the effects effects(o) of the operator are sets of literals (i.e. positive or negative atoms) • the positive effects form the add list 588 CHAPTER 20. PLANNING 589 20.3. STATE-SPACE PLANNING • the negative effects form the delete list An action in STRIPS is a ground instance of a planning operator (that is, we substitute the variables for symbols, e.g. we are “calling” the operator, as in a function). For example, we may have an operator named move(r,l,m) with the preconditions adjacent(l,m), at(r,l), !occupied(m) and the effects at(r,m), occupied(m), !occupied(l), !at(r,l). An action might be move(robot1, loc1, loc2), since we are specifying specific instances to operate on. 
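As a rough sketch of the same idea in code (the dictionary layout, the ground helper, and the applicability/transition functions below are illustrative assumptions, not the text's own representation; the PDDL rendering of the same operator follows right after), an operator can be held as plain data and grounded by substituting constants for its variables:

# The move operator held as data; atoms are tuples like ("at", "r", "l").
move_operator = {
    "name": "move",
    "precond_pos": [("adjacent", "l", "m"), ("at", "r", "l")],
    "precond_neg": [("occupied", "m")],
    "effects_add": [("at", "r", "m"), ("occupied", "m")],
    "effects_del": [("occupied", "l"), ("at", "r", "l")],
}

def ground(op, bindings):
    # Substitute constants for variables to turn the operator into an action,
    # e.g. ground(move_operator, {"r": "robot1", "l": "loc1", "m": "loc2"}).
    def sub(atoms):
        return {tuple(bindings.get(term, term) for term in atom) for atom in atoms}
    return {key: (sub(val) if isinstance(val, list) else val) for key, val in op.items()}

def applicable(action, state):
    # All positive preconditions hold in the state, and no negative precondition does.
    return action["precond_pos"] <= state and not (action["precond_neg"] & state)

def apply_action(action, state):
    # gamma(s, a) = (s - effects^-(a)) ∪ effects^+(a), matching the state
    # transition function defined under "Applicability and state transitions" below.
    return (state - action["effects_del"]) | action["effects_add"]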
In the PDDL syntax, this can be written: (:action move :parameters (?r - robot ?from ?to - location) :precondition (and (adjacent ?from ?to) (at ?r ?from) (not (occupied ?to))) :effect (and (at ?r ?to) (occupied ?to) (not (occupied ?from)) (not (at ?r ?from)) )) This is a bit confusing because PDDL does not distinguish “action” from “operator”. 20.3.3 Other representations Representations other than STRIPS includes: • propositional reprsentation: • world state is a set of propositions (i.e. only symbols, no variables) • actions consist of precondition propositions, propositions to be added and removed (i.e. there are no operators b/c we only have symbols) • so the STRIPS representation is essentially propositional representation but with first-order literals instead of propositions (i.e. the preconditions of an operator can be positive or negative) • state-variable representation: • state is a tuple of state variables {x1 , . . . , xn } • an action is a partial function over states These representations, however, can all be translated between each other. 20.3.4 Applicability and state transitions When is an action applicable in a state? Let L be the set of literals. L+ is the set of atoms that are positive literals in L and L− is the set of all atoms whose negations are in L. Let a be an action and s a state. a is applicable in s if and only if: CHAPTER 20. PLANNING 589 20.3. STATE-SPACE PLANNING 590 • precond+ (a) ⊆ s • precond− (a) ∩ s = ∅ Which just says all positive preconditions must be true in the current state, and all negative preconditions must be false in the state. The state transition function γ for an applicable action a in state s is defined as: γ(s, a) = (s − effects− (a)) ∪ effects+ (a) That is, we apply the delete list (remove those effects from the state) and apply the add list (add those effects to the state). Finding actions applicable for a given state is a non-trivial problem, in particular because there may be many, many available actions. We can define an algorithm which will find the applicable actions for a given operator in a given state: • • • • • • • • initialize: A is a set of actions, initially empty op is the operator precs is the list of remaining preconditions to be satisfied v is the substitutions for the variables of the operator s is the given state function addApplicables(A, op, precs, v, s) if no positive preconditions remaining – for every negative precondition np in s – if the state falsifies the np, return – add v(op) to A • else: – – – – select the next positive precondition pp for each proposition sp in s extend ‘v’ such that pp and sp match, the result is v’ if v’ is valid, then: * addApplicables(A, op, (precs - pp), v’, s) We can formally define a planning domain in STRIPS. Given our function-free first-order language L, a STRIPS planning domain on L is a restricted (meaning there are no events) state-transition system Σ = (S, A, γ) such that: • S is a set of STRIPS states, i.e. sets of ground atoms • A is a set of ground instances of some STRIPS planning operators O (i.e. actions) 590 CHAPTER 20. PLANNING 591 • • • • 20.3. 
STATE-SPACE PLANNING γS × A → S where γ(s, a) = (s − effects− (a)) ∪ effects+ (a) if a is applicable in s γ(s, a) = undefined otherwise S is closed under γ We can formally define a planning problem as a triple P = (Σ, si , g), where: • Σ is the STRIPS planning domain (as described above) • si ∈ S is the initial state • g is a set of ground literals describing the goal such that the set of goal states is Sg = {s ∈ S|s ⊨ g} (as a reminder, s ⊨ g means s satisfies g) In PDDL syntax, we can define the initial state like so: (:init (adjacent l1 l2) (adjacent l2 l1) ;etc ) and the goal like so: (:goal (and (in c1 p2) (in c2 p2) ;etc )) We formally define a plan as any sequence of actions π = a1 , . . . , ak where k ≥ 0: • The length of a plan π is |π| = k, i.e. the number of actions • If π1 = a1 , . . . , ak and π2 = a1′ , . . . , aj′ are plans, their concatenation is the plan π1 · π2 = a1 , . . . , ak , a1′ , . . . , aj′ • The extended state transition function for plans is defined as follows: • γ(s, π) = s if k = 0 (that is, if π is empty) • γ(s, π) = γ(γ(s, a1 ), a2 , . . . , ak ) if k > 0 and a1 is applicable in s • γ(s, π) = undefined otherwise A plan π is a solution for a planning problem P if γ(si , π) satisfies g. A solution π is redundant if there is a proper subsequence of π that is also a solution for P. A solution π is minimal if no other solution for P contains fewer actions that π. CHAPTER 20. PLANNING 591 20.3. STATE-SPACE PLANNING 20.3.5 592 Searching for plans Forward Search The basic idea is to apply standard search algorithms (e.g. bread-first, depth-first, A*, etc) to the planning problem. • • • • search space is a subset of the state space nodes correspond to world states arcs correspond to state transitions path in the search space corresponds to plan Forward search is sound (if a plan is returned, it will indeed be a solution) and it is complete (if a solution exists, it will be found). Backward Search Alternatively, we can search backwards from a goal state to the initial state. First we define two new concepts: An action a ∈ A is relevant for g if: • g ∩ effects(a) ̸= ∅ • g + ∩ effects− (a) = ∅ • g − ∩ effects+ (a) = ∅ Essentially what this is says is the action must contribute the goal (the first item) and the action must not interfere with the goal (the last two items). This is equivalent to applicability. The regression set of g for a relevant action a ∈ A is: γ −1 (g, a) = (g − effects(a)) ∪ precond(a) That is, it is the inverse of the state transition function. When searching backwards, sometimes we end up with operators rather than actions (i.e. some of the parameters are still variables). We could in theory branch out to all possible actions from this operator by just substituting all possible values for the variable, but that will increase the branching factor by a lot. Instead, we can do lifted backward search, in which we just stick with these partially instantiated operators instead of actions. Keeping variables in a plan, such as with lifted backward search, is called least commitment planning. 592 CHAPTER 20. PLANNING 593 20.3. STATE-SPACE PLANNING 20.3.6 The FF Planner The FF Planner performs a forward state-space search (the basic strategy can be A* or enforced hill climbing (EHC, a kind of best-first search where we commit to the first state that looks better than all previous states we have looked at)). It uses a relaxed problem heuristic hF F . The relaxed problem is constructed by ignoring delete list of all the operators. 
Then we solve this relaxed problem; this can be done in polynomial time: - chain forward to build a relaxed planning graph - chain backward to extract a relaxed plan from the graph Then we use the length (i.e. number of actions) of the relaxed plan as a heuristic value (e.g. for A*). For example, with the simplified DWR from before: • • • • • • • • • move(r,l,l’) precond: at(r,l), adjacent(l,l’) effects: at(r,l), not at(r,l) load(c,r,l) precond: at(r,l), in(c,l), unloaded(r) effects: loaded(r,c), not in(c,l), not unloaded(r) unload(c,r,l) precond: at(r,l), loaded(r,c) effects: unloaded(r), in(c,l), not loaded(r,c) To get the relaxed problem, we drop all delete lists: • • • • • • • • • move(r,l,l’) precond: at(r,l), adjacent(l,l’) effects: at(r,l), not at(r,l) load(c,r,l) precond: at(r,l), in(c,l), unloaded(r) effects: loaded(r,c) unload(c,r,l) precond: at(r,l), loaded(r,c) effects: unloaded(r), in(c,l) Pseudocode for computing the relaxed planning graph (RPG): • function computeRPG(A, si , g) • F0 = si , t = 0 • while g ⊊ Ft do – t =t +1 – At = {a ∈ A|precond(a) ⊆ Ft } CHAPTER 20. PLANNING 593 20.4. PLAN-SPACE (PARTIAL-ORDER) PLANNING – – – – 594 Ft = Ft−1 for all a ∈ At do Ft = Ft ∪ effects+ (a) if Ft = Ft−1 then return failure • return [F0 , A1 , F1 , . . . , At , Ft ] Pseudocode for extracting a plan from the RPG (in particular, the size of the plan, since this is a heuristic calculation): • • • • function extractRPSize([F0 , A1 , F1 , . . . , Ak , Fk ], g) if g ⊊ Fk then return failure M = max{firstlevel(gi , [F0 , . . . , Fk ])|gi ∈ g} for t = 0 to M do – Gt = {gi ∈ g|firstlevel(gi , [F0 , . . . , Fk ]) = t} • for t = M to 1 do – for all gt ∈ Gt do – select a : firstlevel(a, [A1 , . . . , At ]) = t and gt ∈ effects+ (a) – for all p ∈ precond(a) do * Gfirstlevel(p,[F0 ,...,Fk ]) = Gfirstlevel(p,[F0 ,...,Fk ]) ∪ {p} • return number of selected actions The firstlevel function tells us which layer (by index) a goal gi first appears in the planning graph. This heuristic is not admissible (it is not guaranteed to return a minimal plan), but in practice it is quite accurate, so it (or ideas inspired by it) are frequently used (currently state-of-the-art). 20.4 Plan-space (partial-order) planning Partial plans are like plans mentioned thus far (i.e. simple sequences of actions), but we also record the rationale behind each action, e.g. to achieve the precondition of another action. In aggregate these partial plans may form the solution to the problem (i.e. they act as component plans). We also have explicit ordering constraints (i.e. these actions must occur in this order), so we can have partial plans with partial order, which means that we can execute actions in parallel. And as with lifted backward search, we may have variables in our actions as well. We adjust the planning problem a bit - instead of achieving goals, we want to accomplish tasks. Tasks are high-level descriptions of some activity we want to execute - this is typically accomplished by decomposing the high-level task into lower-level subtasks. This is the approach for Hierarchical Task Network (HTN) planning. There is a simpler version called STN planning as well. 594 CHAPTER 20. PLANNING 595 20.4. PLAN-SPACE (PARTIAL-ORDER) PLANNING Rather than searching through the state-space, we search through plan-space - a graph of partial plans. The nodes are partially-specified plans, the arcs are plan refinement operations (which is why this is called refinement planning), and the solutions are partial-order plans. 
More concretely, if a plan is a set of actions organized into some structure, then a partial plan is: • • • • • a subset of the actions a subset of the organizational structure temporal ordering of actions rationale: what the action achieves in the plan a subset of variable bindings More formally, we define a partial plan as a tuple π = (A, ≺, B, L) where: • A = {a1 , . . . , ak } is a set of partially-instantiated planning operators • ≺ is a set of ordering constraints on A of the form (ai ≺ aj ) • B is a set of binding constraints on the variables of actions in A of the form x = y , x ̸= y or x ∈ Dx • L is a set of causal links of the form ai → [p] → aj such that: • ai , aj are actions in A • the constraint (ai ≺ aj ) is in ≺ • the proposition p is an effect of ai and a precondition of aj • the binding constraints for variables in ai and aj appearing p are in B Note that for causal links, ai is the producer in the causal link and aj is the consumer. 20.4.1 Plan refinement operations Adding actions With least-commitment planning, we only want to add actions (more specifically, we are adding partially-instantiated operators) if it’s justified: • to achieve unsatisfied preconditions • to achieve unsatisfied goal conditions Actions can be added anywhere in the plan. Note that each action that we add has its own set of variables, unrelated to those of other actions. Adding causal links Causal links link a provider (an effect of an action or an atom that holds in the initial state) to a consumer (a precondition of an action or a goal condition). There is an ordering constraint here as well, in that the provider must come before the consumer (but not necessarily directly before). We add causal links to prevent interference with other actions. CHAPTER 20. PLANNING 595 20.4. PLAN-SPACE (PARTIAL-ORDER) PLANNING 596 Adding variable bindings A solution plan must have actions, not partially-instantiated operators. Variable bindings are what allow us to turn operators into actions. Variable bindings constraints keep track of possible values for variables, and also can specify codesignation (i.e. that certain variables must or must not have the same value). For example, with causal links, there are variables in the producer that must be the same as those corresponding variables in the consumer (because they are “carried over”). When two variables must share the same value, we say they are unified. Adding ordering constraints Ordering constraints are just binary relations specifying the temporal order between actions in a plan. Ordering constraints help us avoid possible interference. Causal links imply ordering constraints, and some trivial ordering constraints are that all actions must come after the initial state and before the goal. 20.4.2 The Plan-Space Search Problem The initial search state includes just the initial state and the goal as “dummy actions”: • an init action with no preconditions and with the initial state as its effects • a goal action with the goal conditions as its preconditions and with no effects We start with the empty plan: π0 = ({init, goal}, {(init ≺ goal)}, {}, {}) It includes just the two dummy actions, one ordering constraint (init before goal), and no variable bindings or causal links. 
We generate successors through one or more plan refinement operators: • • • • adding adding adding adding 20.4.3 an action to A an ordering constraint to ≺ a binding constraint to B a causal link to L Threats and flaws A threat in a partial plan is when we have an action that might occur in parallel with a causal link and has an effect that is complimentary to the condition we want to protect (that is, it interferes with a condition we want to protect). We can often get around this by introducing a new causal link that requires this conflicting action to follow the causal link, instead of occurring in parallel. More formally, an action ak in a partial plan is a thread to a causal link ai → [p] → aj if and only if: 596 CHAPTER 20. PLANNING 597 20.4. PLAN-SPACE (PARTIAL-ORDER) PLANNING • ak has an effect ̸ q that is possibly inconsistent with p, i.e. q and p are unifiable • the ordering constraints (ai ≺ ak ) and (ak ≺ aj ) are consistent with ≺ • the binding constraints for the unification q and p are consistent with B That is, if we have one action which produces the precondition p, which is what we want, but another action which simultaneously produces the precondition ̸ q where q and p are unifiable, then we have a threat. A flaw in a partial plan is either: • an unsatisfied subgoal, e.g. a precondition of an action in A without a causal link that supports it, or • a threat 20.4.4 Partial order solutions We consider a plan π = (A, ≺, B, L) as partial order solution for a planning problem P if: • its ordering constraints ≺ are not circular • its binding constraints B are consistent • it is flawless 20.4.5 The Plan-Space Planning (PSP) algorithm The main principle is to refine the partial plan π while maintaining ≺ and B consistent until π has no more flaws. Basic operations: • • • • • find the flaws of π select a flaw find a way of resolving the chosen flaw choose one of the resolvers for the flaw refine π according to the chosen resolver The PSP procedure is sound and complete whenever π0 can be refined into a solution plan and PSP(π0 ) returns such a plan. To clarify, soundness: if PSP returns a plan, it is a solution plan, and completeness: if there is a solution plan, PSP will return it. Proof: • soundness: ≺ and B are consistent at every stage of refinement • completeness: induction on the number of actions in the solution plan CHAPTER 20. PLANNING 597 20.4. PLAN-SPACE (PARTIAL-ORDER) PLANNING 598 The general algorithm: function PSP(plan): all_flaws = plan.open_goals() + plan.threats() if not all_flaws: return plan flaw = all_flaws.select_one() all_resolvers = flaw.get_resolvers(plan) if not all_resolves: return failure resolver = all_resolvers.choose_one() new_plan = plan.refine(resolver) return PSP(new_plan) Where the initial plan is the empty plan as described previously. is a non-deterministic decision (i.e. this is something we may need to backtrack to; that is, if one resolver doesn’t work out, we need to try another branch). all_resolvers.choose_one is a deterministic selection (we don’t need to backtrack to this because all flaws must be resolved). The order is not important for completeness, but it is important for efficiency. 
all_flaws.select_one Implementing plan.open_goals(): we find unachieved subgoals incrementally: • the goal conditions π0 are the initial unachieved subgoals • when adding an action: all preconditions are unachieved subgoals • when adding a causal link, the protected proposition is no longer unachieved Implementing plan.threats(): we find threats incrementally: • no threats in the goal conditions π0 • when adding an action anew to π = (A, ≺, B, L): • for every causal link (ai → [p] → aj ) ∈ L – if (anew ≺ ai ) or (aj ≺ anew ), then next link – else for every effect q of anew – if ∃σ : σ(p) = σ(̸ q)) then q of anew threatens (ai → [p] → aj ) • when adding a causal link (ai → [p] → aj ) to π = (A, ≺, B, L): • for every action aold ∈ A – if (aold ≺ ai ) or (aj ≺ aold ), then next action – else for every effect q of aold – if ∃σ : σ(p) = σ(̸ q)) then q of aold threatens (ai → [p] → aj ) Implementing flaw.get_resolvers(plan): for an unachieved precondition p of ag : 598 CHAPTER 20. PLANNING 599 20.4. PLAN-SPACE (PARTIAL-ORDER) PLANNING • add causal links to an existing action: • for every action aold ∈ A, see if an existing action can be a provider for this precondition – if (ag = aold ) or (ag ≺ aold ) then next action (i.e. if the existing action is the consumer or it comes after the consumer, then it cannot be a producer, so just move on to the next action) – else for every effect q of aold , check if the existing action produces a precondition that is equal to the unachieved precondition: – if ∃σ : σ(p) = σ(q)) then adding aold → [σ(p)] → ag is a resolver • add a new action and a causal link (i.e. create a new provider): • for every effect q of every operator o – if ∃σ : σ(p) = σ(q)) then adding anew = o.newInstance() and anew → [σ(p)] → ag is a resolver For an effect q of action at threatening ai → [p] → aj : • • • • • • • • order action before threatened link: if (at = ai ) or (aj ≺ at ), then not a resolver else adding (at ≺ ai ) is a resolver order threatened link before action: if (at = ai ) or (at ≺ ai ), then not a resolver else adding (aj ≺ at ) is a resolver extend variable bindings such that unification fails: for every variable v in p or q – if v ̸= σ(v ) is consistent with B then adding v ̸= σ(V ) is a resolver Implementing plan.refine(resolver): refines a partial plan by adding elements specified in resolver, i.e.: • • • • an ordering constraint; one or more binding constraints; a causal link; and/or a new action This may introduce new flaws, so we must update the flaws (i.e. plan.threats()). plan.open_goals() and Implementing ordering constraints: ordering constraint management can be implemented as an independent module with two operations: • querying whether (ai ≺ aj ) • adding (ai ≺ aj ) CHAPTER 20. PLANNING 599 20.5. TASKS 600 Implementing variable binding constraints: Types of constraints: • unary constraints: x ∈ Dx • equality constraints: x = y • inequalities: x ̸= y Unary and equality constraints can be dealt with in linear time, but inequality constraints cause exponential complexity here - with inequalities, this is a general constraint satisfaction problem which is NP-complete. So these variable binding constraints can become problematic. 
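A minimal sketch of such an ordering-constraint module (the class and method names are assumptions, not from the text): querying whether (a_i ≺ a_j) is graph reachability over the stored constraints, and additions that would make the ordering circular are rejected.

class OrderingConstraints:
    # Stores ordering constraints (a_i before a_j) and answers precedence queries.
    def __init__(self):
        self.after = {}   # maps each action to the set of actions known to come after it

    def precedes(self, ai, aj):
        # query whether (ai before aj) follows from the stored constraints (reachability)
        seen, stack = set(), [ai]
        while stack:
            a = stack.pop()
            for b in self.after.get(a, ()):
                if b == aj:
                    return True
                if b not in seen:
                    seen.add(b)
                    stack.append(b)
        return False

    def add(self, ai, aj):
        # adding (ai before aj) is allowed only if it does not create a cycle
        if ai == aj or self.precedes(aj, ai):
            return False
        self.after.setdefault(ai, set()).add(aj)
        return True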
20.4.6 The UC Partial-Order Planning (UCPoP) Planner This is slight variation to PSP as outlined above, in which threats are instead dealt with after an open goal is dealt with (that is, it deals with threats from the resolver that was used to deal with an open goal, which is to say that threats are resolved as part of the successor generation process). UCPoP takes in a slightly different input as well. In addition to the partial plan, it also takes in an agenda, which is a set of (a, p) pairs where a is an action and p is one of its preconditions (i.e. this is a list of things that still need to be dealt with, that is the remaining open goal flaws). 20.5 Tasks With task network planning, we still have terms, literals, operators, actions, state transition functions, and plans. A few new things are added: • the tasks to be performed • methods describing ways in which tasks can be performed • organized collections of tasks called task networks Formally: • we have task symbols Ts = {t1 , . . . , tn } for giving unique names to tasks • the operator names must be ⊊ Ts (that is, it must be a subset of the task symbols and cannot be equal to the entire set). Operator names that correspond to task symbols are called primitive tasks. • non-primitive task symbols are Ts − operator names, i.e. task symbols with no corresponding operators. • task: ti (r1 , . . . , rk ) • ti is the task symbol (primitive or non-primitive) • r1 , . . . , rk are terms or objects manipulated by the task 600 CHAPTER 20. PLANNING 601 20.5. TASKS • a ground task is one in which all its parameters are ground (i.e. they are actual objects, not variables) • action: a = op(c1 , . . . , ck ), where op is an operator name and c1 , . . . , ck are constants representing parameters for the action, accomplishes ground primitive task ti (r1 , . . . , rk ) in state s if and only if: • name(a) = ti , i.e. the name of the action must be the task symbol, and c1 = r1 and . . . and ck = rk (i.e. the parameters must be the same) • a is applicable in s 20.5.1 Simple Task Networks (STN) We can group tasks into task networks. A simple task network w is an acyclic directed graph (U, E) in which: • the node set U = {t1 , . . . , n} is a set of tasks • the edges in E define a partial ordering of the tasks in U A task network w is ground/primitive if all tasks tu ∈ U are ground/primitive, otherwise it is unground/non-primitive. A task network may also be totally ordered or partially ordered. We have an ordering tu ≺ tv in w if there is a path from tu to tv . A network w is totally ordered if and only if E defines a total order on U (that is, that every node is ordered with respect to every other node in the network). If w is totally ordered, we can represent it as a sequence of tasks t1 , . . . , tn ). Let w = t1 , . . . , tn be a totally ordered, ground, primitive STN. Then the plan π(w ) is defined as: π(w ) = a1 , . . . , an Where ai = ti , 1 ≤ i ≤ n. Simple task networks are a simplification of the more general hierarchical task networks (HTNs). Example (DWR) Tasks: • t1 = take(crane, loc, c1 , c2 , p1 ) (primitive, because we have an operator of that same name in the DWR domain, and ground, because all arguments here are objects and not variables) • t2 = take(crane, loc, c2 , c3 , p1 ) (primitive, ground) • t3 = move-stack(p1 , q) (non-primitive, because we do not have an operator named “movestack” in the DWR domain, and unground because q is a variable) CHAPTER 20. PLANNING 601 20.5. 
TASKS 602 Task networks: • w1 = ({t1 , t2 , t3 }, {(t1 , t2 ), (t1 , t3 )}) (partially-ordered, because we don’t have an order specified for t2 , t3 , non-primitive, because the non-primitive task t3 is included, and unground, because the unground task t3 is included. • w2 = ({t1 , t2 }, {(t1 , t2 )}) (totally ordered, ground, primitive) • π(w2 ) = t1 , t2 20.5.2 Methods Methods are plan refinements (i.e. they correspond to state transitions in our search space). Let MS be a set of method symbols. A STN method is a 4-tuple m = (name(m), task(m), precond(m), network(m where: • name(m) is the name of the method • it is a syntactic expression of the form n(x1 , . . . , xk ) – n ∈ MS is a unique method symbol – x1 , . . . , xk are all the variable symbols that occur in m • task(m) is a non-primitive task (primitive tasks can just be accomplished by an operator) accomplished by this method • precond(m) is a set of literals; the method’s preconditions • network(m) is a task network (U, E) containing the set of subtasks, which is U, of m Example (DWR) Say we want to define a method which involves taking the topmost container of a stack and moving it. If you recall, we have the operators take and put, which we can use to define this method. The name of the method could be take-and-put(c, k, l, po , pd , xo , xd ). The task this method completes would be move-topmost(po , pd ). The preconditions would be top(c, p0 ), on(c, xo ), attached(po , l), belong( Finally, the subtasks would be take(k, l, c, xo , po ), put(k, l, c, xd , pd ). Where: • • • • • • • 602 c is the container to move k is the crane to use l is the location po is the original pile pd is the destination pile xo is the container from which we are taking c xd is the container which we are placing c on CHAPTER 20. PLANNING 603 20.5. TASKS Applicability and relevance A method instance m is applicable in a state s if: • precond+ (m) ⊆ s • precond− (m) ∩ s = ∅ A method instance m is relevant for a task t if there is a substitution σ such that σ(t) = task(m) Decomposition of tasks The decomposition of an individual task t by a relevant method m under σ is either: • δ(t, m, σ) = σ(network(m)) or • δ(t, m, σ) = σ(subtasks(m)) if m is totally ordered δ is called the decomposition method for a task given a method and a substitution. That is, we break the task down into its subtasks. The decomposition of tasks in a STN is as follows: Let: • w = (U, E) be a STN • t ∈ U be a task with no predecessors in w • m be a method that is relevant for t under some substitution σ with network(m) = (Um , Em ) The decomposition of t in w by m under σ is the STN δ(w , t, m, σ) where: • t is replaced in U by σ(Um ) • edges in E involving t are replaced by edges to appropriate nodes in σ(Um ) That is, we replace the task with its subtasks. 20.5.3 Planning Domains and Problems An STN planning domain is a pair D = (O, M) where: • O is a set of STRIPS planning operators • M is a set of STN methods D is a total-order STN planning domain if every m ∈ M is totally ordered. An STN planning problem is a 4-tuple P = (si , wi , O, M) where: CHAPTER 20. PLANNING 603 20.5. TASKS 604 • si is the initial state (a set of ground atoms) • wi is a task network called the initial task network • D = (O, M) is an STN planning domain So it is quite similar to a STRIPS planning problem (just the domain also includes STN methods and we have a initial task network). P is a total-order STN planning problem if wi and D are both totally ordered. A plan π = a1 , . . . 
, an is a solution for an STN planning problem P = (si , wi , O, M) if: • wi is empty and π is empty (i.e. if we had no tasks, doing nothing is a solution) or: • there is a primitive task t ∈ wi that has no predecessors in wi (i.e. it is one of the first tasks) • a1 = t is applicable in si • pi ′ = a2 , . . . , an is a solution for P ′ = (γ(si , a1 ), wi − {t}, O, M) (i.e. recurse) or: • there is a non-primitive task t ∈ wi that has no predecessors in wi (i.e. it is one of the first tasks) • a method m ∈ M is relevant for t, i.e. σ(t) = task(m) and applicable in si • π is a solution for P ′ = (si , δ(wi , t, m, σ), O, M) 20.5.4 Planning with task networks The Ground Total order Forward Decomposition (Ground-TFD) procedure function GroundTFD(s1, (t1, ..., tk), O, M): if k=0 return [] if t1.is_primitive() then actions = {(a, sigma) | a = sigma(t1) and a.applicable_in(s)} if not actions then return failure (a, sigma) = actions.choose_one() # non-deterministic choice, we may need to backtrack to her plan = GroundTFD(gamma(s,a), sigma((t2, ..., tk)), O, M) if plan = failure then return failure else return [a].extend(plan) else: methods = {(m, sigma) | m.relevant_for(sigma(t1) and m.applicable_in(s)} if not methods then return failure (m, sigma) = methods.choose_one() plan = subtasks(m).extend(sigma((t2, ..., tk))) return GroundTFD(s, plan, O, M) 604 CHAPTER 20. PLANNING 605 20.5. TASKS TFD considers only applicable actions, much like forward search, and it only considers relevant actions, much like backward search. Ground-TFD can be generalized to Lifted-TFD, giving the same advantages as lifted backward search (e.g. least commitment planning). The Ground Partial order Forward Decomposition (Ground-PFD) procedure function GroundPFD(s1, w, O, M): if w.U = {} return [] task = {t in U| t.no_predecessors_in(w.E)}.choose_one() if task.is_primitive() then actions = {(a, sigma) | a = sigma(t1) and a.applicable_in(s)} if not actions then return failure (a, sigma) = actions.choose_one() plan = GroundPFD(gamma(s,a), sigma(w-{task}), O, M) else: methods = {(m, sigma) | m.relevant_for(sigma(t1) and m.applicable_in(s)} if not methods then return failure (m, sigma) = methods.choose_one() return GroundPFD(s, delta(w,task,m,sigma), O, M) 20.5.5 Hierarchical Task Network (HTN) planning HTN planning is more general than STN planning, which also means there is no single algorithm for implementing HTN planning. In STN planning, we had two types of constraints: • • • • • ordering constraints, which were maintained in the network preconditions (constraints on a state before a method or action is applied): enforced as part of the planning procedure must know the state to test for applicability must perform forward search HTN planning has the flexibility to use these constraints or other arbitrary constraints as needed; that is, it maintains more general constraints explicitly (in contrast, with STN planning the constraints are embedded as part of the network or the planning). For instance, we could include constraints on the resources used by the tasks. HTN methods are different than STN methods. Formally: Let MS be a set of method symbols. An HTN method is a 4-tuple m = (name(m), task(m), subtasks(m), constr(m where: • name(m), task(m) are the same as with STN methods CHAPTER 20. PLANNING 605 20.6. GRAPHPLAN 606 • (subtasks(m), constr(m)) is a hierarchical task network, with subtasks (similar to STN methods) but also arbitrary constraints. 
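Stepping back to the decomposition procedures above, the Ground-TFD pseudocode can be rendered as a small runnable Python sketch for the ground, totally ordered case, where the non-deterministic choice points become backtracking. The dictionary-based encoding of operators and methods is an assumption made for this sketch, not part of the formal definitions above.

def ground_tfd(state, tasks, operators, methods):
    """Ground total-order forward decomposition (sketch).
    state     -- set of ground atoms
    tasks     -- list of ground task names, in order
    operators -- primitive task name -> (precond, add, delete), all sets
    methods   -- non-primitive task name -> list of (precond, subtasks)
    Returns a list of primitive actions, or None on failure."""
    if not tasks:
        return []
    t, rest = tasks[0], tasks[1:]
    if t in operators:                       # primitive task: apply its operator
        pre, add, delete = operators[t]
        if not pre <= state:
            return None
        new_state = (state - delete) | add   # gamma(s, a)
        tail = ground_tfd(new_state, rest, operators, methods)
        return None if tail is None else [t] + tail
    # non-primitive task: try each relevant, applicable method (backtrack point)
    for pre, subtasks in methods.get(t, []):
        if pre <= state:
            plan = ground_tfd(state, list(subtasks) + rest, operators, methods)
            if plan is not None:
                return plan
    return None

Ground-PFD differs only in that it may pick any task without predecessors in the partially ordered network, rather than always the first task of a sequence.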
So the main difference between HTN and STN is that HTN can handle arbitrary constraints, which makes it more powerful, but also more complex. HTN vs STRIPS planning The STN/HTN formalism is more expressive; you can encode more problems with this formalism than you can with STRIPS. For example, STN/HTN planning can encode undecidable problems. However, if you leave out the recursive aspects of STN planning, you can translate such a nonrecursive STN problem into an equivalent STRIPS problem. However the size of the problem may become exponentially larger. There is also a set of STN domains called “regular” STN domains which are equivalent to STRIPS. It is important to note that STN/HTN and STRIPS planning are meant to solve different kinds of problems - STN/HTN for task-based planning, and STRIPS for goal-based planning. 20.6 Graphplan Like STRIPS, a planning problem for Graphplan consists of a set of operators constituting a domain, an initial state, and set of goals that need to be achieved. A major difference is that Graphplan works on a propositional representation, i.e. the atoms that make up the world are no longer structured - they don’t consist of objects and their relationships but of individual symbols (facts about the world) which can be either true or false. Actions are also individual symbols, not parameterized actions. However, note that every STRIPS planning problem can be translated into an equivalent propositional problem, so long as its operators have no negative preconditions. The Graphplan algorithm creates a data structure called a planning graph. The algorithm has two major steps: 1. the planning graph is expanded with two new layers, an action layer and a proposition layer 2. the graph is searched for a plan The initial layer is a layer of propositions that are true in the initial state. Then a layer of actions applicable given this initial layer is added, followed by a proposition layer of those propositions that would be true after these actions (including those that were true before which have not been altered by the actions). This expansion step runs in polynomial time (so it is quite fast). The search step searches backwards, searching from the last proposition layer in the plan and goes backwards to the initial state. The search itself can be accomplished with something like A*. If a plan is not found, the algorithm goes back to the first step. 606 CHAPTER 20. PLANNING 607 20.6. GRAPHPLAN An example Graphplan planning graph, from Jiří Iša Example: Simplified DWR problem: • location 1 • robot r • container a • location 2 • robot q • container b • robots can load and unload autonomously • locations may contain unlimited number of robots and containers • problem: swap locations of containers (i.e. we want container a at location 2 and container b at location 1) Here are the STRIPS operators we could use in this domain: • move(r,l,l’) • precond: at(r,l), adjacent(l,l’) CHAPTER 20. PLANNING 607 20.6. GRAPHPLAN • • • • • • • 608 effects: at(r,l), not at(r,l) load(c,r,l) precond: at(r,l), in(c,l), unloaded(r) effects: loaded(r,c), not in(c,l), not unloaded(r) unload(c,r,l) precond: at(r,l), loaded(r,c) effects: unloaded(r), in(c,l), not loaded(r,c) There are no negative preconditions here so we can translate this into a propositional representation. Basically for each operator, we have to consider every possible configuration (based on the preconditions), and each configuration is given a symbol. For example, for move(r,l,l’), we have two robots and two locations. 
So there are four possibilities for at(r,l) (i.e. at(robot r, location 1), at(robot q, location 1), at(robot r, location 2), at(robot q, location 2)), and because we have only two locations, there is only one possibility for adjacent(l,l’), so we will have eight symbols for that STRIPS operator in the propositional representation. So for instance, we could use the symbol r1 for at(robot r, location 1), the symbol r2 for at(robot r, location 2) , the symbol ur for unloaded(robot r), and so on. Then we’d represent the initial state like so: {r1, q2, a1, b2, ur, uq}. Our actions are also symbolically represented. For instance, we could use the symbol Mr12 for move(robot r, location 1, location 2). A propositional planning problem P = (Σ, si , g) has a solution if and only if Sg ∩ Γ> ({si }) ̸= ∅, where Γ> ({si }) is the set of all states reachable from the initial state. We can identify reachable states by constructing a reachability tree, where: • the root is the initial state si • the children of a node s are Γ({s}) • arcs are labeled with actions All nodes in the reachability tree is denoted Γ> ({si }). All nodes up to depth d are Γd ({si }). These trees are usually very large: there are O(k d ) nodes, where k is the number of applicable actions per state. So we cannot simply traverse the entire tree. Instead, we can construct a planning graph. A planning graph is a layered (layers as mentioned earlier) directed graph G = (N, E), where: • N = P0 ∪ A1 ∪ P1 ∪ A2 ∪, . . . • P0 , P1 , . . . are state proposition layers • A1 , A2 , . . . are action layers 608 CHAPTER 20. PLANNING 609 20.6. GRAPHPLAN The first proposition layer P0 has the propositions in the initial state si . An action layer Aj has all actions a where precond(a) ⊆ Pj−1 . A proposition layer Pj has all propositions p where p ∈ Pj−1 or ∃a ∈ Aj : p ∈ effects+ (a). Note that we do not look at negative effects; we never remove negative effects from a layer. As a result, both proposition layers and action layers increase (grow larger) monotonically as we move forward through the graph. We create arcs throughout the graph like so: • • • • • from proposition p ∈ Pj−1 to action a ∈ Aj if p ∈ precond(a) from action a ∈ Aj to layer p ∈ Pj positive arc if p ∈ effects+ (a) negative arc if p ∈ effects− (a) no arcs between other layers If a goal g is reachable from the initial state si , then there will be some proposition layer Pg in the planning graph such that g ⊆ Pg . This is a necessary condition, but not sufficient because the planning graph’s proposition layers contain propositions that may be true depending on the selected actions in the previous action layer; furthermore, these proposition layers may contain inconsistent propositions (e.g. a robot cannot be in two different locations simultaneously). Similarly, actions in an action layer may not be applicable at the same time (e.g. a robot cannot move to two different locations simultaneously). The advantage of the planning graph is that it is of polynomial size and we can evaluate this necessary condition in polynomial time. 20.6.1 Action independence Actions which cannot be executed simultaneously/in parallel are dependent, otherwise, they are independent. More formally: Two actions a1 and a2 are independent if and only if: • effects− (a1 ) ∩ (precond(a2 ) ∪ effects+ (a2 )) = ∅ and • effects− (a2 ) ∩ (precond(a1 ) ∪ effects+ (a1 )) = ∅ (that is, they don’t interfere with each other) A set of actions π is independent if and only if every pair of actions a1 , a2 ∈ π is independent. 
In pseudocode: CHAPTER 20. PLANNING 609 20.6. GRAPHPLAN 610 function independent(a1, a2): for each p in neg_effects(a1): if p in precond(a2) or p in pos_effects(a2): return false for each p in neg_effects(a2): if p in precond(a1) or p in pos_effects(a1): return false return true A set π of independent actions is applicable to a state s if and only if ∪ a∈π precond(a) ⊆ s. The result of applying the set π in s is defined as γ(s, π) = (s − effects − (π)) ∪ effects+ (π), where: ∪ • precond(π) = a∈π precond(a) ∪ • effects+ (π) = a∈π effects+ (a) ∪ • effects− (π) = a∈π effects− (a) 20.6.2 Independent action execution order To turn a set of independent actions into a sequential plan: If a set π of independent actions is applicable in state s, then for any permutation a1 , . . . , ak of the elements of π: • the sequence a1 , . . . , ak is applicable to s • the state resulting from the application of π to s is the same as from the application of a1 , . . . , ak , i.e. γ(s, π) = γ(s, (a1 , . . . , ak )). Which is to say that the execution order doesn’t matter for a set of independent actions, we can execute these actions in any order we like (intuitively, this makes sense, because none of them interfere with each other, so they can happen in any order). 20.6.3 Layered plans Let P = (A, si , g) be a statement of a propositional planning problem and let G = (N, E) be the planning graph as previously defined. A layered plan over G is a sequence of sets of actions Π = π1 , . . . , πk where: • π i ⊆ Ai ⊆ A • πi is applicable in state Pi−1 • the actions of πi are independent A layered plan Π is a solution to a planning problem P if and only if: • π1 is applicable in si • for j ∈ {2, . . . , k}, pij is applicable in state γ(. . . γ(γ(si , π1 ), π2 ), . . . πj−1 ) • g ⊆ γ(. . . γ(γ(si , π1 ), π2 ), . . . , πk ) 610 CHAPTER 20. PLANNING 611 20.6.4 20.6. GRAPHPLAN Mutual exclusivity (mutex) In addition to the nodes and edges thus far mentioned, there are also mutual exclusivity (mutex) propositions and actions. Two propositions in a layer may be incompatible if: • the only actions which produce them are dependent actions • they are the positive and negative effects of the same action We introduce a No-Op operation for a proposition p, notated Ap. They carry p from one proposition layer to the next. Thus their only precondition is p, and their only effect is p. These were implied previously when we said that all propositions are carried over to the next proposition layer; we are just making them explicit through these No-Op actions. With these no-op actions, we can now encode the second reason for incompatible propositions (i.e. if they are positive and negative effects of the same action) as the first (i.e. they are produced by dependent actions), the dependent actions now being the no-op action and the original action. We say that two propositions p and q in proposition layer Pj are mutex (mutually exclusive) if: • every action in the preceding action layer Aj that has p as a positive effect (including no-op actions) is mutex with every action in Aj that has q as a positive effect • there is no single action in Aj that has both p and q as positive effects Notation: µPj = {(p, q)|p, q ∈ Pj are mutex} In pseudocode, this would be: function mutex_proposition(p1, p2, mu_a_j): for each a1 in p1.producers: for each a2 in p2.producers: if (a1, a2) not in mu_a_j: return false return true See below for the definition mu_a_j. 
Two actions a1 and a2 in action layer Aj are mutex if: • a1 and a2 are independent, or • a precondition of a1 is mutex with a precondition of a2 Notation: µAj = {(a1 , a2 )|a1 , a2 ∈ Aj are mutex} In pseudocode, this would be: CHAPTER 20. PLANNING 611 20.6. GRAPHPLAN 612 function mutex_action(a1, a2, mu_P): if not independent(a1, a2): return true for each p1 in precond(a1): for each p2 in precond(a2): if (p1, p2) not in mu_P: return true return false How do mutex relations propagate through the planning graph? If p, q ∈ Pj−1 and (p, q) ∈ / µPj−1 then (p, q) ∈ / µPj . Proof: • if p, q ∈ Pj−1 then Ap, Aq ∈ Aj (reminder: Ap, Aq are no-op operations for p, q respectively) • if (p, q) ∈ / µ¶j−1 then (Ap, Aq) ∈ / µAj • since Ap, Aq ∈ Aj and (Ap, Aq) ∈ / µAj , (p, q) ∈ µPj must hold If a1 , a2 ∈ Aj−1 and (a1 , a2 ) ∈ / µAj−1 then (a1 , a2 ) ∈ / µAj . Proof: • • • • • if a1 , a2 ∈ Aj−1 and (a1 , a2 ) ∈ / µAj−1 then a1 , a2 are independent their preconditions in Pj−1 are not mutex both properties remain true for Pj hence a1 , a2 ∈ Aj and (a1 , a2 ) ∈ / µAj So mutex relations decrease in some sense further down the planning graph. Actions with mutex preconditions p and q are impossible, and as such, we can remove that action from the graph. 20.6.5 Forward planning graph expansion This is the process of growing the planning graph. Theoretically, the planning graph is infinite, but we can set a limit on it given a planning problem P = (A, si , g). If g is reachable from si , then there is a proposition layer Pg such that g ⊆ Pg and ̸ ∃g1 , g2 ∈ g : (g1 , g2 ) ∈ µPg (that is, there are no pairs of goal propositions that are mutually exclusive in the proposition layer Pg ). The basic idea behind the Graphplan algorithm: • expand the planning graph, one action layer and one proposition layer at a time 612 CHAPTER 20. PLANNING 613 20.6. GRAPHPLAN • stop expanding when we reach the first graph for which Pg is the last proposition layer such that: • g ⊆ Pg • ̸ ∃g1 , g2 ∈ g : (g1 , g2 ) ∈ µPg • search backwards from the last proposition layer Pg for a solution Pseudocode for the expand step: • • • • • • function expand(Gk−1 ) Ak = {a ∈ A|precond(a) ⊆ Pk−1 and{(p1 , p2 )|p1 , p2 ∈ precond(a)} ∩ µPk−1 = ∅} µAk = {(a1 , a2 )|a1 , a2 ∈ Ak , a1 ̸= a2 , andmutex(a1 , a2 , µPk−1 )} Pk = {p|∃a ∈ Ak : p ∈ effects+ (a)} µPk = {(p1 , p2 )|p1 , p2 ∈ Pk , p1 ̸= p2 , andmutex(p1 , p2 , µAk )} for all a ∈ Ak – prek = prek ∪ ({p|p ∈ Pk−1 andp ∈ precond(a)} × a) – ek+ = ek+ ∪ (a × {p|p ∈ Pk andp ∈ effects+ (a)}) – ek− = ek− ∪ (a × {p|p ∈ Pk andp ∈ effects− (a)}) The size of a planning graph up to level k and the time required to expand it to that level are both polynomial in the size of the planning problem. Proof: given a problem size of n propositions and m actions, |Pj | ≤ n, |Aj | ≤ n + m, including no-op actions. The algorithms for generating each layer and all link types are polynomial in the size of the layer and we have a linear number of layers k. Eventually a planning graph G will reach a fixed-point level, which is the kth level such that for all i , i > k, level i of G is identical to level k, i.e. Pi = Pk , µPi = µPk , Ai = Ak , µAi = µAk . This is because as Pi grows monotonically, µPi shrinks monotonically, and Ai , Pi depend only on Pi−1 , µPi−1 . 20.6.6 Backward graph search This is just a depth-first graph search, starting from the last proposition layer Pk , where the search nodes are subsets of nodes from the different layers. 
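Before the backward search is described in detail, here is a compact Python sketch of the forward expansion step defined above. Actions are encoded as (name, precond, add, delete) tuples of frozensets, an encoding chosen only for this sketch, and no-op actions are generated on the fly to carry propositions forward.

def expand_layer(P_prev, mu_P_prev, actions):
    """One Graphplan expansion step: from proposition layer P_prev (a set)
    and its mutex pairs mu_P_prev, build the next action layer with its
    mutexes and the next proposition layer with its mutexes."""
    noops = [("noop-" + p, frozenset({p}), frozenset({p}), frozenset())
             for p in P_prev]
    candidates = list(actions) + noops

    # A_k: preconditions present in P_prev and no two preconditions mutex
    A = [a for a in candidates
         if a[1] <= P_prev
         and not any((p, q) in mu_P_prev for p in a[1] for q in a[1])]

    def independent(a1, a2):
        return (not (a1[3] & (a2[1] | a2[2]))
                and not (a2[3] & (a1[1] | a1[2])))

    # mu_A_k: pairs that are dependent, or have mutex preconditions
    mu_A = {(a1, a2) for a1 in A for a2 in A if a1 != a2
            and (not independent(a1, a2)
                 or any((p, q) in mu_P_prev for p in a1[1] for q in a2[1]))}

    # P_k: all positive effects of actions in A_k (negative effects never remove)
    P = set().union(*(a[2] for a in A))

    # mu_P_k: p, q mutex if every producer of p is mutex with every producer of q
    producers = {p: [a for a in A if p in a[2]] for p in P}
    mu_P = {(p, q) for p in P for q in P if p != q
            and all((a1, a2) in mu_A
                    for a1 in producers[p] for a2 in producers[q])}
    return A, mu_A, P, mu_P

Expansion stops once all goal propositions appear pairwise non-mutex in the latest layer (or the fixed point is reached); the backward search described next then tries to extract a plan.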
The general idea: • let g be the set of goal propositions that need to be achieved at a given proposition layer Pj (starting with the last layer) • find a set of actions πj ⊆ Aj such that these actions are not mutex and together achieve g • take the union of the preconditions of πj as the new (sub)goal set to be achieved in the proposition layer Pj−1 CHAPTER 20. PLANNING 613 20.6. GRAPHPLAN 614 When implementing this search, we want to keep track of, for each proposition layer, which set of subgoals have failed the search (i.e. are unachievable). The motivation is that if we search and run into a failure state, we have to backtrack, and we don’t want to end up in the same failure state some other way. We use a nogood table ∇ to keep track of what nodes we’ve seen before. Up to layer k, the nogood table is an array of k sets of sets of goal propositions. That is, for each layer we have a set of sets. The inner sets represent a single combination of propositions that cannot be achieved. The outer set, then, contains all combinations of propositions that cannot be achieved for that layer. So before searching for the set g in a proposition layer Pj , we first check whether or not g ∈ ∇(j); that is, check to see g has already been determined unachievable for Pj . Otherwise, if we do search for g in Pj and find that it is unachievable, we add g to ∇(j). For this backward search, we define a function extract: • • • • • • • function extract(G, g, i ) if i = 0 then return () if g ∈ ∇(i ) then return failure Π = gpSearch(G, g, {}, i ) if Π ̸= failure then return Π ∇(i ) = ∇(i ) + g return failure The function gpSearch is defined as: • function gpSearch(G, g, π, i ) • if g = {} then ∪ – Π = extract(G, a∈π precond(a), i − 1) – if Π = failure then return failure – return concat(Π, (π)) • • • • • p = g.selectOneSubgoal() providers = {a ∈ Ai |p ∈ effects+ (a)and ̸ ∃a′ ∈ π : (a, a′ ) ∈ µAi } if providers = {} then return failure a = providers.chooseOne() (we may need to backtrack to here) return gpSearch(G, g − effects+ (a), π + a, i ) We can combine everything into the complete Graphplan algorithm: • function graphplan(A, si , g) • i = 0, ∇ = [], P0 = si , G = (P0 , {}) • while (g ⊊ Pi org 2 ∩ µPi ̸= ∅) and ̸ fixedPoint(G) do – i = i + 1; expand(G) 614 CHAPTER 20. PLANNING 615 • • • • 20.7. OTHER CONSIDERATIONS if g ⊊ Pi or g 2 ∩ µPi ̸= ∅ then return failure η = fixedPoint(G)?|∇(k)| : 0 (ternary operator) Π = extract(G, g, i ) while Π = failure do – – – – – i = i + 1; expand(G) Π = extract(G, g, i ) if Π = failure and fixedPoint(G) then if η = |∇(k)| then return failure η = |∇(k)| • return Π The Graphplan algorithm is sound, complete, and always terminates. The plan that is returned (if the planning problem has a solution, otherwise, no plan is returned) will have a minimal number of layers, but not necessarily a minimal number of actions. It is orders of magnitude faster than the previously-discussed techniques due to the planning graph structure (the backwards search still takes exponential time though). 20.7 Other considerations 20.7.1 Planning under uncertainty Thus far all the approaches have assumed the outcome of actions are deterministic. However, the outcome of actions are often uncertain, so the resulting state is uncertain. One approach is belief state search. A belief state is a set of world states, one of which is the true state, but we don’t know which one. The resulting solution plan is a sequence of actions. Another approach is contingency planning. 
The possible outcomes of an action are called contingencies. The resulting solution plan is a tree that branches according to contingencies. At branching points, observation actions are included to see which branch is actually happening. Both of these approaches are naturally way more complex due to multiple possible outcomes for actions. If we can quantify the degree of uncertainty (i.e. we know the probabilities of different outcomes given an action) we have, we can use probabilistic planning. Instead of simple state transition systems, we use a partially observable Markov decision process (POMDP) as our model: • a set of world states S • a set of actions A, actions are applicable in certain states: s ∈ S : A(s) ⊆ A • cost function, gives the cost of an action in a given state: c(a, s) > 0 for s ∈ S and a ∈ A CHAPTER 20. PLANNING 615 20.7. OTHER CONSIDERATIONS 616 • transition probabilities: Pa (s ′ |s) for s, s ′ ∈ S and a ∈ A (probability of state s ′ when executing action a in state s) • initial belief state (probability distribution over all states in S) • final belief state (corresponds to the goal) • solution (called a “policy”): a function from states to actions, i.e. given a state, this is the action we should execute • we want the optimal policy, i.e. the policy with the minimal expected cost 20.7.2 Planning with time So far we have assumed actions are instantaneous but in reality, they take time. That is, we should assume that our actions are durative (they take time). We can assume that actions take a known amount of time, with a start time point and an end time point. With A* we can include time as part of an action’s cost. With partial plans (e.g. HTN) we can use a temporal constraint manager, which could be: • time point networks: associates all time points in a given plan, where we assert relations between different time points (e.g. that t1 < t2 ) • interval algebra: instead of relating time points, we relate the intervals that correspond to the action execution (e.g. we assert that interval i1 must occur before i2 or that i3 must occur during i4 ) One way: We can specify an earliest start time ES(s) and a latest start time LS(s) for each task/state in the network. We define ES(s0 ) = 0. For any other state: ES(s) = max ES(A) + duration(A) A→a Where A → s just denotes each predecessor state of s. We define LS(sg ) = ES(sg ). For any other state: LS(s) = min LS(B) − duration(s) s→B Where s → B just denotes each successor state of s. 616 CHAPTER 20. PLANNING 617 20.7. OTHER CONSIDERATIONS 20.7.3 Multi-agent planning So far we have assumed planners working in isolation, with control over all the other agents in the plan. Planners may need to work in concert with other planners or other agents (e.g. people); such scenarios introduce a few new concerns, such as how plans and outcomes are represented and communicated across agents. Multi-agent planning does away with the assumption of an isolated planner, and results in a much more complex problem: • • • • • agents with different beliefs agents with different capabilities agents with joint goals agents with individual/conflicting goals joint actions (multiple coordinating agents required to accomplish an action) Other things that must be considered during execution of multi-agent plans: • coordination (ordering constraints, sharing resources, joint actions) • communication (e.g. 
communicating results) • execution failure recovery (local plan repair, propagating changes to the plan across agents) 20.7.4 Scheduling: Dealing with resources Actions need resources (time can also be thought of as a resource). Planning which deals with resources is known as scheduling. A resource is an entity needed to perform an action, which are described by resource variables. A distinction between state and resource variables: • state variables: modified by actions in absolute ways • resource variables: modified by actions in relative ways Some resource types include: • • • • • reusable vs consumable discrete vs continuous unary (only one available) shareable resources with states Planning approaches can be arranged in a table: CHAPTER 20. PLANNING 617 20.8. LEARNING PLANS 618 Deterministic Fully observable Stochastic A*, depth-first, breadth-first MDP Partially observable POMDP 20.8 Learning plans There are a few ways an agent can learn plans. Presented here are some simpler ones; see the section on Reinforcement Learning for more sophisticated methods. 20.8.1 Apprenticeship We can use machine learning to learn how to navigate a state space (i.e. to plan) by “watching” another agent (an “expert”) perform. For example, if we are talking about a game, the learning agent can watch a skilled player play, and based on that learn how to play on its own. In this case, the examples are states s, the candidates are pairs (s, a), and the “correct” actions are those taken by the exprt. We define features over (s, a) pairs: f (s, a). The score of a q-state (s, a) is given by w · f (s, a). This is basically classification, where the inputs are states and the labels are actions. 20.8.2 Case-Based Goal Formulation With case-based goal formulation, a library of cases relevant to the problem is maintained (e.g. with RTS games, this could be a library of replays for that game). Then the agent uses this library to select a goal to pursue, given the current world state That is, the agent finds the state case q (from a case library L) most similar to the current world state s: q = argmin distance(s, c) c∈L Where the distance metric may be domain independent or domain specific. Then, the goal state g is formulated by looking ahead n actions from q to a future state in that case q ′ , finding that difference, and adding that to the current world state s: g = s + (q ′ − q) The number of actions n is called the planning window size. A small planning window is better for domains where plans are invalidated frequently. 618 CHAPTER 20. PLANNING 619 20.9. REFERENCES 20.9 References • Integrating Learning in a Multi-Scale Agent. Ben G. Weber. 2012. • CS188: Artificial Intelligence. Dan Klein, Pieter Abbeel. University of California, Berkeley (edX). • Logical Foundations of Artificial Intelligence (1987) (Chapter 12: Planning) • Planning Algorithms. Steven M. LaValle. 2006. • Artificial Intelligence Planning. Dr. Gerhard Wickler, Prof. Austin Tate. The University of Edinburgh (Coursera). 2015. • Intro to Artificial Intelligence. CS271. Peter Norvig, Sebastian Thrun. Udacity. • Comparison of State-Space Planning and Plan-Space Planning. Planning (Fall 2001). Carnegie Mellon University. Manuela Veloso. CHAPTER 20. PLANNING 619 20.9. REFERENCES 620 620 CHAPTER 20. PLANNING 621 21 Reinforcement learning A quick refresher - a Markov Decision Process (MDP) involves a set of states and actions. Actions connect states with some uncertainty described by the dynamics P (s ′ |a, s) (i.e. 
transition probabilities which underlie the transition function T (s, a, s ′ )) Additionally is a reward function R(s, a, s ′ ) that associates a reward or penalty with each state. When these are known or learned the result is a policy π which prescribes actions to take given a state. We can then value a given state s and a policy π in terms of expected future rewards with a value function V π (s). So the ultimate goal here is to identify an optimal policy. Markov Decision Processes as described so far have been fully-observed in the sense that we knew all of their parameters (transition probabilities and so on). Because everything was known in advance, we could conduct offline planning, that is, formulate a plan without needing to interact with the world. MDP parameters aren’t always known from the onset - we may not know the reward function R or even the transition model T , and then we must engage in online planning, in which we must interact with the world to learn more about it to better formulate a plan. Online planning involves reinforcement learning, where agents can learn in what states rewards or goals are located without needing to know from the start. Reinforcement learning in summary: • • • • the agent interacts with its environment and receives feedback in the form of rewards the agent’s utility is defined by the reward function the agent must learn to act so as to maximize expected rewards learning is based on observed samples of outcomes The ultimate goal of reinforcement learning is to learn a policy which returns an action to take given a state. To form a good policy we need to know the value of a given state; we do so by learning a value function which is the sum of rewards from the current state to some terminal state, following a CHAPTER 21. REINFORCEMENT LEARNING 621 21.1. MODEL-BASED LEARNING 622 fixed policy. This value function can be learned and approximated by any learning and approximation approach, e.g. neural networks. One challenge with reinforcement learning is the credit assignment problem - a much earlier action could be responsible for the current outcome, but how is that responsibility assigned? And how it is quantified? With reinforcement learning, we still assume a MDP, it’s just not fully specified - that is, we don’t know R and we might not know T . If we do know T , a utility-based agent can learn R and thus V , which we can then use for MDP. If T and R are both unknown, a Q-learning agent can learn Q(s, a) without needing either. Where V (s) is the value over states, Q(s, a) is the value over state-action pairs and can also be used with MDP. A reflex agent can also directly learn the policy π(s) without needing to know T or R. Reinforcement learning agents can be passive, which means the agent has a fixed policy and learns R and T (if necessary) while executing that policy. Alternatively, an active reinforcement learning agent changes its policy as it goes and learns. Passive learning has the drawbacks that it can take awhile to converge on good estimates for the unknown quantities, and it may limit how much of the space is actually explored, and as such, there may be little or no information about some states and better paths may remain unknown. Critic, Actor, and Actor-Critic methods We can broadly categorize various RL methods into three groups: • critic-only methods, which first learn a value function and then use that to define a policy, e.g. TD learning. • actor-only methods, which directly search the policy space. 
An example is an evolutionary approach where different policies are evolved. • actor-critic methods, where a critic and an actor are both included and learned separately. The critic observes the actor and evaluates its policy, determining when it needs to change. 21.1 Model-based learning Model-based learning is a simple approach to reinforcement learning. The basic idea: • learn an approximate model (i.e. P (s ′ |s, a), that is, T (s, a, s ′ ), and R(s, a, s ′ )) based on experiences • solve for values (i.e. using value iteration or policy iteration) as if the learned model were correct In more detail: 622 CHAPTER 21. REINFORCEMENT LEARNING 623 21.1. MODEL-BASED LEARNING 1. learn an empirical MDP model • count outcomes s ′ for each s, a • normalize to get an estimate of T̂ (s, a, s ′ ) • discover each R̂(s, a, s ′ ) when we experience (s, a, s ′ ) 2. solve the learned MPD (e.g. value iteration or policy iteration) 21.1.1 Temporal Difference Learning (TDL or TD Learning) In temporal difference learning, the agent moves from one state s to the next s ′ , looks at the reward difference between the states, then backs up (propagates) the values (as in value iteration) from one state to the next. We run this many times, reaching a terminal state, then restarting, and so on, to get better estimates of the rewards (utilities) for each state. We keep track of rewards for visited states as U(s) and also the number of times we have visited each state as N(s). The main part of the algorithm is: • • • • if s ′ is new then U[s ′ ] = r ′ if s is not null then increment Ns [s] U[s] = U[s] + α(Ns [s])(r + γU[s] − U[s]) Where α is the learning rate and γ is the discount factor. Another way of thinking of TD learning: Consider a sequence of values v1 , v2 , v3 , . . . , vt . We want to estimate the value vt+1 . We might do t so by averaging the observed values, e.g. v̂t+1 = v1 +···+v . t We can rearrange these terms to give us: t v̂t+1 = v1 + · · · + vt−1 + vt t v̂t+1 = (t − 1)v̂t + vt 1 vt v̂t+1 = (1 − )v̂t + t t vt − v̂t v̂t+1 = v̂t + t The term vt − v̂t is the temporal difference error. Basically our estimate v̂t+1 is derived by updating the previous estimate v̂t proportionally to this error. For instance, if vt > v̂t then the next estimate is increased. This approach treats all values with equal weight, though we may want to decay older values. CHAPTER 21. REINFORCEMENT LEARNING 623 21.2. MODEL-FREE LEARNING 624 Greedy TDL One method of active reinforcement learning is the greedy approach. Given new estimates for the rewards, we recompute a new optimal policy and then use that policy to guide exploration of the space. This gives us new estimates for the rewards, and so on, until convergence. Thus it is greedy in the sense that it always tries to go for policies that seem immediately better, although in the end that doesn’t necessarily guarantee the overall optimal policy (this is the exploration vs exploitation problem). One alternate approach is to randomly try a non-optimal action, thus exploring more of the space. This works, but can be slow to converge. 21.1.2 Exploration agent A approach better than TDL is to use an exploration agent, which favors exploring more when it is uncertain. More specifically, we can use the same TDL algorithm, but while Ns < ϵ, where ϵ is some exploration threshold, we set U[s] = R, where R is the largest reward we expect to get. When Ns > ϵ, we start using the learned reward as with regular TDL. 
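To close this section, here is a minimal sketch of the temporal difference update described above, assuming a hypothetical environment interface (start, policy, step). The visit-count-dependent learning rate α(Ns[s]) from the text is replaced by a constant α for simplicity, and the update moves U(s) toward the observed sample r + γU(s′).

def td_evaluate(start, policy, step, episodes=1000, alpha=0.1, gamma=0.9):
    """Passive TD(0) evaluation of a fixed policy (sketch).
    start()    -- returns an initial state
    policy(s)  -- returns the action the fixed policy takes in s
    step(s, a) -- returns (next_state, reward, done)"""
    U = {}
    for _ in range(episodes):
        s, done = start(), False
        while not done:
            s2, r, done = step(s, policy(s))
            U.setdefault(s, 0.0)
            U.setdefault(s2, 0.0)
            # nudge U(s) toward the observed sample r + gamma * U(s')
            U[s] += alpha * (r + gamma * U[s2] - U[s])
            s = s2
    return U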
21.2 Model-free learning With model-free learning, instead of trying to estimate T or R, we take actions and the actual outcome to what we expected the outcome would be. With passive reinforcement learning, the agent is given an existing policy and just learns from the results of that policy’s execution (that is, learns the state values; i.e. this is essentially just policy evaluation, except this is not offline, this involves interacting with the environment). To compute the values for each state under π, we can use direct evaluation: • act according to π • every time we visit a state, record what the sum of discounted rewards turned out to be • average those samples Direct evaluation is simple, doesn’t require any knowledge of T or R, and eventually gets the correct average values. However, it throws out information about state connections, since each state is learned separately - for instance, if we have a state si with a positive reward, and another state sj that leads into it, it’s possible that direct evaluation assigns state sj a negative reward, which doesn’t make sense - since it leads to a state with a positive reward, it should also have some positive reward. Given enough time/samples, this will eventually resolve, but that can require a long time. Policy evaluation, on the other hand, does take in account the relationship between states, since the value of each state is a function of its child states, i.e. 624 CHAPTER 21. REINFORCEMENT LEARNING 625 21.2. MODEL-FREE LEARNING πi Vk+1 (s) = ∑ s′ T (s, πk (s), s ′ )[R(s, πk (s), s ′ ) + γVkπi (s ′ )] However, we don’t know T and R. Well, we could just try actions and take samples of outcomes s ′ and average: π Vk+1 (s) = 1∑ samplei n i Where each samplei = R(s, π(s), si′ ) + γVkπ (si′ ). R(s, π(s), si′ ) is just the observed reward from taking the action. This is called sample-based policy evaluation. One challenge here: when you try an action, you end up in a new state - how do you get back to the original state to try another action? We don’t know anything about the MDP so we don’t necessarily know what action will do this. So really, we only get one sample, and then we’re off to another state. With temporal difference learning, we learn from each experience (“episode”); that is, we update V (s) each time we experience a transition (s, a, s ′ , r ). The likely outcomes s ′ will contribute updates more often. The policy is still fixed (given), and we’re still doing policy evaluation. Basically, we have an estimate V (s), and then we take an action and get a new sample. We update V (s) like so: V π (s) = (1 − α)V π (s) + (α)sample So we specify a learning rate α (usually small, e.g. α = 0.1) which controls how much of the old estimate we keep. This learning rate can be decreased over time. This is an exponential moving average. This update can be re-written as: V π (s) = V π (s) + α(sample − V π (s)) The term (sample − V π (s)) can be interpreted as an error, i.e. how off our current estimate V π (s) was from the observed sample. So we still never learn T or R, we just keep running sample averages instead; hence temporal difference learning is a model-free method for doing policy evaluation. However, it doesn’t help with coming up with a new policy, since we need Q-values to do so. CHAPTER 21. REINFORCEMENT LEARNING 625 21.2. MODEL-FREE LEARNING 21.2.1 626 Q-Learning With active reinforcement learning, the agent is actively trying new things rather than following a fixed policy. 
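Before turning to Q-learning, here is a small sketch of the direct evaluation approach from the previous passage: run the fixed policy, record the discounted return observed from each visited state, and average. The environment interface is the same hypothetical one used in the TD sketch earlier.

from collections import defaultdict

def direct_evaluation(start, policy, step, episodes=1000, gamma=0.9):
    """Direct (Monte Carlo) evaluation of a fixed policy (sketch)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for _ in range(episodes):
        # roll out one episode under the fixed policy
        s, done, trajectory = start(), False, []
        while not done:
            s2, r, done = step(s, policy(s))
            trajectory.append((s, r))
            s = s2
        # walk backwards, accumulating the discounted return from each state
        G = 0.0
        for s, r in reversed(trajectory):
            G = r + gamma * G
            totals[s] += G
            counts[s] += 1
    return {s: totals[s] / counts[s] for s in totals}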
The fundamental trade-off in active reinforcement learning is exploitation vs exploration. When you land on a decent strategy, do you just stick with it? What if there’s a better strategy out there? How do you balance using your current best strategy and searching for an even better one? Remember that value iteration requires us to look at maxa over the set of possible actions from a state: Vk+1 (s) = max a ∑ T (s, a, s ′ )[R(s, a, s ′ ) + γVk (s ′ )] s′ However, we can’t compute maximums from samples since the maximum is always unknown (there’s always the possibility of a new sample being larger; we can only compute averages from samples). We can instead iteratively compute Q-values (Q-value iteration): Qk+1 (s, a) = ∑ s′ T (s, a, s ′ )[R(s, a, s ′ ) + γ max Q(s ′ , a′ )] ′ a Remember that while a value V (s) is the value of a state, a Q-value Q(s, a) is the value of an action (from a particular state). Here the max term pushed inside, and we are ultimately just computing an average, so we can compute this from samples. This is the basis of the Q-Learning algorithm, which is just sample-based Q-value iteration. We learn Q(s, a) values as we go: • take an action a from a state s and see the outcome as a sample (s, a, s ′ , r ). • consider the old estimate Q(s, a) • consider the new sample estimate: sample = R(s, a, s ′ )+γ maxa′ Q(s ′ , a′ ), where R(s, a, s ′ ) = r , i.e. the reward we just received • incorporate this new estimate into a running average: Q(s, a) = (1 − α)Q(s, a) + (α)sample This can also be written: Q(s, a) =α R(s, a, s ′ ) + γ max Q(s ′ , a′ ) ′ a 626 CHAPTER 21. REINFORCEMENT LEARNING 627 21.2. MODEL-FREE LEARNING These updates emulate Bellman updates as we do in known MDPs. Q-learning converges to an optimal policy, even if you’re acting suboptimal. When an optimal policy is still learned from suboptimal actions, it is called off-policy learning. Another way of saying this is that with off-policy learning, Q-values are updated not according to the current policy (i.e. the current actions), but according to a greedy policy (i.e. the greedy/best actions). We still, however, need to explore and decrease the learning rate (but not too quickly or you’ll stop learning things). In Q-Learning, we don’t need P or the reward/utility function. We directly learn the rewards/utilities of state-action pairs, Q(s, a). With this we can just choose our optimal policy as: π(s) = argmax a ∑ Q(s, a) s′ The Q update formula is simply: Q(s, a) = Q(s, a) + α(R(s) + γ max Q(s ′ , a′ ) − Q(s, a)) Where α is the learning rate and γ is the discount factor. Again, we can back up (as with value iteration) to propagate these values through the network. Note that a simpler version of Q-learning is SARSA, (“State Action Reward State Action”, because the quintuple (st , at , rt , st+1 , at+1 ) is central to this method), which uses the update: Q(s, a) = Q(s, a) + α(R(s) + γQ(s ′ , a′ ) − Q(s, a)) SARSA, in contrast to Q-learning, is on-policy learning; that is, it updates states based on the current policy’s actions, so Q-values are learned according to the current policy and not a greedy policy. n-step Q-learning An action may be responsible for a reward later on, so we want to be able to learn that causality, i.e. propagate rewards. The default one-step Q-learning and SARSA algorithms only associate reward with the direct state-action pair s, a that immediately led to it. We can instead propagate these rewards further with n-step variations, e.g. 
n-step Q-learning updates Q(s, a) with: rt + γrt+1 + · · · + γ n−1 rt+n + max γ n Q(s + t + n + 1, a) a CHAPTER 21. REINFORCEMENT LEARNING 627 21.2. MODEL-FREE LEARNING 21.2.2 628 Exploration vs exploitation Up until now we have not considered how we select actions. So how do we? That is, how do we explore? One simple method is to sometimes take random actions (ϵ-greedy). With a small probability ϵ, act randomly, with probability 1 − ϵ, act on the current best policy. After the space is thoroughly explored, you don’t want to keep moving randomly - so you can decrease ϵ over time. A simple modification for ϵ-greedy action selection is soft-max action selection, where actions are chosen based on their estimated Q(s, a) value. One specific method is to use a Gibbs or Boltzmann distribution where selecting action a in state s is proportional to e Q(s,a)/T where T > 0 is a temperature which influences how randomly actions should be chosen. The higher the temperature, the more random; when T = 0, the best-valued action is always chosen. More specifically, in state s, action a is chosen with the probability: e Q(s,a)/T Q(s,a)/T ae ∑ Alternatively, we can use exploration functions. Generally, we want to explore areas we have high uncertainty for. More specifically, an exploration function takes a value estimate u and a visit count n and returns an optimistic utility. For example: f (u, n) = u + kn . We can modify our Q-update to incorporate an exploration function: Q(s, a) =α R(s, a, s ′ ) + γ max f (Q(s ′ , a′ ), N(s ′ , a′ )) ′ a This encourages the agent not only to try unknown states, but to also try states that lead to unknown states. In addition to exploration and exploitation, we also introduce a concept of regret. Naturally, mistakes are made as the space is explored - regret is a measure of the total mistake cost. That is, it is the difference between your expected rewards and optimal expected rewards. We can try to minimize regret - to do so, we must not only learn to be optimal, but we must optimally learn how to be optimal. For example: both random exploration and exploration functions are optimal, but random exploration has higher regret. 21.2.3 Approximate Q-Learning Sometimes state spaces are far too large to satisfactorily explore. This can be a limit of memory (since Q-learning keeps a table of Q-values) or simply that there are too many states to visit in a reasonable time. In fact, this is the rule rather than the exception. So in practice we cannot learn about every state. 628 CHAPTER 21. REINFORCEMENT LEARNING 629 21.2. MODEL-FREE LEARNING The general idea of approximate Q-learning is to transfer learnings from one state to other similar states. For example, if we learn from exploring one state that a fire pit is bad, then we can generalize that all fire pit states are probably bad. This is an approach like machine learning - we want to learn general knowledge from a few training states; the states are represented by features (for example, we could have a binary feature has fire pit). Then we describe q-states in terms of features, e.g. as linear functions (called a Q-function; this method is called linear function approximation): Q(s, a) = w1 f1 (s, a) + w2 f2 (s, a) + · · · + wn fn (s, a) Note that we can do the same for value functions as well, i.e. 
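Before moving on to approximation, here is a tabular sketch that combines the Q-learning update with the exploration strategies just described (ε-greedy by default, Boltzmann as an alternative). The environment interface (start, actions, step) is again a hypothetical one for illustration.

import math
import random
from collections import defaultdict

def q_learning(start, actions, step, episodes=5000,
               alpha=0.1, gamma=0.9, epsilon=0.1, temperature=None):
    """Tabular Q-learning with epsilon-greedy or Boltzmann exploration (sketch).
    start()     -- initial state
    actions(s)  -- list of actions available in s
    step(s, a)  -- returns (next_state, reward, done)"""
    Q = defaultdict(float)

    def select_action(s):
        opts = actions(s)
        if temperature:  # soft-max: P(a) proportional to exp(Q(s,a)/T)
            weights = [math.exp(Q[(s, a)] / temperature) for a in opts]
            return random.choices(opts, weights=weights, k=1)[0]
        if random.random() < epsilon:        # epsilon-greedy: explore
            return random.choice(opts)
        return max(opts, key=lambda a: Q[(s, a)])

    for _ in range(episodes):
        s, done = start(), False
        while not done:
            a = select_action(s)
            s2, r, done = step(s, a)
            # off-policy target: observed reward plus best Q-value of s'
            best_next = 0.0 if done else max(Q[(s2, a2)] for a2 in actions(s2))
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s2
    return Q

Replacing the max in the target with the Q-value of the action actually taken next would give SARSA, the on-policy variant.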
V (s) = w1 f1 (s) + w2 f2 (s) + · · · + wn fn (s) So we observe a transition (s, a, r, s ′ ) and then we compute the difference of this observed transition from what we expected, i.e: difference = [r + γ max Q(s ′ , a′ )] − Q(s, a) ′ a With exact Q-learning, we would update Q(s, a) like so: Q(s, a) = Q(s, a) + α[difference] With approximate Q-learning, we instead update the weights, and we do so in proportion to their feature values: wi = wi + α[difference]fi (s, a) This is the same as least-squares regression. That is, given a point x, with features f (x) and target value y , the error is: error(w ) = ∑ 1 (y − wk fk (x))2 2 k The derivative of the error with respect to a weight wm is: ∑ ∂error(w ) = −(y − wk fk (x))fm (x) ∂wm k CHAPTER 21. REINFORCEMENT LEARNING 629 21.2. MODEL-FREE LEARNING 630 Then we update the weight: wm = wm + α(y − ∑ wk fk (x))fm (x) k In terms of approximate Q-learning, the target y is r + γ maxa′ Q(s ′ , a′ ) and our prediction is Q(s, a): wm = wm + α[r + γ max Q(s ′ , a′ ) − Q(s, a)]fm (s, a) ′ a 21.2.4 Policy Search Q-learning tries to model the states by learning q-values. However, a feature-based Q-learning model that models the states well does not necessarily translate to a good feature-based policy (and viceversa). Instead of trying to model the unknown states, we can directly try to learn the policies that maximize rewards. So we can use Q-learning and learn a decent solution, then fine-tune by hill climbing on feature weights. That is, we learn an initial linear Q-function, the nudge each feature weight and up and down to see if the resulting policy is better. We test whether or not a policy is better by running many sample episodes. If we have many features, we have to test many new policies, and this hill climbing approach becomes impractical. There are better methods (not discussed here). 21.2.5 Summary A helpful table: For a known MDP, we can compute an offline solution: Goal Technique Compute V ∗ , Q∗ , π ∗ Value or policy iteration Evaluate a fixed policy π Policy evaluation For an unknown MDP, we can use model-based approaches: 630 Goal Technique Compute V ∗ , Q∗ , π ∗ Value or policy iteration on the approximated MDP Evaluate a fixed policy π Policy evaluation on the approximated MDP CHAPTER 21. REINFORCEMENT LEARNING 631 21.3. DEEP Q-LEARNING Goal Technique Or we can use model-free approaches: Goal Technique Compute V ∗ , Q∗ , π ∗ Q-learning Evaluate a fixed policy π Value learning 21.3 Deep Q-Learning The previous Q-learning approach was tabular in that we essentially kept a table of mappings from (s, a) to some value. However, we’d like to be a bit more flexible and not have to map exact states to values, but map similar states to similar values. The general idea behind deep Q-learning is using a deep neural network to learn Q(s, a), which gives us this kind of mapping. This is essentially a regression problem, since Q-values are continuous. So we can use a squared error loss in the form of a Bellman equation: L= 1 [r + max Q(s ′ , a′ ) − Q(s, a)]2 a′ 2 Where the r + maxa′ Q(s ′ , a′ ) term is the target value and Q(s, a) is the predicted value. Approximating Q-values using nonlinear functions is not very stable, so tricks are needed to get good performance. One problem is catastrophic forgetting, in which similar states may lead to drastically different outcomes. For instance, there may be a state for which is a single move away from winning, and then another similar state where that same move leads to failure. 
When the agent wins from that first state, it will assign a high value to it. Then, when it loses from the similar state, it revises its value negatively, and in doing so it “overwrites” its assessment of the other state. So catastrophic forgetting occurs when similar states lead to very different outcomes, and when this happens, the agent is unable to properly learn. One trick for this is experience replay in which each experience tuple (s, a, r, s ′ ) are saved (this collection of saved experiences is called “replay memory”). Memory size is often limited to keep only the last n experiences. Then the network is trained using random minibatches sampled from the replay memory. This essentially turns the task into a supervised learning task. A deep Q-learning algorithm that includes experience replay and ϵ-greedy exploration follows (source): CHAPTER 21. REINFORCEMENT LEARNING 631 21.4. REFERENCES • • • • 632 initialize replay memory D initialize action-value function Q (with random weights) observe initial state s repeat – select an action a * with probability ϵ select a random action * otherwise select a = argmaxa′ Q(s, a′ ) – – – – – carry out action a observe reward r and new state s ′ store experience < s, a, r, s ′ > in replay memory D sample random transitions < ss, aa, r r, ss ′ > from replay memory D calculate target for each minibatch transition * if ss ′ is terminal state then tt = r r * otherwise tt = r r + γ maxa′ Q(ss ′ , aa′ ) – train the Q network using (tt − Q(ss, aa))2 as loss – s = s′ 21.4 References • Reinforcement Learning - Part 1. Brandon B. • CS188: Artificial Intelligence. Dan Klein, Pieter Abbeel. University of California, Berkeley (edX). • Intro to Artificial Intelligence. CS271. Peter Norvig, Sebastian Thrun. Udacity. • 11.3 Reinforcement Learning. Artificial Intelligence: Foundations of Computational Agents. David Poole and Alan Mackworth. Cambridge University Press, 2010. • Reinforcement Learning in a Nutshell. V. Heidrich-Meisner, M. Lauer, C. Igel, M. Riedmiller. • Asynchronous Methods for Deep Reinforcement Learning. Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Tim Harley, Timothy P. Lillicrap, David Silver, Koray Kavukcuoglu. • Demystifying Deep Reinforcement Learning. Tambet Matiisen. • Reinforcement Learning: A Tutorial. Mance E. Harmon & Stephanie S. Harmon. • Q-learning with Neural Networks. Brandon B. 632 CHAPTER 21. REINFORCEMENT LEARNING 633 22 Filtering Often an agent is uncertain what state the world is in. Filtering (or monitoring) is the task of tracking and updating the belief state (the distribution) Bt (X) = Pt (Xt |e1 , . . . , et ) as new evidence is observed. We start with B1 (X) with some initial setting (typically uniform), and update as new evidence is observed/time passes. 22.1 Particle filters Sometimes we have state spaces which are too large for exact inference (i.e. too large for the forward algorithm) or just to hold in memory. For example, if the state space X is continuous. Instead, we can use particle filtering, which provides an approximate solution. With particle filters, possible states are represented as particles (vectors); the density of these vectors in state space represents the posterior probability of being in a certain state (that is, higher density means the true state is more likely in that region), and the set of all these vectors represents the belief state. Another way of putting this: Particles are essentially samples of possible states. 
21.4 References

• Reinforcement Learning - Part 1. Brandon B.
• CS188: Artificial Intelligence. Dan Klein, Pieter Abbeel. University of California, Berkeley (edX).
• Intro to Artificial Intelligence. CS271. Peter Norvig, Sebastian Thrun. Udacity.
• 11.3 Reinforcement Learning. Artificial Intelligence: Foundations of Computational Agents. David Poole and Alan Mackworth. Cambridge University Press, 2010.
• Reinforcement Learning in a Nutshell. V. Heidrich-Meisner, M. Lauer, C. Igel, M. Riedmiller.
• Asynchronous Methods for Deep Reinforcement Learning. Volodymyr Mnih, Adrià Puigdomènech Badia, Mehdi Mirza, Alex Graves, Tim Harley, Timothy P. Lillicrap, David Silver, Koray Kavukcuoglu.
• Demystifying Deep Reinforcement Learning. Tambet Matiisen.
• Reinforcement Learning: A Tutorial. Mance E. Harmon & Stephanie S. Harmon.
• Q-learning with Neural Networks. Brandon B.

22 Filtering

Often an agent is uncertain what state the world is in. Filtering (or monitoring) is the task of tracking and updating the belief state (the distribution)

B_t(X) = P_t(X_t | e_1, …, e_t)

as new evidence is observed. We start with B_1(X) at some initial setting (typically uniform), and update as new evidence is observed/time passes.

22.1 Particle filters

Sometimes we have state spaces which are too large for exact inference (i.e. too large for the forward algorithm) or just too large to hold in memory - for example, if the state space X is continuous. Instead, we can use particle filtering, which provides an approximate solution.

With particle filters, possible states are represented as particles (vectors); the density of these vectors in state space represents the posterior probability of being in a certain state (that is, higher density means the true state is more likely in that region), and the set of all these vectors represents the belief state.

Another way of putting this: particles are essentially samples of possible states. Each particle can be thought of as a hypothesis that we are in the state it represents. The more particles there are for a state, the more likely it is that we are in that state.

So to start, these particles may be very diffuse, spread out across the space somewhat uniformly. As more data (measurements/observations) is collected, the particles are resampled and placed according to these observations, and they start to concentrate in more likely regions.

More formally, our representation of P(X) is now a list of N particles (generally N ≪ |X|, and we don't need to store X in memory anymore, just the particles). P(x) is approximated by the number of particles with value x (i.e. the more particles that have value x, the more likely state x is).

Particles have weights, and they all start with a weight of 1. As time passes, we "move" each particle by sampling its next position from the transition model:

x′ = sample(P(X′ | x))

As we gain evidence, we fix the evidence and downweight samples based on the evidence:

w(x) = P(e | x)
B(X) ∝ P(e | x) B′(X)

These particle weights reflect how likely the evidence is from that particle's state. A result of this is that the probabilities don't sum to one anymore. This is similar to likelihood weighting.

Rather than tracking the weighted samples, we resample. That is, we sample N times from the weighted sample distribution (i.e. we draw with replacement). This is essentially renormalizing the distribution, and it has the effect of "moving" low-weight (unlikely) samples to where high-weight samples are (i.e. to likely states), so they become more "useful".

The particle filter algorithm:

```python
# s is a set of particles with importance weights
# u is a control vector
# z is a measurement vector
def particle_filter(s, u, z):
    # a new particle set
    s_new = []
    n = len(s)
    for i in range(n):
        # sample a particle (with replacement)
        # based on the importance weights
        p = sample(s)

        # sample a possible successor state (i.e. a new particle)
        # according to the state transition probability
        # and the sampled particle
        # p' ~ p(p'|u, p)
        p_new = sample_next_state(u, p)

        # use the measurement probability as the importance weight
        # p(z|p')
        w_new = measurement_prob(z, p_new)

        # save to new particle set
        s_new.append([p_new, w_new])

    # normalize the importance weights
    # so they act as a probability distribution
    eta = sum(w for p, w in s_new)
    for i in range(n):
        s_new[i][1] /= eta
    return s_new
```

Particle filters do not scale to high-dimensional spaces, because the number of particles you need to fill a high-dimensional space grows exponentially with the dimensionality (though there are some particle filter methods that can handle this better). But they work well for many applications: they are easy to implement, computationally efficient, and can deal well with complex posterior distributions.
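The sample, sample_next_state, and measurement_prob functions above are left unspecified; one possible set of toy stand-ins (a hypothetical 1-D Gaussian motion and measurement model, purely for illustration) could look like:

```python
import math
import random

def sample(s):
    # draw one particle's state from s (a list of [state, weight] pairs),
    # with probability proportional to its importance weight
    states = [p for p, _ in s]
    weights = [w for _, w in s]
    return random.choices(states, weights=weights)[0]

def sample_next_state(u, p):
    # toy 1-D transition model: move by the control u plus Gaussian noise
    return p + u + random.gauss(0, 0.5)

def measurement_prob(z, p):
    # toy 1-D measurement model: Gaussian likelihood of reading z
    # given that the true state is p
    sigma = 1.0
    return math.exp(-((z - p) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# start with 100 particles spread uniformly over [0, 10], each with weight 1
particles = [[random.uniform(0, 10), 1.0] for _ in range(100)]
particles = particle_filter(particles, u=1.0, z=4.2)
```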
22.1.1 DBN particle filters

There are also DBN particle filters, in which each particle represents a full assignment to the world (i.e. a full assignment of all variables in the Bayes' net). Then at each time step, we sample a successor for each particle. When we observe evidence, we weight each entire sample by the likelihood of the evidence conditioned on the sample. Then we resample - we select prior samples in proportion to their likelihood.

Basically, a DBN particle filter is a particle filter where each particle represents multiple assigned variables rather than just one.

22.2 Kalman Filters

(Note: the figures referenced below are all sourced from How a Kalman filter works, in pictures, Tim Babb.)

Kalman filters can provide an estimate for the current state of a system and, from that, provide an estimate for the next state of the system. They make the approximation that everything is Gaussian (i.e. transitions and emissions).

We have some random variables describing the current state (in the example below, they are position p and velocity v); we can generalize this as a random variable vector S. We are uncertain about the current state, but (we assume) these variables can be expressed as Gaussian distributions, parameterized by a mean value and a variance (which reflects the uncertainty).

These random variables may be uncorrelated (knowing the state of one tells us nothing about the other), or they may be correlated. This correlation is described by a covariance matrix, Σ, where the element σ_ij describes the correlation between the i-th and j-th random variables. Covariance matrices are symmetric.

[Figure: Correlated Gaussian random variables]

We say the current state is at time t−1, so the random variables describing it are notated S_{t−1}, and the next state (which we want to predict) is at time t, so the random variables we predict are notated S_t.

The Kalman filter basically takes the random variable distributions for the current state and gives us new random variable distributions for the next state; in essence, it moves each possible point for the current state to a new predicted point.

[Figure: From current distributions to predicted distributions for the next state]

We then have to come up with some function for making the prediction. In the example of position p and velocity v, we can just use p_t = p_{t−1} + ∆t·v_{t−1} to update the position and assume the velocity is kept constant, i.e. v_t = v_{t−1}. We can represent these functions collectively as a matrix applied to the state vector:

S_t = [[1, ∆t], [0, 1]] S_{t−1} = F_t S_{t−1}

We call this matrix F_t our prediction matrix. With this, we can transform the means of each random variable (we notate the vector of these means as Ŝ_{t−1}, since these means are our best estimates) into the predicted means for the next state, Ŝ_t.

We can similarly apply the prediction matrix to determine the covariance at time t, using the property Cov(Ax) = AΣA^T, so that:

Σ_t = F_t Σ_{t−1} F_t^T

It's possible we also want to model external influences on the system. In the position and velocity example, perhaps some acceleration is being applied. We can capture these external influences in a vector u_t, called the control vector. For the position and velocity example, this control vector would just have acceleration a, i.e. u_t = [a]. We then need to update our prediction functions for each random variable in S to incorporate it, i.e. p_t = p_{t−1} + ∆t·v_{t−1} + ½∆t²·a and v_t = v_{t−1} + a·∆t.

Again, we can pull out the coefficients for the control vector terms into a matrix, called the control matrix, which we'll notate as U_t. For this example, it would be:

U_t = [∆t²/2, ∆t]^T

We can then update our prediction function:

S_t = F_t S_{t−1} + U_t u_t

These control terms capture external influences we are certain about, but we also want to model external influences we are uncertain about. To model this, instead of moving each point from the distributions of S_{t−1} exactly to where the prediction function says it should go, we also describe these new predicted points as Gaussian distributions with covariance matrices Q_t. We can incorporate the uncertainty modeled by Q_t by including it when we update the predicted covariance at time t:

Σ_t = F_t Σ_{t−1} F_t^T + Q_t

[Figure: Modeling uncertainty in the predicted points]
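A minimal numpy sketch of this prediction step, using the position/velocity example (the specific numbers here are arbitrary):

```python
import numpy as np

dt = 1.0
S = np.array([0.0, 1.0])         # state estimate: [position, velocity]
Sigma = np.eye(2)                # covariance of the estimate
F = np.array([[1.0, dt],         # prediction matrix F_t
              [0.0, 1.0]])
U = np.array([0.5 * dt**2, dt])  # control matrix U_t (a column for acceleration)
a = 0.2                          # control input u_t: acceleration
Q = 0.01 * np.eye(2)             # process noise covariance Q_t

# predict step: S_t = F S_{t-1} + U u_t,  Sigma_t = F Sigma_{t-1} F^T + Q_t
S_pred = F @ S + U * a
Sigma_pred = F @ Sigma @ F.T + Q
```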
Now consider that we have sensors which measure the current state for us, though with some measurement error (noise). We can model these sensors with a matrix H_t (which maps our state random variables to the values the sensors read) and incorporate them:

µ_expected = H_t Ŝ_t
Σ_expected = H_t Σ_t H_t^T

This gives us the final equations for our predicted state values.

Now say we've come to the next state and we get new sensor values. This allows us to observe the new state (with some noise/uncertainty) and combine it with our predicted state values to get a more accurate estimate of the new current state.

The readings we get for our state random variables (e.g. position and velocity) are represented by a vector z_t, and the uncertainty/noise (covariance) in these measurements is described by the covariance matrix R_t. Basically, these sensors are also described as Gaussian distributions, where the values the sensor gave us, z_t, are considered the vector of means for each random variable.

[Figure: Uncertainty in sensor readings]

We are left with two Gaussians - one describing the sensor readings and their uncertainty, and another describing the predicted values and their uncertainty. We can multiply the distributions to get their overlap, which describes the space of values likely under both distributions.

[Figure: Overlap of the two Gaussians]

The resulting overlap is, yet again, also a Gaussian distribution with its own mean and covariance matrix. We can compute this new mean and covariance from the two distributions that formed it. First, consider the product of two 1D Gaussian distributions; we want to know whether

N(x, µ_0, σ_0) · N(x, µ_1, σ_1) =? N(x, µ, σ)

As a reminder, the Gaussian distribution is formalized as:

N(x, µ, σ) = (1/(σ√(2π))) exp(−(x − µ)²/(2σ²))

[Figure: The product of two 1D Gaussians]

We can solve for both µ and σ² to get:

µ = µ_0 + σ_0²(µ_1 − µ_0)/(σ_0² + σ_1²)
σ² = σ_0² − σ_0⁴/(σ_0² + σ_1²)

To make this more readable, we can factor out k, such that:

k = σ_0²/(σ_0² + σ_1²)
µ = µ_0 + k(µ_1 − µ_0)
σ² = σ_0² − kσ_0²

In dimensions higher than 1, we can re-write the above with matrices (the µ are now vectors, and division becomes a matrix inverse):

K = Σ_0 (Σ_0 + Σ_1)⁻¹
µ = µ_0 + K(µ_1 − µ_0)
Σ = Σ_0 − KΣ_0

This matrix K is the Kalman gain.

So we have the two following distributions:

• The predicted state: (µ_0, Σ_0) = (H_t Ŝ_t, H_t Σ_t H_t^T)
• The observed state: (µ_1, Σ_1) = (z_t, R_t)

And using the above, we compute their overlap to get a new best estimate:

H_t Ŝ_t′ = H_t Ŝ_t + K(z_t − H_t Ŝ_t)
H_t Σ_t′ H_t^T = H_t Σ_t H_t^T − K H_t Σ_t H_t^T
K = H_t Σ_t H_t^T (H_t Σ_t H_t^T + R_t)⁻¹

Simplifying a bit, we get:

Ŝ_t′ = Ŝ_t + K′(z_t − H_t Ŝ_t)
Σ_t′ = Σ_t − K′ H_t Σ_t
K′ = Σ_t H_t^T (H_t Σ_t H_t^T + R_t)⁻¹

These are the equations for the update step, which gives us the new best estimates Ŝ_t′ and Σ_t′.
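Continuing the earlier sketch, the update step might look like the following in numpy (again with arbitrary numbers; H_t is the identity here because the hypothetical sensor reads the state variables directly):

```python
import numpy as np

# predicted state and covariance from the prediction step (arbitrary values)
S_pred = np.array([1.2, 1.2])
Sigma_pred = np.array([[1.01, 0.01],
                       [0.01, 1.01]])

H = np.eye(2)             # sensor model H_t (reads position and velocity directly)
R = 0.1 * np.eye(2)       # measurement noise covariance R_t
z = np.array([1.0, 1.1])  # sensor reading z_t

# Kalman gain: K' = Sigma H^T (H Sigma H^T + R)^-1
K = Sigma_pred @ H.T @ np.linalg.inv(H @ Sigma_pred @ H.T + R)

# update: new best estimates of the state and its covariance
S_new = S_pred + K @ (z - H @ S_pred)
Sigma_new = Sigma_pred - K @ H @ Sigma_pred
```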
Kalman filters work for modeling linear systems; for nonlinear systems you instead need to use the extended Kalman filter.

22.3 References

• How a Kalman filter works, in pictures. August 11, 2015. Tim Babb.
• Intro to Artificial Intelligence. CS271. Peter Norvig, Sebastian Thrun. Udacity.
• CS188: Artificial Intelligence. Dan Klein, Pieter Abbeel. University of California, Berkeley (edX).

23 In Practice

In practice, there is no one-size-fits-all solution for AI problems. Generally, some combination of techniques is required.

23.1 Starcraft

Starcraft is hard for AI because it is:

• adversarial
• long horizon
• partially observable (fog-of-war)
• realtime (i.e. 24fps, one action per frame)
• huge branching factor
• concurrent (i.e. players move simultaneously)
• resource-rich

There is no single algorithm (e.g. minimax) that will solve it off-the-shelf.

The Berkeley Overmind won AIIDE 2010 (a Starcraft AI competition). It used:

• search: for path planning for troops
• CSPs: for base layout (i.e. buildings/facilities)
• minimax: for targeting of opponent's troops and facilities
• reinforcement learning (potential fields): for micro control (i.e. troop control)
• inference: for tracking opponent's units
• scheduling: for managing/prioritizing resources
• hierarchical control: high-level to low-level plans

23.2 References

• CS188: Artificial Intelligence. Dan Klein, Pieter Abbeel. University of California, Berkeley (edX).

Part IV: Simulation

24 Agent-Based Models

Agent-based models include the following characteristics:

• individual agents model intelligent behavior, usually with a simple set of rules
• the agents are situated in some space or a network and interact with each other locally
• the agents usually have imperfect, local information
• there is usually variability between agents
• often there are random elements, either among the agents or in the world

24.1 Agents

• An agent is an entity that perceives and acts.
• A rational agent selects actions that maximize its (expected) utility.
• Characteristics of the percepts, environment, and action space dictate techniques for selecting rational actions.

Reflex agents choose actions based on the current percept (and maybe memory). They are concerned almost exclusively with the current state of the world - they do not consider the future consequences of their actions, and they don't have a goal they are working towards. Rather, they just operate off of simple "reflexes".

Agents that plan consider long(er)-term consequences of their actions, have a model of how the world changes based on their actions, work towards a particular goal (or goals), and can find an optimal solution (plan) for achieving their goals.

24.1.1 Brownian agents

A Brownian agent is described by a set of state variables u_i^(k), where i ∈ [1, …, N] refers to the individual agent i and k indicates the different variables.

These state variables may be external, which are observable from outside the agent, or internal degrees of freedom that must be inferred from observable actions. The state variables can change over time due to the environment or internal dynamics. We can generally express the dynamics of the state variables as follows:

du_i^(k)/dt = f_i^(k) + F_i^stoch

The principle of causality is represented here: any effect, such as a temporal change of the variable u, has some causes on the right-hand side of the equation; such causes are described as a superposition of deterministic and stochastic influences imposed on the agent i.

In this formulation, f_i^(k) is a deterministic term representing influences that can be specified on the time and length scale of the agent, whereas F_i^stoch is a stochastic term which represents influences that exist, but are not observable on the time and length scale of the agent.

The deterministic term f_i^(k) captures all specified influences that cause changes to the state variable u_i^(k), including interactions with other agents j ∈ N, so it could be a function of the state variables of other agents in addition to external conditions.
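As an illustration only, we could simulate one state variable per agent with a simple Euler step, assuming a particular deterministic term (a pull toward the other agents' mean) and Gaussian noise for the stochastic term; these modeling choices are assumptions for the example, not part of the formulation above:

```python
import random

n_agents = 10
dt = 0.1
u = [random.uniform(-1, 1) for _ in range(n_agents)]  # one state variable per agent

def deterministic(i, u):
    # assumed deterministic term: pull agent i toward the mean of the others
    others = [x for j, x in enumerate(u) if j != i]
    return sum(others) / len(others) - u[i]

def stochastic():
    # assumed stochastic term: Gaussian noise
    return random.gauss(0, 0.5)

for _ in range(1000):
    # du_i/dt = f_i + F_i^stoch, integrated with a simple Euler step
    u = [u[i] + dt * (deterministic(i, u) + stochastic()) for i in range(n_agents)]
```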
24.2 Multi-task and multi-scale problems

A multi-task domain is an environment where an agent performs two or more separate tasks.

A multi-scale domain is a multi-task domain that satisfies the following:

• multiple structural scales: actions are performed across multiple levels of coordination
• interrelated tasks: there is not a strict separation across tasks, and the performance in each task impacts other tasks
• actions are performed in real-time

More generally, multi-scale problems involve working at many different levels of detail. For example, an AI for an RTS game must manage many simultaneous goals at the micro and macro level, these goals and their tasks are often interwoven, and all this must be done in real-time.

24.3 Utilities

We encode preferences for an agent, e.g. A ≻ B means the agent prefers A over B (on the other hand, A ∼ B means the agent is indifferent between them). A lottery represents these preferences under uncertainty, e.g. [p, A; 1 − p, B].

Rational preferences must obey the axioms of rationality:

• orderability: (A ≻ B) ∨ (B ≻ A) ∨ (A ∼ B). You either have to like A better than B, B better than A, or be indifferent.
• transitivity: (A ≻ B) ∧ (B ≻ C) ⟹ (A ≻ C)
• continuity: A ≻ B ≻ C ⟹ ∃p [p, A; 1 − p, C] ∼ B. That is, if B is somewhere between A and C, there is some lottery between A and C that is equivalent to B.
• substitutability: A ∼ B ⟹ [p, A; 1 − p, C] ∼ [p, B; 1 − p, C]. If you're indifferent to A and B, you are indifferent to them in lotteries.
• monotonicity: A ≻ B ⟹ (p ≥ q ⇔ [p, A; 1 − p, B] ⪰ [q, A; 1 − q, B]). If you prefer A over B, then when given lotteries between A and B, you prefer the lottery that is biased towards A.

When preferences are rational, they imply behavior that maximizes expected utility, which implies we can come up with a utility function to represent these preferences. That is, there exists a real-valued function U such that:

U(A) ≥ U(B) ⇔ A ⪰ B
U([p_1, S_1; …; p_n, S_n]) = Σ_i p_i U(S_i)

The second equation says that the utility of a lottery is the expected value of that lottery.
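For example, a trivial sketch of computing the utility of a lottery as its expected utility (the outcomes and utility values are made up):

```python
def lottery_utility(lottery, U):
    # lottery is a list of (probability, outcome) pairs;
    # its utility is the expected utility of its outcomes
    return sum(p * U[outcome] for p, outcome in lottery)

U = {"A": 10, "B": 4, "C": 0}                          # hypothetical utilities
print(lottery_utility([(0.4, "A"), (0.6, "C")], U))    # 4.0, same as U["B"]
```

Here the agent would be indifferent between B and the lottery [0.4, A; 0.6, C], which is exactly the kind of lottery the continuity axiom guarantees exists.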
24.4 References

• Think Complexity. Version 1.2.3. Allen B. Downey. 2012.
• An Agent-Based Model of Collective Emotions in Online Communities. Frank Schweitzer, David Garcia. Swiss Federal Institute of Technology Zurich. 2008.
• Integrating Learning in a Multi-Scale Agent. Ben G. Weber. 2012.
• CS188: Artificial Intelligence. Dan Klein, Pieter Abbeel. University of California, Berkeley (edX).

25 Nonlinear Dynamics

Two approaches to science:

• mathematical, defined by equations, using proofs ("classical" models)
  - typically deterministic
  - generally involve many simplifying assumptions
  - use linear approximations to model non-linear systems
• computational-based, typically defined by simple rules, using simulations ("complex" models)
  - often stochastic
  - often also involve simplifying assumptions, but less so
  - deal better with non-linear systems

(Chapter 1 of Think Complexity provides a good overview of these two approaches.)

Some systems may be very hard to accurately model, even though they may be deterministic. Complex behavior arises from deterministic nonlinear dynamic systems, which exhibit two special properties:

• sensitive dependence on initial conditions
• characteristic structure

Most nonlinear dynamic systems are chaotic, and nonlinear dynamic systems constitute most of the dynamic systems we encounter. In general, systems involving flows (heat, fluid, etc.) demonstrate nonlinear dynamics, but they also show up in classical mechanics (e.g. the three-body problem, the double-jointed pendulum). The equations that describe chaotic systems can't be solved analytically - they are solved with computers instead.

25.1 Maps

Maps describe systems that operate in discrete time intervals. In particular, a map is a mathematical operator that advances the system one time step (i.e. to the next step). We describe them using a difference equation (not to be confused with differential equations, which come up later):

x_{n+1} = f(x_n)

where f is the map and x_i is the state of the system at time step i.

The states (i.e. each of x_0, x_1, …, also called iterates) of a map may converge to a fixed point, where they no longer change as the map is further applied (i.e. the state is invariant to the dynamics of the system), which is notated as x∗. There are different kinds of fixed points:

• attracting fixed points, which the system tends towards when perturbed (stable)
• unstable fixed points, at which the dynamics are stationary, but which the system is not "naturally drawn" to (they are repelling). If the system is perturbed from such a point, it does not settle back into it.

For example, if you drop the double pendulum, eventually it settles to a stationary position:

    _____
      |
      |
      0

This is an attracting fixed point. There are other fixed points in this system. For example, the pendulum could be balanced on top:

      0
      |
      |
    __|__

Here it would remain stationary, but it is easily disturbed, and this is not a point that the system would settle into on its own.

The time steps leading to a fixed point are called the transient. The sequence of iterates is called an orbit or a trajectory of the dynamical system. The first state x_0 is the initial condition.

A common map is the logistic map, L(x_n) (often used to model populations):

x_{n+1} = r x_n (1 − x_n)

It includes a parameter r ∈ (0, 4), and x ∈ (0, 1).

Attractors are the states that remain after the transient dies out (i.e. after the system "settles"). Attracting fixed points are one kind of attractor, but there are also periodic (oscillating) and chaotic (strange) attractors. A basin of attraction is the set of initial conditions which eventually converge on the same attractor.

One attractor is the (fixed) periodic orbit (also called a limit cycle), which is just a sequence of iterates that repeats indefinitely. A particular cycle may be referred to as an n-cycle, where n refers to the period of the cycle, i.e. the number of time steps over which the cycle repeats.
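A quick sketch of iterating the logistic map for a few values of r, dropping the transient to see what the system settles on (the cutoff of 1000 transient iterations is just a guess):

```python
def logistic_trajectory(r, x0=0.2, transient=1000, keep=8):
    x = x0
    # iterate the map, throwing away the transient
    for _ in range(transient):
        x = r * x * (1 - x)
    # then record a few iterates to see what the system settled on
    out = []
    for _ in range(keep):
        x = r * x * (1 - x)
        out.append(round(x, 4))
    return out

print(logistic_trajectory(2.0))   # fixed point attractor: all values the same
print(logistic_trajectory(3.1))   # periodic attractor: values alternate
print(logistic_trajectory(3.6))   # chaotic attractor: no repeating pattern
```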
25.1.1 Bifurcations

A bifurcation refers to a qualitative change in the topology of an attractor. For example, in the logistic map, one value of r may give a fixed point attractor, but changing it to another value may change it to a (fixed) periodic orbit attractor (here r would be called a bifurcation parameter). Note that a "qualitative" change means it changes the kind of attractor; i.e. if changing r just shifts the fixed point, that is not a bifurcation.

For example, with the logistic map: when r = 3.6 we have a chaotic attractor (also known as a strange attractor), when r = 3.1 we have a periodic attractor, and when r = 2 we have a fixed point attractor.

25.1.2 Return maps

Often (1D/scalar) maps are plotted as a time domain plot, in which the horizontal axis is time n and the vertical axis is the state value x_n. Another way of plotting them is using the (first) return map, also known as a correlation plot or a cobweb diagram, in which the horizontal axis is x_n and the vertical axis is x_{n+1}. This is known as the first return map because we correlate x_n with x_{n+1}. A second return map, for example, would correlate x_n with x_{n+2}.

On a return map, we also often include the line x_{n+1} = x_n, which is the line on which any fixed points must lie (by definition).

[Figure: A cobweb plot for the logistic map, from Wikipedia]

25.1.3 Bifurcation diagrams

A bifurcation diagram has r (i.e. from the logistic map) as its horizontal axis and x_n as its vertical axis. We also remove the transient from the front of the trajectory (knowing how to remove the transient, i.e. how many points to throw away, takes a bit of trial-and-error). That way we only see the points the system settles on for any given value of r (if indeed it settles).

So for each value of r that leads to a fixed point attractor, there is only one value of x_n. For each value of r that leads to a periodic attractor, we may have two or a few values of x_n. For each value of r that leads to a chaotic attractor, we will have many, many values of x_n.

[Figure: Bifurcation diagram of the logistic map, from Wikipedia]

When looking at a bifurcation diagram, you may notice some interesting structures. In particular, you may notice that some periodic attractors bifurcate into periodic attractors of double the period (e.g. a 2-cycle that turns into a 4-cycle at some value of r). This is called a period-doubling cascade. You may also notice "dark veils" (they can look like thin lines cutting through) in the chaotic parts of the bifurcation diagram - they are the result of unstable periodic orbits.

The bifurcation diagram can also be a fractal object in that it can contain copies of itself within itself. For example, you can "zoom in" on the diagram and find its own structure repeated at smaller levels. Note that many, but not all, chaotic systems have a fractal state-space structure.
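One way to generate such a diagram (a sketch using matplotlib; the number of transient and plotted iterates is a judgment call):

```python
import numpy as np
import matplotlib.pyplot as plt

rs, xs = [], []
for r in np.linspace(2.5, 4.0, 2000):
    x = 0.2
    for _ in range(1000):   # discard the transient
        x = r * x * (1 - x)
    for _ in range(100):    # keep the points the system settles on
        x = r * x * (1 - x)
        rs.append(r)
        xs.append(x)

plt.plot(rs, xs, ',k')      # one tiny dot per (r, x_n) pair
plt.xlabel('r')
plt.ylabel('x_n')
plt.show()
```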
25.1.4 Feigenbaum number

If you look at the parts between bifurcations (the "pitchforks") in a bifurcation diagram, you may notice that their widths and heights decrease at a constant ratio.

[Figure: Feigenbaum numbers from a bifurcation diagram (source)]

If we take ∆_i to be the width of bifurcation i, we can frame this (for the widths) as ∆_1/∆_2 = ∆_2/∆_3 = ⋯. We can look at the limit of this as n → ∞ to figure out this ratio. For the logistic map:

lim_{n→∞} ∆_n/∆_{n+1} = 4.66

This value is called the Feigenbaum number, and it holds (as 4.66) for any 1D map with a quadratic maximum (i.e. it looks like a parabola near its maximum). For the heights of these pitchforks there's a different value that's computed in a similar way.

25.1.5 Sensitivity to initial conditions

One way of describing maps' sensitivity to initial conditions is that they bring far-apart points close together and push close-together points far away from each other. One analogy is the kneading of dough: as you knead dough, parts that were close together end up far apart, and parts that were far apart end up close together (although if we consider the kneading as a continuous process, technically this is a flow, but we can imagine it as discrete time steps).
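A small demonstration of this sensitivity with the logistic map in a chaotic regime: two initial conditions differing by 10^-9 end up far apart after a few dozen iterations (the value of r and the size of the perturbation are arbitrary choices):

```python
r = 3.9
x, y = 0.2, 0.2 + 1e-9   # two nearly identical initial conditions

for n in range(60):
    x = r * x * (1 - x)
    y = r * y * (1 - y)
    if n % 10 == 0:
        # the separation grows roughly exponentially until it saturates
        print(n, abs(x - y))
```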
We can think of stable fixed points as some kind of “bowl” that values “roll down”. In contrast, an unstable fixed point could be either an upside-down bowl or a saddle. CHAPTER 25. NONLINEAR DYNAMICS 661 25.2. FLOWS 662 Sidebar on some linear algebra: matrices can be applied to transform space, i.e. for rotations, scaling, translations, etc. A matrix of eigenvectors allows us to express the fundamental features of a landscape. A point that starts on an eigenvector stays on that eigenvector (“eigen” means “same”). An eigenvalue tells you how fast a state travels along an eigenvector and in what direction - specifically, the movement is exponential: e st , where s is the eigenvalue. Each eigenvalue is associated with an eigenvector. Say we have two crossing eigenvectors, one of which is associated with eigenvalue s1 and one associated with eigenvalue s2 . Both s1 , s2 are negative, which means that e st shrinks, meaning that both eigenvectors “point” inwards (note that the fixed point is marked with *): s_1 | v | --->-*-<---s_2 | ^ | That is, we have a bowl shape. If instead both eigenvalues were positive, we’d have an upside-down bowl. If one were positive and one were negative, we’d have a saddle. Of course in practice, the forms (bowl, upside-down bowl, saddle) are rarely this neat and tidy, but often we use these as (linear) approximations when looking locally (i.e. “zoomed in” on a particular region). When looking at a larger scale, we instead must resort to nonlinear mathematics - the eigenvectors typically aren’t “straight” at larger scales; they may become curvy. When a fixed point’s unstable eigenvector (that is, the one moving away from the fixed point) connects to the stable eigenvector of another fixed point (that is, the eigenvector moving into the other fixed point), that is called a heteroclinic orbit. For example (the relevant part has double-arrows, the weird hump is meant to be a curve to show that these eigenvectors are linear only locally around each fixed point): | | v | -->>-/ ---<-*->>-/ v \ | \-->>-*-<-- | | ^ ^ | | On the other hand, if a fixed point’s unstable eigenvector, in the large scale, loops back and connects to its stable eigenvector, that is called a homoclinic orbit. 662 CHAPTER 25. NONLINEAR DYNAMICS 663 25.2. FLOWS /-<<--\ | | | | | ^ v | | / ---<-*->>--/ | ^ | We call these larger structures (i.e. when looking beyond just the local eigenvectors, but rather the full curves that connect them) the stable or unstable manifolds of a fixed point. They are like nonlinear generalizations of eigenvectors in that they are invariant manifolds; that is, a state that starts on one of these manifolds stays on the manifold. They start out tangent to the eigenvectors (which is why we just use eigenvectors locally), but as mentioned before, they “curve” out depending on the dynamics landscape. Growth/movement along these manifolds is also exponential, like it is for eigenvectors. If all manifolds are stable, you have a fixed point (some kind of bowl, roughly speaking, but nonlinear). If all manifolds are unstable, you have a fixed point (some kind of upside-down bowl, roughly speaking). Also note: a nonlinear system can have any number of attractors, of all types (fixed points, limit cycles/periodic orbits, quasiperiodic orbits [not discussed in this class], chaotic attractors) scattered throughout its state space, but there is no way of knowing a priori where they are and what type they are (or even how many there are). 
Every point in the state space is in the basin of attraction of some attractor. The basins of attraction and the basin boundaries partition the state space. 25.2.2 More on ODEs An nth-order ODE can be broken up into n 1st-order ODEs. For example, take the ODE for a simple harmonic oscillator (a mass on a spring): mx ′′ + βx ′ + kx − mg = 0 This is a 2nd-order ODE. We can break it down into 1st-order ODEs like so: 1. Isolate the highest-order term: mg − βx ′ − kx x = m ′′ CHAPTER 25. NONLINEAR DYNAMICS 663 25.2. FLOWS 664 2. Then define a helper variable: x′ = v 3. Rewrite the whole equation using the helper variable: v′ = g − β k v− x m m We have actually defined two first-order ODEs (that is, it is a 2D ODE system), which we can represent as a vector: [ ] [ x′ = v′ g− v β mv − ] k mx There are no derivatives on the right-hand side, which is how we want things to be. The derivatives are isolated and the right-hand side just captures the dynamics of the system. The vector on the left-hand side is called a state vector. Here we started with a 2nd-order ODE so we only required one helper variable. More generally, for an nth-order ODE, you require n − 1 helper variables. Note that at least 3 dimensions is necessary for a chaotic system. The general form for an nth-order ODE system is as follows: ẋ1 = f1 (x1 , . . . , xn ) ẋ2 = f2 (x1 , . . . , xn ) .. . ẋn = fn (x1 , . . . , xn ) (As a reminder, ẋ is another notation for the derivative of x.) The state variables can be represented as a state vector: x1 x2 ⃗ x = ... xn This system defines a vector field. For every value of ⃗ x , we can compute ⃗ x˙ = f⃗(⃗ x ), which tells us the slope at that point (i.e. which way is downhill, and how steep it is). 664 CHAPTER 25. NONLINEAR DYNAMICS 665 25.2. FLOWS For linear systems, matrices can describe how a “ball rolls in a landscape” (e.g. bowls, saddles, etc). The description is only good locally for nonlinear systems, as mentioned earlier. For example, consider the following 2D linear system expressed with ODEs: ẋ1 = ax1 + bx2 ẋ2 = cx1 + dx2 This can be re-written as: x ⃗ x˙ = A⃗ [ x1 ⃗ x= x2 ] [ a b A= c d ] So the matrix A describes the dynamics of the system. But with a nonlinear system, we cannot write down such a matrix A and have only numbers in it. 25.2.3 Reminder on distinction b/w difference and differential equations A differential equation f⃗ takes a state vector ⃗ x and gives us ⃗ x˙ , that is, the derivative of ⃗ x. A difference equation f⃗ takes a state vector ⃗ xn and gives us the state vector at the next (discrete) xn+1 . time step, ⃗ 25.2.4 ODE Solvers An ODE solver takes as input: • an ODE • initial conditions, ⃗ x (t = t0 ) • a time difference ∆t And gives as output an estimate of ⃗ x (t0 + ∆t). There are different methods of doing this, but a common one is Forward Euler, sometimes just called Euler’s method or “follow the slope” - as it says, you just follow the slope to the next point. But how far do you follow the slope? There may be a lot of “bumps” in the landscape in which case following the slope at one point may become inaccurate after some distance (e.g. it may “overstep”). Shorter steps are computationally more expensive, since you must re-calculate the slope more frequently, but gives greater accuracy. For an ODE solver, this step size is controlled via the ∆t input. These two factors - the shape of the landscape and the time step - are main contributors to error here. CHAPTER 25. NONLINEAR DYNAMICS 665 25.2. 
FLOWS 666 For Forward Euler, the estimate of ⃗ x (t0 + ∆t) is computed as follows: ⃗ x (t0 + ∆t) = ⃗ x (t0 ) + ∆t · ⃗ x ′ (t) A related method is Backward Euler: ⃗ x (t0 + ∆t) = ⃗ x (t0 ) + ∆t · ⃗ xF′ E (t0 + ∆t) Where ⃗ xF E (t0 + ∆t)′ is not the derivative at the original point, but rather the derivative of the point reached after one time step of Forward Euler. Intuitively, this is like taking a “test step”, computing the derivative there, moving back to the start, and then and moving based on the derivative computed from the test step. Note that Forward Euler and Backward Euler have numerical damping effects. For Backward Euler, it is positive damping, so it acts sort of like friction; for Forward Euler it is negative. The results of these computational precision errors, however, are indistinguishable from natural effects, which makes them difficult to deal with. Note that Forward Euler is equivalent to the first part of a Taylor series, which is also used to approximate a point locally: 1 1 f (x0 + ∆x) = f (x0 ) + ∆x(f ′ (x0 )) + (∆x)2 (f ′′ (x0 )) + . . . (∆x n )f n (x0 ) 2 n! There are also other errors such as floating point errors - e.g. truncation or roundoff errors, depending on how the are handled. This is common with sensors. These