Towards complexity of real-world structures: A Survey of Machine Learning Techniques
Arshia Cont, Musical Representations Team, IRCAM. [email protected]
ATIAM 2012-13, http://repmus.ircam.fr/atiam_ml_2012

Artificial Intelligence (AI)
• What is Artificial Intelligence? by John McCarthy, http://www-formal.stanford.edu/jmc/whatisai/ : "After WWII, a number of people independently started to work on intelligent machines. The English mathematician Alan Turing may have been the first. He gave a lecture on it in 1947. He also may have been the first to decide that AI was best researched by programming computers rather than by building machines. By the late 1950s, there were many researchers on AI, and most of them were basing their work on programming computers."
• The ant-colony example: "The complex behavior of the ant colony is due to the complexity of its environment and not to the complexity of the ants themselves. This suggests the adaptive behavior of learning and representation and the path the science of the artificial should take." (H.A. Simon, The Sciences of the Artificial, MIT Press, 1969)

Machine Learning and AI
• Machine Learning deals with sub-problems in engineering and the sciences rather than the global "intelligence" issue: it is a set of well-defined approaches, each valid within its own limits, that can be applied to a problem set (classification, pattern recognition, sequential reasoning, induction, parameter estimation, etc.).
• Our goal today is to introduce some well-known and well-established approaches in AI and Machine Learning.
• The methods presented today are not domain-specific, but for every problem you start with a design, collect related data and then define the learning problem. We will not get into design today. Keep in mind that:
• AI is an empirical science! (See "The Sciences of the Artificial" by H.A. Simon, MIT Press, 1969.)
• DO NOT apply algorithms blindly to your data/problem set!
• Avoid the MATLAB-toolbox syndrome: examine the hypotheses and limitations of each approach before hitting enter!
• Do not forget your own intelligence!

Pattern recognition in action
• Examples: instrument classification, music genre classification, audio-to-score alignment (score following), automatic improvisation, gesture recognition, music structure discovery, concatenative synthesis (unit selection), artist recovery, room acoustics (parameter adaptation), physical modeling (model learning/adaptation).
• The pattern recognition design cycle.

Machine Learning
• Is an empirical science.
• Is extremely diverse.
• Has a profound theoretical background.
• Provides tools and reasoning for the design process of a given problem.
• Should keep you honest (and not the contrary!).
• Course objective: to get familiar with Machine Learning tools and reasoning, and to prepare you for attacking real-world problems in music processing.

Sample Example (I)
• This is a typical classification problem, borrowed from communication theory: a bit is transmitted by a source, corrupted by noise in the channel, and received by a decoder.
• Y → channel → X, where X is usually referred to as the observation and is a random variable.
• Question: what should an optimal decoder do to recover Y from X?
• The Bayes decision rule is usually highly intuitive: here it amounts to a threshold on 0.5.
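To make the threshold decoder concrete, here is a minimal simulation sketch (my own illustration, not from the slides): a bit is corrupted by additive Gaussian noise and recovered by thresholding at 0.5. The parameter names and values (sigma, n_bits) are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

n_bits = 100_000          # number of transmitted bits (assumed)
sigma = 0.4               # noise standard deviation (assumed)

y = rng.integers(0, 2, size=n_bits)          # transmitted bits, P(0) = P(1) = 1/2
x = y + rng.normal(0.0, sigma, size=n_bits)  # observation: bit plus Gaussian noise

y_hat = (x > 0.5).astype(int)                # intuitive decoder: threshold at 0.5

print("empirical error rate:", np.mean(y_hat != y))
```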
Sample Example (I), continued
• But let's make life more difficult! In most problems the real state of the world (y) is not observable to us, so we try to infer it from the observation.
• Simple Solution 1 (generative): define a decision function g(x) that predicts the state of the world (y) by modeling how the observations are generated,
  P_{X|Y}(x|0) = G(x, 0, σ),  P_{X|Y}(x|1) = G(x, 1, σ),  P_Y(0) = P_Y(1) = 1/2.
  I am thus assuming a family of models that generate X given Y (the inverse problem), and I learn it!
• Simple Solution 2 (discriminative): try to find an optimal boundary, defined by g(x), that can best separate the two classes, and define the decision function as the + or − distance from this boundary. I am thus assuming a family of g(x) that discriminate the classes.
• We have just seen two different philosophies to solve our simple problem: the generative design and the discriminative design.
• In the real world things are not as simple. Consider the corresponding 2-dimensional problem: it is not hard to see the difficulty.
  1. To what extent does our solution generalize to new data? The central aim of designing a classifier is to correctly classify novel input!
  2. How do we know when we have collected an adequately large and representative set of examples for training?
  3. How can we decide model complexity versus performance?
Sample Example (II)
• This is a typical regression problem: polynomial curve fitting.
• Fit a polynomial to the observed points by minimizing the sum-of-squares error function.
• 0th-, 1st- and 3rd-order polynomials: the low orders under-fit, while the 3rd order follows the data reasonably.
• 9th-order polynomial: it passes through every training point but oscillates wildly between them — over-fitting. The root-mean-square (RMS) error on held-out data exposes this.
• Effect of the data-set size: with many more points, even the 9th-order polynomial behaves well.
• Regularization: penalize large coefficient values; a 9th-order polynomial with such a penalty behaves much like a lower-order one.

Sample Example (III)
• I want my computer to learn the "style" of Bach and to generate new Bach music that has not existed before.
• Harder to imagine... but we'll soon get there!

Important Questions
• Given that we have learned what we want: if my g(x) can predict well on the data I have, will it also predict well on other sources of X I have not seen before? To what extent does the learned knowledge apply to the whole world outside? How does my learning generalize? (Generalization)
• Does having more data necessarily mean I learn better? No.
• Does having more complex models necessarily improve learning? No — over-fitting; hence regularization.
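As a concrete illustration of over-fitting and regularization in the curve-fitting example, here is a minimal sketch of my own (not from the slides); the sinusoidal target, noise level and penalty weight are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: a few noisy samples of a smooth target (assumed sine target).
n = 10
x = np.linspace(0.0, 1.0, n)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, size=n)

def design_matrix(x, order):
    """Polynomial design matrix: column j holds x**j."""
    return np.vander(x, order + 1, increasing=True)

def fit(x, t, order, lam=0.0):
    """Least-squares polynomial fit with an optional ridge penalty lam*||w||^2."""
    Phi = design_matrix(x, order)
    return np.linalg.solve(Phi.T @ Phi + lam * np.eye(order + 1), Phi.T @ t)

def rms_error(w, x, t):
    Phi = design_matrix(x, len(w) - 1)
    return np.sqrt(np.mean((Phi @ w - t) ** 2))

x_test = np.linspace(0.0, 1.0, 100)
t_test = np.sin(2 * np.pi * x_test)

for order, lam in [(3, 0.0), (9, 0.0), (9, 1e-3)]:
    w = fit(x, t, order, lam)
    print(f"order={order}, lambda={lam:g}: "
          f"train RMS={rms_error(w, x, t):.3f}, test RMS={rms_error(w, x_test, t_test):.3f}")
```

Typically the unregularized 9th-order fit has a tiny training error but a much larger test error, while the regularized fit trades a little training error for a far better test error.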
Machine Learning Families: Three Types of Learning
• Imagine an organism or machine which experiences a series of sensory inputs x1, x2, x3, x4, ...
• Supervised learning: the machine is also given desired outputs y1, y2, ..., and its goal is to learn to produce the correct output given a new input.
• Unsupervised learning: the goal of the machine is to build a model of x that can be used for reasoning, decision making, predicting things, communicating, etc.
• Reinforcement learning: the machine can also produce actions a1, a2, ... which affect the state of the world, and receives rewards (or punishments) r1, r2, .... Its goal is to learn to act in a way that maximises rewards in the long term.

Goals of Supervised Learning
• Classification: the desired outputs yi are discrete class labels. The goal is to classify new inputs correctly (i.e. to generalize).
• Regression: the desired outputs yi are continuous valued. The goal is to predict the output accurately for new inputs.

Machine Learning Models
1. Generative Learning
• We start with the hypothesis that a family of parametric models can generate X given Y.
• The notion of a prior model: at the core of Bayesian learning, and the subject of ongoing and historical philosophical debates.
• Pros: we can incorporate our own belief and knowledge into the model and eventually test and refine it; in most cases it simplifies the mathematical structure of the problem; guaranteed solutions exist in many situations!
• Cons: tautology?!; curse of dimensionality.
2. Discriminative Learning
• We do not assume a model over the data, but assume a form for how the classes are separated from each other, and fit it to discriminate the classes.
• Neural networks, kernel methods, Support Vector Machines, etc.
• Pros: no curse of dimensionality (in most cases); good when you cannot formally describe the hidden generative process.
• Cons: prior knowledge about discriminant factors is hard to imagine/justify; less appealing for applications where generation is also important; for complicated problems they "seem" less intuitive than generative methods.

Probability Theory
• A probabilistic model of the data can be used to: make inferences about missing inputs; generate predictions/fantasies; make decisions with minimized expected loss; communicate the data in an efficient way.
• Statistical modeling is equivalent to other views of learning: information theoretic (finding compact representations of the data); physics (minimizing the free energy of a corresponding mechanical system).
• If not probabilities, what else? The knowledge-engineering approach vs. the empirical-induction approach; the domain of probabilities vs. the domain of possibilities (fuzzy logic); logic-based AI; ...

Basic Rules of Probability
• Probabilities are non-negative: P(x) ≥ 0 for all x.
• Probabilities normalise: Σ_x P(x) = 1 for discrete distributions and ∫ p(x) dx = 1 for probability densities.
• The joint probability of x and y is P(x, y).
• Sum rule (marginal probability): P(x) = Σ_y P(x, y).
• Product rule (conditional probability): P(x|y) = P(x, y) / P(y), i.e. P(x, y) = P(x) P(y|x) = P(y) P(x|y).
• Bayes' theorem: P(y|x) = P(x|y) P(y) / P(x), i.e. posterior ∝ likelihood × prior.
• Independence: random variables X and Y are independent if p(X|Y) = p(X).
• Further building blocks reviewed on the slides: probability densities, expectations, conditional expectations, approximate (sample-based) expectations, variances and covariances, and the Gaussian distribution with its mean and variance.
• In this language, the generative approach models P(x|y) and P(y) and uses Bayes' theorem to obtain P(y|x) (as in the in-summary example above, with Gaussian class-conditionals and equal priors); the discriminative approach models P(y|x), i.e. the boundary g(x), directly.
• Warning: I will not be obsessively careful in my use of p and P for probability density and probability distribution; it should be obvious from context.
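A minimal numerical illustration of the sum rule, product rule and Bayes' theorem on a small discrete joint distribution (my own toy numbers, not from the slides):

```python
import numpy as np

# Joint distribution P(x, y) over x in {0, 1} (rows) and y in {0, 1} (columns).
# Toy numbers chosen only so that the table sums to 1.
P_xy = np.array([[0.3, 0.1],
                 [0.2, 0.4]])

P_x = P_xy.sum(axis=1)              # sum rule: P(x) = sum_y P(x, y)
P_y = P_xy.sum(axis=0)              # sum rule: P(y) = sum_x P(x, y)

P_y_given_x = P_xy / P_x[:, None]   # product rule: P(y | x) = P(x, y) / P(x)
P_x_given_y = P_xy / P_y[None, :]   # product rule: P(x | y) = P(x, y) / P(y)

# Bayes' theorem: P(y | x) = P(x | y) P(y) / P(x)
bayes = P_x_given_y * P_y[None, :] / P_x[:, None]

print(np.allclose(bayes, P_y_given_x))   # True: both routes agree
```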
Bayesian Decision Theory
• A framework for computing optimal decisions on problems involving uncertainty (probabilities).
• Basic concepts:
  • The world has states or classes, drawn from a random variable Y.
    Instrument classification: Y ∈ {violin, piano, trumpet, drums, ...}.
    Audio-to-score alignment: Y ∈ {note1, chord2, note3, trill4, ...}.
  • The observer measures observations (features), drawn from a random process X.
    Instrument classification: X = MFCC features ∈ R^n.

Bayesian (MAP) classifier
• Question: how do we choose the best class given the data?
• Minimize the probability of error by choosing the maximum a posteriori (MAP) class:
  ω̂ = argmax_i Pr(ω_i | x).
• Intuitively right: choose the most probable class in light of the observation.
• But we don't know Pr(ω_i | x). Given models for each class-conditional distribution p(x | ω_i), apply Bayes' rule:
  ω̂ = argmax_i [ p(x | ω_i) Pr(ω_i) / Σ_j p(x | ω_j) Pr(ω_j) ].
• The denominator p(x) is the same over all ω_i, hence
  ω̂ = argmax_i p(x | ω_i) Pr(ω_i).
  (Slide credit: Musical Content, Dan Ellis, "1: Pattern Recognition", 2003-03-18.)

Basics of Bayesian Decision Theory
• Let's now go back to our sample example (I). We need:
  • Class probabilities: in the absence of any other information, let's say P_Y(0) = P_Y(1) = 1/2.
  • Class-conditional densities: the noise results from thermal processes, i.e. a lot of independent events that add up; by the central limit theorem it is reasonable to assume that the noise is Gaussian. Denote a Gaussian random variable of mean μ and variance σ² by N(μ, σ²); the Gaussian probability density function is
    P_X(x) = G(x, μ, σ) = (1 / √(2πσ²)) exp( −(x − μ)² / (2σ²) ).
• Intuitively, the decision rule can be a threshold T: decide Y = 0 if x < T, and Y = 1 if x ≥ T.
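A minimal sketch of the MAP rule for two classes with Gaussian class-conditionals (my own illustration; the means, noise level and priors are assumptions). It also shows how an unequal prior shifts the decision.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

def map_decision(x, means, sigma, priors):
    """Pick the class maximizing p(x | class) * P(class)."""
    scores = [gaussian_pdf(x, m, sigma) * p for m, p in zip(means, priors)]
    return int(np.argmax(scores))

means = [0.0, 1.0]   # class-conditional means (assumed)
sigma = 0.4          # shared noise standard deviation (assumed)

x = 0.55
print(map_decision(x, means, sigma, priors=[0.5, 0.5]))  # equal priors: picks class 1
print(map_decision(x, means, sigma, priors=[0.9, 0.1]))  # a strong prior for class 0 shifts the decision
```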
Basics of Bayesian Decision Theory: Example
• Since the noise is Gaussian and is simply added to the signal, X = Y + ε with ε ~ N(0, σ²): in both cases X corresponds to a constant (Y) plus zero-mean Gaussian noise, which simply adds Y to the mean of the Gaussian.
• In summary:
  P_{X|Y}(x|0) = G(x, 0, σ),  P_{X|Y}(x|1) = G(x, 1, σ),  P_Y(0) = P_Y(1) = 1/2.

The Bayesian Decision Rule (BDR)
• To compute the Bayesian decision rule (or MAP decision) we use log probabilities:
  i*(x) = argmax_i [ log P_{X|Y}(x|i) + log P_Y(i) ].
• This is intuitive: we pick the class that "best explains" (gives the higher probability to) the observation.
• Terms which are constant can be dropped; hence, if the priors are equal,
  i*(x) = argmax_i log P_{X|Y}(x|i).
• In this case we can solve visually (pick 0 on the left, pick 1 on the right), but the mathematical solution is equally simple.

BDR: the more general case
• Now consider
  P_{X|Y}(x|0) = G(x, μ0, σ),  P_{X|Y}(x|1) = G(x, μ1, σ),
  for which
  i*(x) = argmax_i log [ (1/√(2πσ²)) exp( −(x − μ_i)² / (2σ²) ) ]
        = argmax_i [ −(1/2) log(2πσ²) − (x − μ_i)² / (2σ²) ]
        = argmin_i (x − μ_i)²
        = argmin_i (x² − 2xμ_i + μ_i²)
        = argmin_i (−2xμ_i + μ_i²).
• The optimal decision is therefore: pick 0 if
  −2xμ0 + μ0² < −2xμ1 + μ1²  ⇔  2x(μ1 − μ0) < μ1² − μ0²  ⇔  x < (μ0 + μ1)/2  (for μ1 > μ0).
• For a problem with Gaussian classes, equal variances and equal class probabilities, the optimal decision boundary is the threshold at the mid-point between the two means.
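As a quick numerical check of the claim that the mid-point threshold is optimal (a sketch of my own; the means, noise level and sample size are assumptions), one can sweep candidate thresholds and compare empirical error rates:

```python
import numpy as np

rng = np.random.default_rng(2)
mu0, mu1, sigma, n = 0.0, 1.0, 0.6, 200_000   # assumed problem parameters

y = rng.integers(0, 2, size=n)
x = np.where(y == 0, mu0, mu1) + rng.normal(0.0, sigma, size=n)

thresholds = np.linspace(-0.5, 1.5, 201)
errors = [np.mean((x >= t).astype(int) != y) for t in thresholds]

best = thresholds[int(np.argmin(errors))]
print("best empirical threshold:", best, "  mid-point:", (mu0 + mu1) / 2)
```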
Basics of Bayesian Decision Theory: "Yeah! So what?"
• What is the point of going through all the math? Bayesian decision theory keeps you honest:
  • Now we know that the intuitive threshold is actually optimal, and in which sense it is optimal (minimum probability of error).
  • It forces us to make all our assumptions explicit. Even for this trivial problem we have made lots of assumptions:
    • uniform class probabilities, P_Y(0) = P_Y(1) = 1/2;
    • Gaussianity, P_{X|Y}(x|i) = G(x, μ_i, σ_i);
    • the variance is the same under the two states, σ_i = σ for all i;
    • the noise is additive, X = Y + ε.
• In practice we never have one variable but a vector of observations: multivariate Gaussians.
• Priors are not uniform in general.

Gaussian Classifiers
• The BDR for the case of multivariate Gaussian classes: with
  P_{X|Y}(x|i) = (1 / √((2π)^d |Σ_i|)) exp( −(1/2) (x − μ_i)^T Σ_i^{-1} (x − μ_i) ),
  the BDR
  i*(x) = argmax_i [ log P_{X|Y}(x|i) + log P_Y(i) ]
  becomes
  i*(x) = argmin_i [ d_i(x, μ_i) + α_i ],
  with the Mahalanobis distance d_i(x, y) = (x − y)^T Σ_i^{-1} (x − y) and α_i = log( (2π)^d |Σ_i| ) − 2 log P_Y(i).
• The optimal rule is to assign x to the closest class, where "closest" is measured with the Mahalanobis distance d_i(x, μ_i), to which the constant α_i is added to account for the class prior.
• In detail, for a common covariance Σ_i = Σ:
  i*(x) = argmin_i [ (x − μ_i)^T Σ^{-1} (x − μ_i) − 2 log P_Y(i) ]
        = argmin_i [ x^T Σ^{-1} x − 2 μ_i^T Σ^{-1} x + μ_i^T Σ^{-1} μ_i − 2 log P_Y(i) ]
        = argmax_i [ μ_i^T Σ^{-1} x − (1/2) μ_i^T Σ^{-1} μ_i + log P_Y(i) ]
        = argmax_i [ w_i^T x + w_{i0} ].
• In summary, the Gaussian classifier is
  i*(x) = argmax_i g_i(x), with g_i(x) = w_i^T x + w_{i0}, w_i = Σ^{-1} μ_i and w_{i0} = −(1/2) μ_i^T Σ^{-1} μ_i + log P_Y(i):
  the BDR is a linear function, i.e. a linear discriminant.

Group Homework 0
• Find the geometric equation of the hyperplane separating the two classes for the linear discriminant of the Gaussian classifier.
• Hint: this is the set of x such that g_i(x) = g_j(x).

The role of the prior
• The prior can offset the "threshold" value (in our simple example).
• Geometric interpretation: the boundary hyperplane in 1, 2 and 3 dimensions, for various prior configurations.

The role of the covariance
• So far, our covariance matrices were simple (shared across classes). If they are different, the nice hyperplane of the 2-class problem becomes a hyper-quadratic boundary.

Bayesian Decision Theory: advantages and problems
• Advantages:
  • The BDR is optimal and cannot be beaten.
  • Bayes keeps you honest.
  • Models reflect a causal interpretation of the problem, or how we think.
  • Natural decomposition into "what we knew already" (prior) and "what the data tells us" (observation); no need for heuristics to combine these two sources of information.
  • The BDR is intuitive.
• Problems:
  • The BDR is optimal ONLY if the models are correct!

Maximum Likelihood Estimation: WHAT???
• We do have an optimal (and geometric) solution,
  i*(x) = argmax_i [ μ_i^T Σ^{-1} x − (1/2) μ_i^T Σ^{-1} μ_i + log P_Y(i) ],
  but we do not know the values of the parameters μ_i, Σ, P_Y(i).
• We have to estimate these values from a training set; for example, use the average value as an estimate for the mean. A small sketch of this appears below.
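A minimal sketch of the shared-covariance Gaussian classifier, with parameters estimated from training data by sample averages (my own illustration; the class means, covariance and toy data are assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy 2-D training data: two Gaussian classes with a shared covariance (assumed parameters).
mu = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
Sigma_true = np.array([[1.0, 0.3], [0.3, 0.5]])
n_per_class = 500
X = np.vstack([rng.multivariate_normal(m, Sigma_true, n_per_class) for m in mu])
y = np.repeat([0, 1], n_per_class)

# Parameter estimation by averages: class means, pooled covariance, class priors.
means = [X[y == i].mean(axis=0) for i in (0, 1)]
pooled = sum(np.cov(X[y == i].T, bias=True) * np.sum(y == i) for i in (0, 1)) / len(y)
priors = [np.mean(y == i) for i in (0, 1)]

Sigma_inv = np.linalg.inv(pooled)

def discriminant(x, i):
    """Linear discriminant g_i(x) = w_i^T x + w_i0."""
    w = Sigma_inv @ means[i]
    w0 = -0.5 * means[i] @ Sigma_inv @ means[i] + np.log(priors[i])
    return w @ x + w0

x_new = np.array([1.0, 0.5])
print("assigned class:", int(discriminant(x_new, 1) > discriminant(x_new, 0)))
```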
Maximum Likelihood
• This seems pretty serious: how should I get these probabilities then? We rely on the maximum likelihood (ML) principle, which has three main steps:
  1. Choose a parametric model for all probabilities (as a function of unknown parameters). To make this clear, we denote the vector of parameters by θ and the class-conditional distributions by P_{X|Y}(x | i; θ). Note that θ is NOT a random variable (otherwise it would have to show up as a subscript); it is simply a parameter, and the probabilities are a function of this parameter.
  2. Assemble a training data-set: a collection of data-sets D^(i) = {x_1^(i), ..., x_n^(i)} of examples drawn independently from class i. Each sample D^(i) is considered independently, and the parameter θ_i is estimated only from the sample D^(i); we simply repeat the procedure for all classes, so from now on we omit the class variable.
  3. Select the parameters of class i to be the ones that maximize the probability of the data from that class:
     θ_i* = argmax_θ P_{X|Y}(D^(i) | i; θ) = argmax_θ log P_{X|Y}(D^(i) | i; θ),
     or, dropping the class, θ* = argmax_θ P_X(D; θ) = argmax_θ log P_X(D; θ).
• The function P_X(D; θ) is called the likelihood of the parameter θ with respect to the data, or simply the likelihood function. As before, it makes no difference whether we maximize probabilities or their logs.

Maximum Likelihood: the mechanics
• Given a sample, to obtain the ML estimate we need to solve θ* = argmax_θ P_X(D; θ).
• When θ is a scalar this is high-school calculus: we have a maximum when the first derivative is zero and the second derivative is negative.
• In higher dimensions, the generalization of the derivative is the gradient. The gradient of a function f at z is
  ∇f(z) = ( ∂f/∂x_0(z), ..., ∂f/∂x_{n−1}(z) )^T.
  It has a nice geometric interpretation: it points in the direction of maximum growth of the function and is perpendicular to the contours where the function is constant.
• The extension of the second-order derivative is the Hessian matrix
  ∇²f(x) = [ ∂²f / ∂x_i ∂x_j ],  i, j = 0, ..., n−1.
• In an ML setup we have a maximum when the gradient is zero and the Hessian is negative definite, i.e. v^T ∇²f(x) v < 0 for all v ≠ 0.

Maximum Likelihood: in summary
1. Choose a parametric model for the probabilities, P_X(x; θ).
2. Assemble a data-set D = {X_1, ..., X_n} of independently drawn examples.
3. Select the parameters that maximize the probability of the data,
   θ* = argmax_θ P_X(D; θ) = argmax_θ log P_X(D; θ);
   the solutions are the parameters such that ∇_θ P_X(D; θ*) = 0 and ∇²_θ P_X(D; θ*) is negative definite.
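As a worked instance of these three steps (a standard result, not spelled out in the slides): for a Gaussian model with known variance, the ML estimate of the mean is the sample average, which is exactly the informal estimate used above.

```latex
\log P_X(\mathcal{D};\mu)
  = \sum_{i=1}^{n}\log G(x_i,\mu,\sigma)
  = -\frac{n}{2}\log(2\pi\sigma^2)
    - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2

\frac{\partial}{\partial\mu}\log P_X(\mathcal{D};\mu)
  = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i-\mu) = 0
  \;\Longrightarrow\;
  \hat{\mu}_{\mathrm{ML}} = \frac{1}{n}\sum_{i=1}^{n}x_i ,
\qquad
\frac{\partial^2}{\partial\mu^2}\log P_X(\mathcal{D};\mu)
  = -\frac{n}{\sigma^2} < 0 .
```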
ML + BDR
• Going back to our simple classification problem, we can combine ML estimation and the Bayesian decision rule: pick class i if
  i*(x) = argmax_i P_{X|Y}(x | i; θ_i*) P_Y(i),  where θ_i* = argmax_θ P_{X|Y}(D^(i) | i, θ).

Group Homework I (for next class): Sample Example II, Polynomial Regression
• Two random variables X and Y, and a data-set of examples D = {(x_1, y_1), ..., (x_n, y_n)}.
• A parametric model of the form y = f(x; θ) + ε, where ε ~ N(0, σ²); concretely, f(x; θ) = Σ_{i=0}^{K} θ_i x^i, so that the data are distributed as P_{Y|X}(y | x; θ) = G(y, f(x; θ), σ).
• Show that the ML estimate is θ* = (Φ^T Φ)^{-1} Φ^T y, where y = (y_1, ..., y_n)^T and Φ is the n × (K+1) design matrix whose i-th row is (1, x_i, ..., x_i^K).

Estimators
• We now know how to produce estimators using maximum likelihood. How do we evaluate an estimator? Using bias and variance.
• Bias: a measure of how far the expected value of the estimator is from the true value. If θ̂ = f(X_1, ..., X_n), then
  Bias(θ̂) = E_{X_1,...,X_n}[ f(X_1, ..., X_n) ] − θ.
  An estimator that has bias will usually not converge to the perfect estimate, no matter how large the data-set is!
• Example: the ML estimator for the mean of a Gaussian, X_i ~ N(μ, σ²), μ̂ = (1/n) Σ_i X_i:
  Bias(μ̂) = E[μ̂] − μ = (1/n) Σ_i E[X_i] − μ = μ − μ = 0;
  the estimator is thus unbiased.
• Variance: given a good bias, how many sample points do we need?
  Var(θ̂) = E_{X_1,...,X_n}[ ( f(X_1, ..., X_n) − E_{X_1,...,X_n}[ f(X_1, ..., X_n) ] )² ].
  The variance usually decreases with more training examples.
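A quick empirical check of these two notions for the sample-mean estimator (my own sketch; the true mean, noise level and sample sizes are assumptions): the bias stays near zero while the variance shrinks roughly as σ²/n.

```python
import numpy as np

rng = np.random.default_rng(4)
mu, sigma = 1.5, 2.0        # true parameters (assumed)
n_trials = 5000             # number of repeated experiments

for n in (10, 100, 1000):
    estimates = rng.normal(mu, sigma, size=(n_trials, n)).mean(axis=1)
    print(f"n={n:5d}  bias={estimates.mean() - mu:+.4f}  "
          f"variance={estimates.var():.4f}  sigma^2/n={sigma**2 / n:.4f}")
```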
Bayesian Parameter Estimation
• Bayesian parameter estimation is an alternative framework for parameter estimation. There is a fundamental difference between Bayesian and ML methods: the long debate between frequentists and Bayesians.
• To understand this, we need to distinguish between two components: the definition of probability (intact) and the assessment of probability (which differs). We need to review these fundamentals.

Probability Measures
• The definition does not change between the frequentist and Bayesian philosophies. A probability measure satisfies three axioms:
  • P(A) ≥ 0 for all events A;
  • P(universal event) = 1;
  • if A ∩ B = ∅ then P(A ∪ B) = P(A) + P(B).

Frequentist vs Bayesian
• The difference is in interpretation.
• Frequentist view: probabilities are relative frequencies. This makes sense when we have a lot of observations (no bias). Problems: in most cases we do not have a large number of observations; in most cases probabilities are not objective; this is not usually how people behave.
• Bayesian view: probabilities are subjective (not equal to relative counts); they are degrees of belief in the outcome of the experiment.

Bayes vs. ML
• The difference with ML: Θ is a random variable.
• Basic concepts:
  • a training set D = {x_1, ..., x_n} of examples drawn independently;
  • a probability density for the observations given the parameter, P_{X|Θ}(x | θ);
  • a prior distribution over parameter configurations, P_Θ(θ), which encodes our prior belief about θ.
• Goal: compute the posterior distribution P_{Θ|T}(θ | D).
• Optimal estimate: under ML there is one "best" estimate; under Bayes there is no "best" estimate — it makes no sense under Bayes to talk about a "best" estimate.
• Predictions: we do not really care about the parameters themselves, only about the models they build, and models can be used to make predictions. Unlike ML, Bayes uses ALL the information in the training set to make predictions.

Bayes vs ML: decision rules
• Consider the BDR under the "0–1" loss and an independent sample D = {x_1, ..., x_n}.
• ML-BDR: pick i if
  i*(x) = argmax_i P_{X|Y}(x | i; θ_i*) P_Y(i),  where θ_i* = argmax_θ P_{X|Y}(D | i, θ).
  Two steps: (i) find θ*; (ii) plug it into the BDR. All information not captured by θ* is lost and not used at decision time; e.g. we cannot even know how good an estimate θ* is unless we run multiple experiments and measure bias/variance.
• Bayesian BDR: under the Bayesian framework, everything is conditioned on the training data. Denote by T = {X_1, ..., X_n} the set of random variables from which the training sample D = {x_1, ..., x_n} is drawn. Pick i if
  i*(x) = argmax_i P_{X|Y,T}(x | i, D_i) P_Y(i).
  The decision is conditioned on the entire training set.
• To compute the conditional probabilities we use the marginalization equation
  P_{X|Y,T}(x | i, D_i) = ∫ P_{X|Θ,Y,T}(x | θ, i, D_i) P_{Θ|Y,T}(θ | i, D_i) dθ
                        = ∫ P_{X|Θ,Y}(x | θ, i) P_{Θ|Y,T}(θ | i, D_i) dθ,
  where the simplification uses the fact that, once the parameter value is known, x no longer depends on T (e.g. X|Θ ~ N(μ, σ²)).
• As before, the bottom equation is repeated for each class, so we can drop the dependence on the class and consider the more general problem of estimating
  P_{X|T}(x | D) = ∫ P_{X|Θ}(x | θ) P_{Θ|T}(θ | D) dθ.
• Once again this can be done in two steps (per class): (i) find P_{Θ|T}(θ | D_i); (ii) compute P_{X|Y,T}(x | i, D_i) and plug it into the BDR. No training information is lost.

Predictive Distribution
• The distribution
  P_{X|T}(x | D) = ∫ P_{X|Θ}(x | θ) P_{Θ|T}(θ | D) dθ
  is known as the predictive distribution. It allows us to predict the value of x given ALL the information in the training set.
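To make the marginalization concrete, here is a minimal numerical sketch of my own (the numbers are assumptions): for a Gaussian observation model with unknown mean and a Gaussian posterior over that mean, the predictive density is approximated by integrating over a grid of parameter values. For Gaussians the exact predictive is itself Gaussian, N(mu_n, sigma² + sigma_n²), and the grid integral matches it.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

sigma = 0.5                     # observation noise (assumed known)
mu_n, sigma_n = 0.8, 0.2        # posterior over the mean, P(theta | D) (assumed values)

theta = np.linspace(-3.0, 4.0, 2001)             # grid over the unknown mean
posterior = gaussian_pdf(theta, mu_n, sigma_n)   # P(theta | D)
dtheta = theta[1] - theta[0]

def predictive(x):
    """P(x | D) ~= sum_j P(x | theta_j) P(theta_j | D) * dtheta (Riemann sum over the grid)."""
    return np.sum(gaussian_pdf(x, theta, sigma) * posterior) * dtheta

x = 1.2
print(predictive(x), gaussian_pdf(x, mu_n, np.sqrt(sigma**2 + sigma_n**2)))
```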
Bayes vs. ML
• ML picks one model; Bayes averages over all models.
• ML is a special case of Bayes: ML ≈ Bayes when we are very confident about the model, i.e. when the prior is narrow, or when the sample is quite large (given a lot of training data, there is little uncertainty).
• Bayes regularizes the ML estimate!

MAP approximation
• This sounds good, so why use ML at all? The main problem with Bayes is that the integral
  P_{X|T}(x | D) = ∫ P_{X|Θ}(x | θ) P_{Θ|T}(θ | D) dθ
  can be quite nasty. In practice one is frequently forced to use approximations.
• One possibility is to do something similar to ML, i.e. pick only one model. This can be made to account for the prior by picking the model that has the largest posterior probability given the training data:
  θ_MAP = argmax_θ P_{Θ|T}(θ | D) = argmax_θ P_{T|Θ}(D | θ) P_Θ(θ),
  which can usually be computed.
• This corresponds to approximating the posterior by a delta function centered at its maximum:
  P_{X|T}(x | D) ≈ ∫ P_{X|Θ}(x | θ) δ(θ − θ_MAP) dθ = P_{X|Θ}(x | θ_MAP).
• The BDR then becomes: pick i if
  i*(x) = argmax_i P_{X|Y}(x | i; θ_i^MAP) P_Y(i),  where θ_i^MAP = argmax_θ P_{T|Y,Θ}(D | i, θ) P_{Θ|Y}(θ | i).
• Compared to ML, this has the advantage of still accounting for the prior (although only approximately).

MAP vs ML
• ML-BDR: pick i if i*(x) = argmax_i P_{X|Y}(x | i; θ_i*) P_Y(i), where θ_i* = argmax_θ P_{X|Y}(D | i, θ).
• Bayes MAP-BDR: pick i if i*(x) = argmax_i P_{X|Y}(x | i; θ_i^MAP) P_Y(i), where θ_i^MAP = argmax_θ P_{T|Y,Θ}(D | i, θ) P_{Θ|Y}(θ | i).
• The difference is non-negligible only when the data-set is small; there are better alternative approximations.

Bayesian Learning: Summary
• Apply the basic rules of probability to learning from data.
• Data set: D = {x_1, ..., x_n}. Models: m, m′, etc. Model parameters: θ.
• Prior probabilities on models: P(m), P(m′), etc. Prior probabilities on model parameters: e.g. P(θ | m). Model of the data given the parameters: P(x | θ, m).
• If the data are independently and identically distributed, then P(D | θ, m) = Π_{i=1}^{n} P(x_i | θ, m).
• Posterior probability of the model parameters: P(θ | D, m) = P(D | θ, m) P(θ | m) / P(D | m).
• Posterior probability of the models: P(m | D) = P(m) P(D | m) / P(D).

Example: a communications problem
• A signal is sent through the atmosphere to a receiver, with two states:
  Y = 0: transmit signal s = −μ0;  Y = 1: transmit signal s = +μ0.
• Noise model: X = s + ε, with ε ~ N(0, σ²).
• The BDR is: pick "0" if x < (−μ0 + μ0)/2 = 0, the mid-point between the two means.
• This is optimal and everything works wonderfully, but one day we get a phone call: the receiver is generating a lot of errors!
• There is a calibration mode: the rover can send a test sequence, but it is expensive and can only send a few bits. If everything is normal, the received means should be +μ0 and −μ0.

Example: combining two forms of information
• Action: ask the system to transmit a few 1s and measure X; compute the ML estimate of the mean of X,
  μ̂ = (1/n) Σ_i x_i.
• Result: the estimate is different from μ0. We need to combine two forms of information:
  • our prior is that μ ~ N(μ0, σ0²);
  • our "data-driven" estimate is that X ~ N(μ̂, σ²).

Bayesian solution
• Gaussian likelihood (observations): P_{T|μ}(D | μ) = G(D, μ, σ²), with σ² known.
• Gaussian prior (what we know): P_μ(μ) = G(μ, μ0, σ0²), where μ0 and σ0² are known hyper-parameters.
• We need to compute the posterior distribution for μ:
  P_{μ|T}(μ | D) = P_{T|μ}(D | μ) P_μ(μ) / P_T(D).
Bayesian solution, continued
• The posterior distribution is Gaussian,
  P_{μ|T}(μ | D) = G(μ, μ_n, σ_n²),
  with
  μ_n = ( nσ0² / (σ² + nσ0²) ) μ̂_ML + ( σ² / (σ² + nσ0²) ) μ0,  where μ̂_ML = (1/n) Σ_i x_i,
  and
  1/σ_n² = n/σ² + 1/σ0²,  i.e.  σ_n² = σ²σ0² / (σ² + nσ0²).
• For free, Bayes also gives us a measure of the uncertainty of our estimate, σ_n². Note that 1/σ² is a measure of precision, so the variance formula should be read as
  P_Bayes = P_ML + P_prior:
  the Bayesian precision is greater than both that of ML and that of the prior. This is intuitive.

Observations
• 1) The precision increases with n and the variance goes to zero: 1/σ_n² = n/σ² + 1/σ0² → ∞ as n → ∞. We are guaranteed that, in the limit of infinite data, we converge to a single estimate. This always happens for Bayesian solutions.
• 2) Write μ_n = α_n μ̂ + (1 − α_n) μ0 with α_n = nσ0² / (σ² + nσ0²) ∈ [0, 1]. For large n the likelihood term dominates the prior term (α_n → 1 as n → ∞) and the solution is equivalent to that of ML; for small n the prior dominates (α_n → 0 as n → 0).
• 3) For a given n: if σ0² >> σ², i.e. we really don't know what μ is a priori, then μ_n ≈ μ̂_ML; on the other hand, if σ0² << σ², i.e. we are very certain a priori, then μ_n ≈ μ0.
• In summary, the Bayesian estimate combines the prior beliefs with the evidence provided by the data,
  P_{μ|T}(μ | D) ∝ Π_i P_{X|μ}(x_i | μ) P_μ(μ),
  in a very intuitive manner.

Conjugate priors
• Note that the prior P_μ(μ) = G(μ, μ0, σ0²) is Gaussian, and the posterior P_{μ|T}(μ | D) = G(μ, μ_n, σ_n) is Gaussian as well.
• Whenever this is the case (the posterior is in the same family as the prior) we say that P_Θ(θ) is a conjugate prior for the likelihood P_{X|Θ}(x | θ), and that the posterior P_{Θ|T}(θ | D) is the reproducing density.
• A number of likelihoods have conjugate priors, for example:
  Bernoulli likelihood — Beta conjugate prior;
  Poisson likelihood — Gamma conjugate prior;
  Exponential likelihood — Gamma conjugate prior;
  Normal likelihood (known mean, unknown precision) — Gamma conjugate prior.
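A minimal sketch of this conjugate Gaussian update (my own illustration; the true mean, prior and noise level are assumptions), checking that the posterior precision is the sum of the prior and ML precisions:

```python
import numpy as np

rng = np.random.default_rng(5)

sigma = 1.0                 # known observation noise (assumed)
mu0, sigma0 = 0.0, 2.0      # prior N(mu0, sigma0^2) on the unknown mean (assumed)
true_mu = 1.3               # "real" mean used only to generate calibration data (assumed)

n = 5                       # only a few calibration measurements
x = rng.normal(true_mu, sigma, size=n)

mu_ml = x.mean()

# Conjugate Gaussian update, using the formulas above.
alpha = n * sigma0**2 / (n * sigma0**2 + sigma**2)
mu_n = alpha * mu_ml + (1 - alpha) * mu0
var_n = 1.0 / (n / sigma**2 + 1.0 / sigma0**2)

print("ML estimate:        ", mu_ml)
print("posterior mean:     ", mu_n)
print("posterior precision:", 1 / var_n, "= n/sigma^2 + 1/sigma0^2 =", n / sigma**2 + 1 / sigma0**2)
```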
Group Homework 2: Histogram Problem
• Imagine a random variable X such that P_X(k) = π_k, k ∈ {1, ..., N}. Suppose we draw n independent observations from X and form a random vector C = (C_1, ..., C_N)^T, where C_k is the number of times the observed value is k. C is then a histogram and has a multinomial distribution:
  P_{C_1,...,C_N}(c_1, ..., c_N) = ( n! / Π_{k=1}^{N} c_k! ) Π_{j=1}^{N} π_j^{c_j}.
  Note that π = (π_1, ..., π_N) are probabilities and thus π_i ≥ 0 and Σ_i π_i = 1.
1. Derive the ML estimate for the parameters π_k, k ∈ {1, ..., N}.
   Hint: if you know about Lagrange multipliers, use them! Otherwise, keep in mind that minimizing a function f(a, b) subject to the constraint a + b = 1 is equivalent to minimizing f(a, 1 − a).
2. Derive the MAP solution using Dirichlet priors. One possible prior model over π is the Dirichlet distribution:
  P_{Π_1,...,Π_N}(π_1, ..., π_N) = ( Γ(Σ_{j=1}^{N} u_j) / Π_{j=1}^{N} Γ(u_j) ) Π_{j=1}^{N} π_j^{u_j − 1},
  where u = (u_1, ..., u_N) is the set of hyper-parameters (prior parameters to choose) and Γ(x) = ∫_0^∞ e^{-t} t^{x−1} dt is the Gamma function.
  You should show that the posterior is equal to
  P_{Π|C}(π | c) = ( Γ(Σ_{j=1}^{N} (c_j + u_j)) / Π_{j=1}^{N} Γ(c_j + u_j) ) Π_{j=1}^{N} π_j^{c_j + u_j − 1}.
3. Compare the MAP estimator with that of ML in part (1). What is the role of this prior compared to ML?
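To experiment with the setup of this homework (without solving it), one can simulate the histogram vector C; a minimal sketch of my own, with assumed values of π and n:

```python
import numpy as np

rng = np.random.default_rng(6)

pi = np.array([0.1, 0.2, 0.3, 0.4])   # true parameters pi_k for N = 4 outcomes (assumed), summing to 1
n = 1000                               # number of independent draws

c = rng.multinomial(n, pi)             # the histogram vector C = (C_1, ..., C_N)

print("counts c:             ", c)
print("empirical frequencies:", c / n)
print("true probabilities pi:", pi)
```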