Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Machine Learning Generative verses discriminative classifier Eric Xing Lecture 2, August 12, 2010 Reading: © Eric Xing @ CMU, 2006-2010 1 Generative and Discriminative classifiers Goal: Wish to learn f: X Y, e.g., P(Y|X) Generative: Modeling the joint distribution of all data Discriminative: Modeling only points at the boundary © Eric Xing @ CMU, 2006-2010 2 Generative vs. Discriminative Classifiers Goal: Wish to learn f: X Y, e.g., P(Y|X) Generative classifiers (e.g., Naïve Bayes): Assume some functional form for P(X|Y), P(Y) This is a ‘generative’ model of the data! Estimate parameters of P(X|Y), P(Y) directly from training data Use Bayes rule to calculate P(Y|X= x) Discriminative classifiers (e.g., logistic regression) Directly assume some functional form for P(Y|X) This is a ‘discriminative’ model of the data! Yn Xn Yn Xn Estimate parameters of P(Y|X) directly from training data © Eric Xing @ CMU, 2006-2010 3 Suppose you know the following … Class-specific Dist.: P(X|Y) p( X | Y 1) p1 ( X ; 1 , 1 ) p( X | Y 2) p2 ( X ; 2 , 2 ) Class prior (i.e., "weight"): P(Y) This is a generative model of the data! © Eric Xing @ CMU, 2006-2010 Bayes classifier: 4 Optimal classification Theorem: Bayes classifier is optimal! That is How to learn a Bayes classifier? Recall density estimation. We need to estimate P(X|y=k), and P(y=k) for all k © Eric Xing @ CMU, 2006-2010 5 Gaussian Discriminative Analysis learning f: X Y, where X is a vector of real-valued features, Xn= < Xn,1…Xn,m > Y is an indicator vector Yn What does that imply about the form of P(Y|X)? The joint probability of a datum and its label is: Xn p(x n , ynk 1 | , ) p( ynk 1) p(x n | ynk 1, , ) k 1 2 1 exp ( x ) 2 n k 2 (2 2 )1/ 2 Given a datum xn, we predict its label using the conditional probability of the label given the datum: k 1 2 1 exp ( x ) 2 n k 2 (2 2 )1/ 2 k p ( yn 1 | x n , , ) 1 exp 21 2 (x n k ' ) 2 k' 2 1/ 2 (2 ) k' © Eric Xing @ CMU, 2006-2010 6 Conditional Independence X is conditionally independent of Y given Z, if the probability distribution governing X is independent of the value of Y, given the value of Z Which we often write e.g., Equivalent to: © Eric Xing @ CMU, 2006-2010 7 Naïve Bayes Classifier When X is multivariate-Gaussian vector: Yn The joint probability of a datum and it label is: p(x n , ynk 1 | , ) p( ynk 1) p(x n | ynk 1, , ) k Xn T 1 1 1 exp ( x ) ( x n k n k) 2 1/ 2 (2 ) The naïve Bayes simplification Yn p(x n , y 1 | , ) p( y 1) p( xn , j | y 1, k , j , k , j ) k n k n k n j k j 1 (2 ) 2 1/ 2 k, j exp - 212 ( xn , j - k , j ) 2 k,j Xn,1 Xn,2 … Xn,m m More generally: p(x n , yn | , ) p( yn | ) p( xn, j | yn , ) j 1 Where p(. | .) is an arbitrary conditional (discrete or continuous) 1-D density © Eric Xing @ CMU, 2006-2010 8 The predictive distribution Understanding the predictive distribution k p ( y 1 , x | , , ) n n p( ynk 1 | xn , , , ) p( xn | , ) Under naïve Bayes assumption: p( ynk 1 | xn , , , ) k N ( xn , | k , k ) * k ' k ' N ( xn , | k ' , k ' ) 1 j j 2 ( x ) log C n k k, j 2 2 k , j k exp j 1 j j 2 exp ( x ) log C k ' k ' j 2 2 n k ' k ', j k ', j ** For two class (i.e., K=2), and when the two classes haves the same variance, ** turns out to be a logistic function 1 1 p( yn1 1 | xn ) 1 1e 1 2 exp 1 exp j 21 2j ( xnj -2j )2 log j C 1 j j 2 ( xn - 1 ) log j C j 2 2 j 1 exp j xnj 1 2j ( 1j - 2j ) 12 ([ 1j ]2 - [ 2j ]2 ) log j (1 1 ) 1 T xn © Eric Xing @ CMU, 2006-2010 9 The decision boundary The predictive distribution p( yn1 1 | xn ) 1 M 1 exp j xnj 0 j 1 1 1 e T xn The Bayes decision rule: 1 1 T xn p ( y n 1 | xn ) 1 e ln ln T p( yn2 1 | xn ) e xn T 1 e xn Tx n For multiple class (i.e., K>2), * correspond to a softmax function p( ynk 1 | xn ) e kT xn e Tj xn j © Eric Xing @ CMU, 2006-2010 10 Bayesian estimation © Eric Xing @ CMU, 2006-2010 11 Generative vs. Discriminative Classifiers Goal: Wish to learn f: X Y, e.g., P(Y|X) Generative classifiers (e.g., Naïve Bayes): Assume some functional form for P(X|Y), P(Y) This is a ‘generative’ model of the data! Estimate parameters of P(X|Y), P(Y) directly from training data Use Bayes rule to calculate P(Y|X= x) Discriminative classifiers: Directly assume some functional form for P(Y|X) This is a ‘discriminative’ model of the data! Yi Xi Yi Xi Estimate parameters of P(Y|X) directly from training data © Eric Xing @ CMU, 2006-2010 12 Linear Regression The data: ( x1 , y1 ), ( x2 , y2 ), ( x3 , y3 ),, ( xN , yN ) Both nodes are observed: X is an input vector Y is a response vector Yi Xi N (we first consider y as a generic continuous response vector, then we consider the special case of classification where y is a discrete indicator) A regression scheme can be used to model p(y|x) directly, rather than p(x,y) © Eric Xing @ CMU, 2006-2010 13 Linear Regression Assume that Y (target) is a linear function of X (features): e.g.: let's assume a vacuous "feature" X0=1 (this is the intercept term, why?), and define the feature vector to be: then we have the following general representation of the linear function: Our goal is to pick the optimal We seek . How! that minimize the following cost function: 1 n J ( ) ( yˆ i ( xi ) yi )2 2 i 1 © Eric Xing @ CMU, 2006-2010 14 The Least-Mean-Square (LMS) method Consider a gradient descent algorithm: j t 1 j t t Now we have the following descent rule: j J ( ) j t 1 n T j ( yi x i t ) xij t i 1 For a single training point, we have: j t 1 j t ( yi xi T t ) xij This is known as the LMS update rule, or the Widrow-Hoff learning rule This is actually a "stochastic", "coordinate" descent algorithm This can be used as a on-line algorithm © Eric Xing @ CMU, 2006-2010 15 Probabilistic Interpretation of LMS Let us assume that the target variable and the inputs are related by the equation: yi T xi i where ε is an error term of unmodeled effects or random noise Now assume that ε follows a Gaussian N(0,σ), then we have: ( yi T xi )2 1 p( yi | xi ; ) exp 2 2 2 By independence assumption: n T 2 ( y x ) 1 i i L( ) p( yi | xi ; ) exp i 1 2 2 2 i 1 n n © Eric Xing @ CMU, 2006-2010 16 Probabilistic Interpretation of LMS, cont. Hence the log-likelihood is: l ( ) n log Do you recognize the last term? Yes it is: 1 1 1 n 2 i 1 ( yi T xi )2 2 2 1 n T J ( ) (xi yi )2 2 i 1 Thus under independence assumption, LMS is equivalent to MLE of θ ! © Eric Xing @ CMU, 2006-2010 17 Classification and logistic regression © Eric Xing @ CMU, 2006-2010 18 The logistic function © Eric Xing @ CMU, 2006-2010 19 Logistic regression (sigmoid classifier) The condition distribution: a Bernoulli p( y | x) ( x) y (1 ( x))1 y where is a logistic function ( x) 1 1e T x We can used the brute-force gradient method as in LR But we can also apply generic laws by observing the p(y|x) is an exponential family function, more specifically, a generalized linear model (see future lectures …) © Eric Xing @ CMU, 2006-2010 20 Training Logistic Regression: MCLE Estimate parameters =<0, 1, ... m> to maximize the conditional likelihood of training data Training data Data likelihood = Data conditional likelihood = © Eric Xing @ CMU, 2006-2010 21 Expressing Conditional Log Likelihood Recall the logistic function: and conditional likelihood: © Eric Xing @ CMU, 2006-2010 22 Maximizing Conditional Log Likelihood The objective: Good news: l() is concave function of Bad news: no closed-form solution to maximize l() © Eric Xing @ CMU, 2006-2010 23 The Newton’s method Finding a zero of a function © Eric Xing @ CMU, 2006-2010 24 The Newton’s method (con’d) To maximize the conditional likelihood l(): since l is convex, we need to find * where l’(*)=0 ! So we can perform the following iteration: © Eric Xing @ CMU, 2006-2010 25 The Newton-Raphson method In LR the is vector-valued, thus we need the following generalization: is the gradient operator over the function H is known as the Hessian of the function © Eric Xing @ CMU, 2006-2010 26 The Newton-Raphson method In LR the is vector-valued, thus we need the following generalization: is the gradient operator over the function H is known as the Hessian of the function © Eric Xing @ CMU, 2006-2010 27 Iterative reweighed least squares (IRLS) Recall in the least square est. in linear regression, we have: which can also derived from Newton-Raphson Now for logistic regression: © Eric Xing @ CMU, 2006-2010 28 Logistic regression: practical issues NR (IRLS) takes O(N+d3) per iteration, where N = number of training cases and d = dimension of input x, but converge in fewer iterations Quasi-Newton methods, that approximate the Hessian, work faster. Conjugate gradient takes O(Nd) per iteration, and usually works best in practice. Stochastic gradient descent can also be used if N is large c.f. perceptron rule: © Eric Xing @ CMU, 2006-2010 29 Generative vs. Discriminative Classifiers Goal: Wish to learn f: X Y, e.g., P(Y|X) Generative classifiers (e.g., Naïve Bayes): Assume some functional form for P(X|Y), P(Y) This is a ‘generative’ model of the data! Estimate parameters of P(X|Y), P(Y) directly from training data Use Bayes rule to calculate P(Y|X= x) Discriminative classifiers: Directly assume some functional form for P(Y|X) This is a ‘discriminative’ model of the data! Yi Xi Yi Xi Estimate parameters of P(Y|X) directly from training data © Eric Xing @ CMU, 2006-2010 30 Naïve Bayes vs Logistic Regression Consider Y boolean, X continuous, X=<X1 ... Xm> Number of parameters to estimate: NB: LR: 1 2 ( x ) log C j k, j k, j 2 2 k , j p ( y | x) ** 1 k ' k ' exp j 2 2 ( x j k ', j ) 2 log k ', j C k ', j k exp j ( x) 1 1e T x Estimation method: NB parameter estimates are uncoupled LR parameter estimates are coupled © Eric Xing @ CMU, 2006-2010 31 Naïve Bayes vs Logistic Regression Asymptotic comparison (# training examples infinity) when model assumptions correct NB, LR produce identical classifiers when model assumptions incorrect LR is less biased – does not assume conditional independence therefore expected to outperform NB © Eric Xing @ CMU, 2006-2010 32 Naïve Bayes vs Logistic Regression Non-asymptotic analysis (see [Ng & Jordan, 2002] ) convergence rate of parameter estimates – how many training examples needed to assure good estimates? NB order log m (where m = # of attributes in X) LR order m NB converges more quickly to its (perhaps less helpful) asymptotic estimates © Eric Xing @ CMU, 2006-2010 33 Rate of convergence: logistic regression Let hDis,n be logistic regression trained on n examples in m dimensions. Then with high probability: Implication: if we want for some small constant 0, it suffices to pick order m examples Convergences to its asymptotic classifier, in order m examples result follows from Vapnik’s structural risk bound, plus fact that the "VC Dimension" of an m-dimensional linear separators is m © Eric Xing @ CMU, 2006-2010 34 Rate of convergence: naïve Bayes parameters Let any 1, d>0, and any n 0 be fixed. Assume that for some fixed r0 > 0, we have that Let Then with probability at least 1-d, after n examples: 1. For discrete input, for all i and b 2. For continuous inputs, for all i and b © Eric Xing @ CMU, 2006-2010 35 Some experiments from UCI data sets © Eric Xing @ CMU, 2006-2010 36 Case study Dataset 20 News Groups (20 classes) 61,118 words, 18,774 documents Experiment: Solve only a two-class subset: 1 vs 2. 1768 instances, 61188 features. Use dimensionality reduction on the data (SVD). Use 90% as training set, 10% as test set. Test prediction error used as accuracy measure. © Eric Xing @ CMU, 2006-2010 37 Generalization error (1) Versus training size • 30 features. • A fixed test set • Training set varied from 10% to 100% of the training set © Eric Xing @ CMU, 2006-2010 38 Generalization error (2) Versus model size • Number of dimensions of the data varied from 5 to 50 in steps of 5 • The features were chosen in decreasing order of their singular values • 90% versus 10% split on training and test © Eric Xing @ CMU, 2006-2010 39 Summary Naïve Bayes classifier What’s the assumption Why we use it How do we learn it Logistic regression Functional form follows from Naïve Bayes assumptions For Gaussian Naïve Bayes assuming variance For discrete-valued Naïve Bayes too But training procedure picks parameters without the conditional independence assumption Gradient ascent/descent – General approach when closed-form solutions unavailable Generative vs. Discriminative classifiers – Bias vs. variance tradeoff © Eric Xing @ CMU, 2006-2010 40 Appendix © Eric Xing @ CMU, 2006-2010 41 Parameter Learning from iid Data Goal: estimate distribution parameters from a dataset of N independent, identically distributed (iid), fully observed, training cases D = {x1, . . . , xN} Maximum likelihood estimation (MLE) 1. One of the most common estimators 2. With iid and full-observability assumption, write L() as the likelihood of the data: L( ) P( x1, x2 ,, xN ; ) P( x; ) P( x2 ; ), , P( xN ; ) i 1 P( xi ; ) N 3. pick the setting of parameters most likely to have generated the data we saw: * arg max L( ) arg max log L( ) © Eric Xing @ CMU, 2006-2010 42 Example: Bernoulli model Data: We observed N iid coin tossing: D={1, 0, 1, …, 0} Representation: xn {0,1} Binary r.v: Model: How to write the likelihood of a single observation xi ? 1 for x 0 P( x) for x 1 P( x) x (1 )1 x P( xi ) xi (1 )1 xi The likelihood of datasetD={x1, …,xN}: N N i 1 i 1 N xi N 1 xi P( x1 , x2 ,..., xN | ) P( xi | ) xi (1 )1 xi i1 (1 ) i1 © Eric Xing @ CMU, 2006-2010 #head (1 ) # tails 43 Maximum Likelihood Estimation Objective function: l ( ; D) log P( D | ) log n (1 ) n nh log ( N nh ) log( 1 ) h t We need to maximize this w.r.t. Take derivatives wrt l nh N nh 0 1 MLE n h N or MLE 1 N x i i Frequency as sample mean Sufficient statistics The counts, nh , where nk xi , are sufficient statistics of data D i © Eric Xing @ CMU, 2006-2010 44 Overfitting Recall that for Bernoulli Distribution, we have head ML n head head n n tail What if we tossed too few times so that we saw zero head? head We have ML 0, and we will predict that the probability of seeing a head next is zero!!! The rescue: "smoothing" Where n' is know as the pseudo- (imaginary) count head ML n head n ' head n n tail n ' But can we make this more formal? © Eric Xing @ CMU, 2006-2010 45 Bayesian Parameter Estimation Treat the distribution parameters also as a random variable The a posteriori distribution of after seem the data is: p( | D) p( D | ) p( ) p( D | ) p( ) p ( D) p( D | ) p( )d This is Bayes Rule likelihood prior posterior marginal likelihood The prior p(.) encodes our prior knowledge about the domain © Eric Xing @ CMU, 2006-2010 46 Frequentist Parameter Estimation Two people with different priors p() will end up with different estimates p(|D). Frequentists dislike this “subjectivity”. Frequentists think of the parameter as a fixed, unknown constant, not a random variable. Hence they have to come up with different "objective" estimators (ways of computing from data), instead of using Bayes’ rule. These estimators have different properties, such as being “unbiased”, “minimum variance”, etc. The maximum likelihood estimator, is one such estimator. © Eric Xing @ CMU, 2006-2010 47 Discussion or p(), this is the problem! Bayesians know it © Eric Xing @ CMU, 2006-2010 48 Bayesian estimation for Bernoulli Beta distribution: P( ; , b ) ( b ) 1 (1 ) b 1 B( , b ) 1 (1 ) b 1 ( )( b ) When x is discrete ( x 1) x( x) x! Posterior distribution of : P( | x1 ,..., xN ) p( x1 ,..., xN | ) p( ) nh (1 ) nt 1 (1 ) b 1 nh 1 (1 ) nt b 1 p( x1 ,..., xN ) Notice the isomorphism of the posterior to the prior, such a prior is called a conjugate prior and b are hyperparameters (parameters of the prior) and correspond to the number of “virtual” heads/tails (pseudo counts) © Eric Xing @ CMU, 2006-2010 49 Bayesian estimation for Bernoulli, con'd Posterior distribution of : P( | x1 ,..., xN ) p( x1 ,..., xN | ) p( ) nh (1 ) nt 1 (1 ) b 1 nh 1 (1 ) nt b 1 p( x1 ,..., xN ) Maximum a posteriori (MAP) estimation: MAP arg max log P( | x1 ,..., xN ) Bata parameters can be understood as pseudo-counts Posterior mean estimation: Bayes p( | D)d C n 1 (1 ) n b 1 d h t nh N b Prior strength: A=+b A can be interoperated as the size of an imaginary data set from which we obtain the pseudo-counts © Eric Xing @ CMU, 2006-2010 50 Effect of Prior Strength Suppose we have a uniform prior (=b=1/2), Weak prior A = 2. Posterior prediction: and we observe n (nh 2,nt 8) 12 p( x h | nh 2, nt 8, '2) 0.25 2 10 Strong prior A = 20. Posterior prediction: 10 2 p( x h | nh 2, nt 8, '20) 0.40 20 10 However, if we have enough data, it washes away the prior. e.g., n (nh 200,nt 800). Then the estimates under 200 10200 weak and strong prior are 211000 and 20 , respectively, 1000 both of which are close to 0.2 © Eric Xing @ CMU, 2006-2010 51 Example 2: Gaussian density Data: We observed N iid real samples: D={-0.1, 10, 1, -5.2, …, 3} Model: Log likelihood: P( x) 2 2 1 / 2 exp ( x )2 / 2 2 N x N 1 l ( ; D) log P( D | ) log( 2 2 ) n 2 2 2 n1 2 MLE: take derivative and set to zero: l (1 / 2 )n xn l N 1 xn 2 2 2 4 n 2 2 © Eric Xing @ CMU, 2006-2010 1 xn N n 1 2 n xn ML N MLE 2 MLE 52 MLE for a multivariate-Gaussian It can be shown that the MLE for µ and Σ is 1 N 1 N MLE x MLE T n xn ML xn ML n n 1 S N x1T x2T X xT N where the scatter matrix is S n xn ML xn ML T x x N n T n nn xn1 2 x xn n xK n T ML ML The sufficient statistics are nxn and nxnxnT. Note that XTX=nxnxnT may not be full rank (eg. if N <D), in which case ΣML is not invertible © Eric Xing @ CMU, 2006-2010 53 Bayesian estimation Normal Prior: P( ) 202 1 / 2 exp ( 0 )2 / 2 02 Joint probability: P( x, ) 2 1 N 2 exp 2 xn 2 n 1 2 N /2 202 1 / 2 exp ( 0 )2 / 2 02 Posterior: P ( | x ) 2~2 where 1 / 2 exp ( ~)2 / 2~2 1 / 02 N / 2 ~ ~ 2 N 1 x , and 0 2 2 N / 2 1 / 02 N / 2 1 / 02 0 mean © Eric XingSample @ CMU, 2006-2010 1 54 Bayesian estimation: unknown µ, known σ 1 / 02 N / 2 N x 0 , 2 2 2 2 N / 1 /0 N / 1 /0 N 1 ~ 2 2 2 0 1 The posterior mean is a convex combination of the prior and the MLE, with weights proportional to the relative noise levels. The precision of the posterior 1/σ2N is the precision of the prior 1/σ20 plus one contribution of data precision 1/σ2 for each observed data point. Sequentially updating the mean µ∗ = 0.8 (unknown), (σ2)∗ = 0.1 (known) Effect of single data point 02 02 1 0 ( x 0 ) 2 x ( x 0 ) 2 02 02 Uninformative (vague/ flat) prior, σ20 →∞ N 0 © Eric Xing @ CMU, 2006-2010 55