Machine Learning
Generative versus Discriminative Classifiers
Eric Xing
Lecture 2, August 12, 2010
Reading:
© Eric Xing @ CMU, 2006-2010

Generative and Discriminative classifiers
• Goal: wish to learn f: X → Y, e.g., P(Y|X)
• Generative: modeling the joint distribution of all data
• Discriminative: modeling only points at the boundary

Generative vs. Discriminative Classifiers
• Goal: wish to learn f: X → Y, e.g., P(Y|X)
• Generative classifiers (e.g., Naïve Bayes):
  - Assume some functional form for P(X|Y), P(Y). This is a ‘generative’ model of the data!
  - Estimate parameters of P(X|Y), P(Y) directly from training data
  - Use Bayes rule to calculate P(Y|X = x)
• Discriminative classifiers (e.g., logistic regression):
  - Directly assume some functional form for P(Y|X). This is a ‘discriminative’ model of the data!
  - Estimate parameters of P(Y|X) directly from training data
[Figure: graphical models over Y_n and X_n for the two approaches]

Suppose you know the following …
• Class-specific distributions P(X|Y):
  $p(X \mid Y = 1) = p_1(X; \mu_1, \Sigma_1)$
  $p(X \mid Y = 2) = p_2(X; \mu_2, \Sigma_2)$
• Class prior (i.e., "weight"): P(Y)
• This is a generative model of the data!
• Bayes classifier:
  $$P(Y = k \mid X) = \frac{p(X \mid Y = k)\, P(Y = k)}{\sum_{k'} p(X \mid Y = k')\, P(Y = k')}$$

Optimal classification
• Theorem: the Bayes classifier is optimal!
  - That is, $\mathrm{error}_{\mathrm{true}}(h_{\mathrm{Bayes}}) \le \mathrm{error}_{\mathrm{true}}(h)$ for every classifier h
• How to learn a Bayes classifier?
  - Recall density estimation. We need to estimate P(X|y=k) and P(y=k) for all k

Gaussian Discriminative Analysis
• Learning f: X → Y, where
  - X is a vector of real-valued features, $X_n = \langle X_{n,1}, \dots, X_{n,m} \rangle$
  - Y is an indicator vector
• What does that imply about the form of P(Y|X)?
• The joint probability of a datum and its label is:
  $$p(x_n, y_n^k = 1 \mid \mu, \sigma) = p(y_n^k = 1)\, p(x_n \mid y_n^k = 1, \mu, \sigma) = \pi_k \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{-\frac{1}{2\sigma^2}(x_n - \mu_k)^2\right\}$$
• Given a datum $x_n$, we predict its label using the conditional probability of the label given the datum:
  $$p(y_n^k = 1 \mid x_n, \mu, \sigma) = \frac{\pi_k \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{-\frac{1}{2\sigma^2}(x_n - \mu_k)^2\right\}}{\sum_{k'} \pi_{k'} \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\left\{-\frac{1}{2\sigma^2}(x_n - \mu_{k'})^2\right\}}$$

Conditional Independence
• X is conditionally independent of Y given Z if the probability distribution governing X is independent of the value of Y, given the value of Z
• Which we often write: $P(X \mid Y, Z) = P(X \mid Z)$
• e.g.,
• Equivalent to: $P(X, Y \mid Z) = P(X \mid Z)\, P(Y \mid Z)$

Naïve Bayes Classifier
• When X is a multivariate-Gaussian vector, the joint probability of a datum and its label is:
  $$p(x_n, y_n^k = 1 \mid \mu, \Sigma) = p(y_n^k = 1)\, p(x_n \mid y_n^k = 1, \mu, \Sigma) = \pi_k \frac{1}{(2\pi)^{m/2} |\Sigma|^{1/2}} \exp\left\{-\frac{1}{2}(x_n - \mu_k)^T \Sigma^{-1} (x_n - \mu_k)\right\}$$
• The naïve Bayes simplification:
  $$p(x_n, y_n^k = 1 \mid \mu, \sigma) = p(y_n^k = 1) \prod_j p(x_{n,j} \mid y_n^k = 1, \mu_{k,j}, \sigma_{k,j}) = \pi_k \prod_j \frac{1}{(2\pi\sigma_{k,j}^2)^{1/2}} \exp\left\{-\frac{1}{2\sigma_{k,j}^2}(x_{n,j} - \mu_{k,j})^2\right\}$$
[Figure: graphical model with Y_n as the parent of X_{n,1}, X_{n,2}, …, X_{n,m}]
• More generally:
  $$p(x_n, y_n \mid \eta, \pi) = p(y_n \mid \pi) \prod_{j=1}^{m} p(x_{n,j} \mid y_n, \eta)$$
  where p(. | .) is an arbitrary conditional (discrete or continuous) 1-D density

The predictive distribution
• Understanding the predictive distribution:
  $$p(y_n^k = 1 \mid x_n, \mu, \Sigma, \pi) = \frac{p(y_n^k = 1, x_n \mid \mu, \Sigma, \pi)}{p(x_n \mid \mu, \Sigma)} = \frac{\pi_k N(x_n \mid \mu_k, \Sigma_k)}{\sum_{k'} \pi_{k'} N(x_n \mid \mu_{k'}, \Sigma_{k'})} \quad (*)$$
• Under the naïve Bayes assumption (with $C = \frac{1}{2}\log 2\pi$):
  $$p(y_n^k = 1 \mid x_n, \mu, \Sigma, \pi) = \frac{\pi_k \exp\left\{-\sum_j \left(\frac{1}{2\sigma_{k,j}^2}(x_n^j - \mu_k^j)^2 + \log \sigma_{k,j} + C\right)\right\}}{\sum_{k'} \pi_{k'} \exp\left\{-\sum_j \left(\frac{1}{2\sigma_{k',j}^2}(x_n^j - \mu_{k'}^j)^2 + \log \sigma_{k',j} + C\right)\right\}} \quad (**)$$
• For two classes (i.e., K = 2), and when the two classes have the same variance, (**) turns out to be a logistic function:
  $$p(y_n^1 = 1 \mid x_n) = \frac{1}{1 + \frac{\pi_2 \exp\left\{-\sum_j \left(\frac{1}{2\sigma_j^2}(x_n^j - \mu_2^j)^2 + \log \sigma_j + C\right)\right\}}{\pi_1 \exp\left\{-\sum_j \left(\frac{1}{2\sigma_j^2}(x_n^j - \mu_1^j)^2 + \log \sigma_j + C\right)\right\}}}$$
  $$= \frac{1}{1 + \exp\left\{-\sum_j x_n^j \frac{1}{\sigma_j^2}(\mu_1^j - \mu_2^j) + \sum_j \frac{1}{2\sigma_j^2}\left([\mu_1^j]^2 - [\mu_2^j]^2\right) + \log\frac{1 - \pi_1}{\pi_1}\right\}} = \frac{1}{1 + e^{-\theta^T x_n}}$$

The decision boundary
• The predictive distribution:
  $$p(y_n^1 = 1 \mid x_n) = \frac{1}{1 + \exp\left\{-\sum_{j=1}^{M} \theta_j x_n^j - \theta_0\right\}} = \frac{1}{1 + e^{-\theta^T x_n}}$$
• The Bayes decision rule:
  $$\ln \frac{p(y_n^1 = 1 \mid x_n)}{p(y_n^2 = 1 \mid x_n)} = \ln \frac{\frac{1}{1 + e^{-\theta^T x_n}}}{\frac{e^{-\theta^T x_n}}{1 + e^{-\theta^T x_n}}} = \theta^T x_n$$
• For multiple classes (i.e., K > 2), (*) corresponds to a softmax function:
  $$p(y_n^k = 1 \mid x_n) = \frac{e^{\theta_k^T x_n}}{\sum_j e^{\theta_j^T x_n}}$$

Bayesian estimation

Generative vs. Discriminative Classifiers
• Goal: wish to learn f: X → Y, e.g., P(Y|X)
• Generative classifiers (e.g., Naïve Bayes):
  - Assume some functional form for P(X|Y), P(Y). This is a ‘generative’ model of the data!
  - Estimate parameters of P(X|Y), P(Y) directly from training data
  - Use Bayes rule to calculate P(Y|X = x)
• Discriminative classifiers:
  - Directly assume some functional form for P(Y|X). This is a ‘discriminative’ model of the data!
  - Estimate parameters of P(Y|X) directly from training data
[Figure: graphical models over Y_i and X_i for the two approaches]

Linear Regression
• The data: $(x_1, y_1), (x_2, y_2), (x_3, y_3), \dots, (x_N, y_N)$
• Both nodes are observed:
  - X is an input vector
  - Y is a response vector (we first consider y as a generic continuous response vector, then we consider the special case of classification where y is a discrete indicator)
• A regression scheme can be used to model p(y|x) directly, rather than p(x,y)

Linear Regression
• Assume that Y (target) is a linear function of X (features):
  - e.g.:
  - Let's assume a vacuous "feature" $X^0 = 1$ (this is the intercept term, why?), and define the feature vector to be: $x = (1, x^1, \dots, x^m)^T$
  - Then we have the following general representation of the linear function: $\hat{y}(x) = \theta^T x$
• Our goal is to pick the optimal θ. How?
  - We seek the θ that minimizes the following cost function:
  $$J(\theta) = \frac{1}{2} \sum_{i=1}^{n} (\hat{y}_i(x_i) - y_i)^2$$

The Least-Mean-Square (LMS) method
• Consider a gradient descent algorithm:
  $$\theta_j^{t+1} = \theta_j^t - \alpha \left.\frac{\partial}{\partial \theta_j} J(\theta)\right|_{t}$$
• Now we have the following descent rule:
  $$\theta_j^{t+1} = \theta_j^t + \alpha \sum_{i=1}^{n} (y_i - x_i^T \theta^t)\, x_i^j$$
• For a single training point, we have:
  $$\theta_j^{t+1} = \theta_j^t + \alpha (y_i - x_i^T \theta^t)\, x_i^j$$
  - This is known as the LMS update rule, or the Widrow-Hoff learning rule
  - This is actually a "stochastic", "coordinate" descent algorithm
  - This can be used as an on-line algorithm

Probabilistic Interpretation of LMS
• Let us assume that the target variable and the inputs are related by the equation:
  $$y_i = \theta^T x_i + \varepsilon_i$$
  where ε is an error term of unmodeled effects or random noise
• Now assume that ε follows a Gaussian $N(0, \sigma^2)$; then we have:
  $$p(y_i \mid x_i; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(y_i - \theta^T x_i)^2}{2\sigma^2}\right)$$
• By the independence assumption:
  $$L(\theta) = \prod_{i=1}^{n} p(y_i \mid x_i; \theta) = \left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^{n} \exp\left(-\frac{\sum_{i=1}^{n} (y_i - \theta^T x_i)^2}{2\sigma^2}\right)$$
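
The batch descent rule and the single-point LMS (Widrow-Hoff) update above can be sketched in a few lines of plain Python. This is a minimal illustration, not code from the lecture: the function names (`predict`, `lms_batch`, `lms_online`), the toy data, the step size `alpha` and the iteration counts are all assumptions chosen for the example.

```python
import random


def predict(theta, x):
    """Linear prediction theta^T x; x is assumed to start with the constant feature 1."""
    return sum(t * xj for t, xj in zip(theta, x))


def lms_batch(xs, ys, alpha=0.01, steps=2000):
    """Batch rule: theta_j <- theta_j + alpha * sum_i (y_i - x_i^T theta) * x_i^j."""
    theta = [0.0] * len(xs[0])
    for _ in range(steps):
        grad = [0.0] * len(theta)
        for x, y in zip(xs, ys):
            err = y - predict(theta, x)
            for j, xj in enumerate(x):
                grad[j] += err * xj
        theta = [t + alpha * g for t, g in zip(theta, grad)]
    return theta


def lms_online(xs, ys, alpha=0.01, epochs=2000, seed=0):
    """Single-point rule: theta_j <- theta_j + alpha * (y_i - x_i^T theta) * x_i^j."""
    rng = random.Random(seed)
    theta = [0.0] * len(xs[0])
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the training points in random order
        for i in order:
            err = ys[i] - predict(theta, xs[i])
            theta = [t + alpha * err * xj for t, xj in zip(theta, xs[i])]
    return theta


# Hypothetical noiseless toy data generated from y = 1 + 2x.
xs = [[1.0, float(v)] for v in range(5)]
ys = [1.0 + 2.0 * x[1] for x in xs]

theta_batch = lms_batch(xs, ys)
theta_online = lms_online(xs, ys)
print(theta_batch, theta_online)  # both should approach [1, 2]
```

On noiseless data both variants head towards the same minimizer of J(θ). With noisy data and a fixed step size, the single-point version would typically keep fluctuating around that minimizer rather than settle on it.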

Probabilistic Interpretation of LMS, cont.
• Hence the log-likelihood is:
  $$l(\theta) = n \log \frac{1}{\sqrt{2\pi}\,\sigma} - \frac{1}{\sigma^2} \cdot \frac{1}{2} \sum_{i=1}^{n} (y_i - \theta^T x_i)^2$$
• Do you recognize the last term? Yes, it is:
  $$J(\theta) = \frac{1}{2} \sum_{i=1}^{n} (x_i^T \theta - y_i)^2$$
• Thus, under the independence assumption, LMS is equivalent to MLE of θ!

Classification and logistic regression

The logistic function
[Figure: plot of the logistic function]

Logistic regression (sigmoid classifier)
• The conditional distribution: a Bernoulli
  $$p(y \mid x) = \mu(x)^y (1 - \mu(x))^{1 - y}$$
  where μ is a logistic function
  $$\mu(x) = \frac{1}{1 + e^{-\theta^T x}}$$
• We can use the brute-force gradient method, as in linear regression
• But we can also apply generic laws by observing that p(y|x) is an exponential family function, more specifically, a generalized linear model (see future lectures …)

Training Logistic Regression: MCLE
• Estimate parameters $\theta = \langle \theta_0, \theta_1, \dots, \theta_m \rangle$ to maximize the conditional likelihood of the training data
• Training data: $D = \{(x_1, y_1), \dots, (x_N, y_N)\}$
• Data likelihood = $\prod_{i=1}^{N} P(x_i, y_i \mid \theta)$
• Data conditional likelihood = $\prod_{i=1}^{N} P(y_i \mid x_i, \theta)$

Expressing Conditional Log Likelihood
• Recall the logistic function:
  $$\mu(x) = \frac{1}{1 + e^{-\theta^T x}}$$
  and the conditional log-likelihood:
  $$l(\theta) = \sum_{i=1}^{N} \log P(y_i \mid x_i, \theta) = \sum_{i=1}^{N} \left[ y_i \log \mu(x_i) + (1 - y_i) \log(1 - \mu(x_i)) \right]$$

Maximizing Conditional Log Likelihood
• The objective: $\theta^* = \arg\max_\theta l(\theta)$
• Good news: l(θ) is a concave function of θ
• Bad news: there is no closed-form solution that maximizes l(θ)

Newton's method
• Finding a zero of a function f:
  $$\theta^{t+1} = \theta^t - \frac{f(\theta^t)}{f'(\theta^t)}$$

Newton's method (cont'd)
• To maximize the conditional likelihood l(θ): since l is concave, we need to find θ* where l'(θ*) = 0!
• So we can perform the following iteration:
  $$\theta^{t+1} = \theta^t - \frac{l'(\theta^t)}{l''(\theta^t)}$$

The Newton-Raphson method
• In logistic regression θ is vector-valued, thus we need the following generalization:
  $$\theta^{t+1} = \theta^t - H^{-1} \nabla_\theta l(\theta^t)$$
• ∇ is the gradient operator over the function
• H is known as the Hessian of the function

Iterative reweighted least squares (IRLS)
• Recall that in least-squares estimation for linear regression we have:
  $$\theta = (X^T X)^{-1} X^T y$$
  which can also be derived from Newton-Raphson
• Now for logistic regression:

Logistic regression: practical issues
• NR (IRLS) takes $O(Nd^2 + d^3)$ per iteration, where N = number of training cases and d = dimension of input x, but converges in fewer iterations
• Quasi-Newton methods, which approximate the Hessian, work faster
• Conjugate gradient takes O(Nd) per iteration, and usually works best in practice
• Stochastic gradient descent can also be used if N is large; cf. the perceptron rule:

Generative vs. Discriminative Classifiers
• Goal: wish to learn f: X → Y, e.g., P(Y|X)
• Generative classifiers (e.g., Naïve Bayes):
  - Assume some functional form for P(X|Y), P(Y). This is a ‘generative’ model of the data!
  - Estimate parameters of P(X|Y), P(Y) directly from training data
  - Use Bayes rule to calculate P(Y|X = x)
• Discriminative classifiers:
  - Directly assume some functional form for P(Y|X). This is a ‘discriminative’ model of the data!
  - Estimate parameters of P(Y|X) directly from training data
[Figure: graphical models over Y_i and X_i for the two approaches]

Naïve Bayes vs Logistic Regression
• Consider Y boolean, X continuous, $X = \langle X_1, \dots, X_m \rangle$
• Number of parameters to estimate:
  - NB:
  $$p(y \mid x) = \frac{\pi_k \exp\left\{-\sum_j \left(\frac{1}{2\sigma_{k,j}^2}(x_j - \mu_{k,j})^2 + \log \sigma_{k,j} + C\right)\right\}}{\sum_{k'} \pi_{k'} \exp\left\{-\sum_j \left(\frac{1}{2\sigma_{k',j}^2}(x_j - \mu_{k',j})^2 + \log \sigma_{k',j} + C\right)\right\}} \quad (**)$$
  - LR:
  $$\mu(x) = \frac{1}{1 + e^{-\theta^T x}}$$
• Estimation method:
  - NB parameter estimates are uncoupled
  - LR parameter estimates are coupled

Naïve Bayes vs Logistic Regression
• Asymptotic comparison (# training examples → infinity)
  - When the model assumptions are correct: NB and LR produce identical classifiers
  - When the model assumptions are incorrect: LR is less biased, since it does not assume conditional independence, and is therefore expected to outperform NB

Naïve Bayes vs Logistic Regression
• Non-asymptotic analysis (see [Ng & Jordan, 2002])
  - Convergence rate of parameter estimates: how many training examples are needed to assure good estimates?
    NB: order log m (where m = # of attributes in X)
    LR: order m
  - NB converges more quickly to its (perhaps less helpful) asymptotic estimates

Rate of convergence: logistic regression
• Let $h_{Dis,n}$ be logistic regression trained on n examples in m dimensions. Then with high probability:
  $$\epsilon(h_{Dis,n}) \le \epsilon(h_{Dis,\infty}) + O\left(\sqrt{\frac{m}{n} \log \frac{n}{m}}\right)$$
• Implication: if we want $\epsilon(h_{Dis,n}) \le \epsilon(h_{Dis,\infty}) + \epsilon_0$ for some small constant $\epsilon_0$, it suffices to pick order m examples
  - So logistic regression converges to its asymptotic classifier in order m examples
  - The result follows from Vapnik's structural risk bound, plus the fact that the "VC dimension" of m-dimensional linear separators is m

Rate of convergence: naïve Bayes parameters
• Let any $\epsilon_1, \delta > 0$ and any $n \ge 0$ be fixed. Assume that for some fixed $\rho_0 > 0$, we have that
• Let
• Then with probability at least 1 − δ, after n examples:
  1. For discrete inputs, for all i and b
  2. For continuous inputs, for all i and b

Some experiments from UCI data sets
[Figure: experimental results on UCI data sets]

Case study
• Dataset:
  - 20 News Groups (20 classes)
  - 61,188 words, 18,774 documents
• Experiment:
  - Solve only a two-class subset: 1 vs 2
  - 1,768 instances, 61,188 features
  - Use dimensionality reduction on the data (SVD)
  - Use 90% as training set, 10% as test set
  - Test prediction error used as accuracy measure

Generalization error (1)
• Versus training size
  - 30 features
  - A fixed test set
  - Training set varied from 10% to 100% of the training set

Generalization error (2)
• Versus model size
  - Number of dimensions of the data varied from 5 to 50 in steps of 5
  - The features were chosen in decreasing order of their singular values
  - 90% versus 10% split on training and test

Summary
• Naïve Bayes classifier
  - What is the assumption
  - Why we use it
  - How we learn it
• Logistic regression
  - Functional form follows from Naïve Bayes assumptions
    For Gaussian Naïve Bayes, assuming class-independent variance
    For discrete-valued Naïve Bayes too
  - But the training procedure picks parameters without the conditional independence assumption
• Gradient ascent/descent
  - General approach when closed-form solutions are unavailable
• Generative vs. Discriminative classifiers
  - Bias vs. variance tradeoff

Appendix

Parameter Learning from iid Data
• Goal: estimate distribution parameters θ from a dataset of N independent, identically distributed (iid), fully observed training cases $D = \{x_1, \dots, x_N\}$
• Maximum likelihood estimation (MLE)
  1. One of the most common estimators
  2. With the iid and full-observability assumptions, write L(θ) as the likelihood of the data:
  $$L(\theta) = P(x_1, x_2, \dots, x_N; \theta) = P(x_1; \theta) P(x_2; \theta) \cdots P(x_N; \theta) = \prod_{i=1}^{N} P(x_i; \theta)$$
  3. Pick the setting of parameters most likely to have generated the data we saw:
  $$\theta^* = \arg\max_\theta L(\theta) = \arg\max_\theta \log L(\theta)$$

Example: Bernoulli model
• Data: we observed N iid coin tosses: D = {1, 0, 1, …, 0}
• Representation: binary r.v. $x_n \in \{0, 1\}$
• Model:
  $$P(x) = \begin{cases} 1 - \theta & \text{for } x = 0 \\ \theta & \text{for } x = 1 \end{cases} \quad\Rightarrow\quad P(x) = \theta^x (1 - \theta)^{1 - x}$$
• How to write the likelihood of a single observation $x_i$?
  $$P(x_i) = \theta^{x_i} (1 - \theta)^{1 - x_i}$$
• The likelihood of the dataset $D = \{x_1, \dots, x_N\}$:
  $$P(x_1, x_2, \dots, x_N \mid \theta) = \prod_{i=1}^{N} P(x_i \mid \theta) = \prod_{i=1}^{N} \theta^{x_i} (1 - \theta)^{1 - x_i} = \theta^{\sum_{i=1}^{N} x_i} (1 - \theta)^{\sum_{i=1}^{N} (1 - x_i)} = \theta^{\#\text{head}} (1 - \theta)^{\#\text{tails}}$$

Maximum Likelihood Estimation
• Objective function:
  $$l(\theta; D) = \log P(D \mid \theta) = \log \theta^{n_h} (1 - \theta)^{n_t} = n_h \log \theta + (N - n_h) \log(1 - \theta)$$
• We need to maximize this w.r.t. θ
• Take derivatives w.r.t. θ:
  $$\frac{\partial l}{\partial \theta} = \frac{n_h}{\theta} - \frac{N - n_h}{1 - \theta} = 0 \quad\Rightarrow\quad \hat{\theta}_{MLE} = \frac{n_h}{N} \quad\text{or}\quad \hat{\theta}_{MLE} = \frac{1}{N} \sum_i x_i$$
  (frequency as sample mean)
• Sufficient statistics
  - The counts $n_h$, where $n_h = \sum_i x_i$, are sufficient statistics of data D

Overfitting
• Recall that for the Bernoulli distribution, we have
  $$\hat{\theta}_{ML}^{head} = \frac{n^{head}}{n^{head} + n^{tail}}$$
• What if we tossed too few times, so that we saw zero heads? We have $\hat{\theta}_{ML}^{head} = 0$, and we will predict that the probability of seeing a head next is zero!!!
• The rescue: "smoothing"
  $$\hat{\theta}_{ML}^{head} = \frac{n^{head} + n'}{n^{head} + n^{tail} + n'}$$
  - where n' is known as the pseudo- (imaginary) count
• But can we make this more formal?

Bayesian Parameter Estimation
• Treat the distribution parameters θ also as a random variable
• The a posteriori distribution of θ after seeing the data is:
  $$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)} = \frac{p(D \mid \theta)\, p(\theta)}{\int p(D \mid \theta)\, p(\theta)\, d\theta}$$
  This is Bayes' rule:
  $$\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{marginal likelihood}}$$
• The prior p(.) encodes our prior knowledge about the domain

Frequentist Parameter Estimation
• Two people with different priors p(θ) will end up with different estimates p(θ|D)
• Frequentists dislike this “subjectivity”
• Frequentists think of the parameter as a fixed, unknown constant, not a random variable
• Hence they have to come up with different "objective" estimators (ways of computing from data), instead of using Bayes' rule
  - These estimators have different properties, such as being “unbiased”, “minimum variance”, etc.
  - The maximum likelihood estimator is one such estimator

Discussion
• θ or p(θ), this is the problem! Bayesians know it

Bayesian estimation for Bernoulli
• Beta distribution:
  $$P(\theta; \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)} \theta^{\alpha - 1} (1 - \theta)^{\beta - 1} = \frac{1}{B(\alpha, \beta)} \theta^{\alpha - 1} (1 - \theta)^{\beta - 1}$$
• When x is a non-negative integer, $\Gamma(x + 1) = x\,\Gamma(x) = x!$
• Posterior distribution of θ:
  $$P(\theta \mid x_1, \dots, x_N) = \frac{p(x_1, \dots, x_N \mid \theta)\, p(\theta)}{p(x_1, \dots, x_N)} \propto \theta^{n_h} (1 - \theta)^{n_t} \times \theta^{\alpha - 1} (1 - \theta)^{\beta - 1} = \theta^{n_h + \alpha - 1} (1 - \theta)^{n_t + \beta - 1}$$
• Notice the isomorphism of the posterior to the prior; such a prior is called a conjugate prior
• α and β are hyperparameters (parameters of the prior) and correspond to the number of “virtual” heads/tails (pseudo-counts)

Bayesian estimation for Bernoulli, cont'd
• Posterior distribution of θ:
  $$P(\theta \mid x_1, \dots, x_N) = \frac{p(x_1, \dots, x_N \mid \theta)\, p(\theta)}{p(x_1, \dots, x_N)} \propto \theta^{n_h + \alpha - 1} (1 - \theta)^{n_t + \beta - 1}$$
• Maximum a posteriori (MAP) estimation:
  $$\theta_{MAP} = \arg\max_\theta \log P(\theta \mid x_1, \dots, x_N)$$
• Posterior mean estimation:
  $$\theta_{Bayes} = \int \theta\, p(\theta \mid D)\, d\theta = C \int \theta \times \theta^{n_h + \alpha - 1} (1 - \theta)^{n_t + \beta - 1}\, d\theta = \frac{n_h + \alpha}{N + \alpha + \beta}$$
  - The Beta parameters can be understood as pseudo-counts
• Prior strength: A = α + β
  - A can be interpreted as the size of an imaginary data set from which we obtain the pseudo-counts

Effect of Prior Strength
• Suppose we have a uniform prior (α' = β' = 1/2, with α = Aα' and β = Aβ'), and we observe $n = (n_h = 2, n_t = 8)$
• Weak prior A = 2. Posterior prediction:
  $$p(x = h \mid n_h = 2, n_t = 8, \alpha = \alpha' \times 2) = \frac{1 + 2}{2 + 10} = 0.25$$
• Strong prior A = 20. Posterior prediction:
  $$p(x = h \mid n_h = 2, n_t = 8, \alpha = \alpha' \times 20) = \frac{10 + 2}{20 + 10} = 0.40$$
• However, if we have enough data, it washes away the prior. E.g., $n = (n_h = 200, n_t = 800)$. Then the estimates under the weak and strong priors are $\frac{1 + 200}{2 + 1000}$ and $\frac{10 + 200}{20 + 1000}$, respectively, both of which are close to 0.2

Example 2: Gaussian density
• Data: we observed N iid real samples: D = {-0.1, 10, 1, -5.2, …, 3}
• Model:
  $$P(x) = (2\pi\sigma^2)^{-1/2} \exp\left\{-(x - \mu)^2 / 2\sigma^2\right\}$$
• Log likelihood:
  $$l(\theta; D) = \log P(D \mid \theta) = -\frac{N}{2} \log(2\pi\sigma^2) - \frac{1}{2} \sum_{n=1}^{N} \frac{(x_n - \mu)^2}{\sigma^2}$$
• MLE: take derivatives and set them to zero:
  $$\frac{\partial l}{\partial \mu} = \frac{1}{\sigma^2} \sum_n (x_n - \mu), \qquad \frac{\partial l}{\partial \sigma^2} = -\frac{N}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_n (x_n - \mu)^2$$
  $$\mu_{MLE} = \frac{1}{N} \sum_n x_n, \qquad \sigma^2_{MLE} = \frac{1}{N} \sum_n (x_n - \mu_{ML})^2$$

MLE for a multivariate-Gaussian
• It can be shown that the MLE for µ and Σ is
  $$\mu_{MLE} = \frac{1}{N} \sum_n x_n, \qquad \Sigma_{MLE} = \frac{1}{N} \sum_n (x_n - \mu_{ML})(x_n - \mu_{ML})^T = \frac{1}{N} S$$
  where the scatter matrix is
  $$S = \sum_n (x_n - \mu_{ML})(x_n - \mu_{ML})^T = \left(\sum_n x_n x_n^T\right) - N \mu_{ML} \mu_{ML}^T$$
  with
  $$x_n = \begin{pmatrix} x_n^1 \\ x_n^2 \\ \vdots \\ x_n^K \end{pmatrix}, \qquad X = \begin{pmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_N^T \end{pmatrix}$$
• The sufficient statistics are $\sum_n x_n$ and $\sum_n x_n x_n^T$
• Note that $X^T X = \sum_n x_n x_n^T$ may not be full rank (e.g., if N < K, the dimension of $x_n$), in which case $\Sigma_{ML}$ is not invertible

Bayesian estimation
• Normal prior:
  $$P(\mu) = (2\pi\sigma_0^2)^{-1/2} \exp\left\{-(\mu - \mu_0)^2 / 2\sigma_0^2\right\}$$
• Joint probability:
  $$P(x, \mu) = (2\pi\sigma^2)^{-N/2} \exp\left\{-\frac{1}{2\sigma^2} \sum_{n=1}^{N} (x_n - \mu)^2\right\} \times (2\pi\sigma_0^2)^{-1/2} \exp\left\{-(\mu - \mu_0)^2 / 2\sigma_0^2\right\}$$
• Posterior:
  $$P(\mu \mid x) = (2\pi\tilde{\sigma}^2)^{-1/2} \exp\left\{-(\mu - \tilde{\mu})^2 / 2\tilde{\sigma}^2\right\}$$
  where
  $$\tilde{\mu} = \frac{N/\sigma^2}{N/\sigma^2 + 1/\sigma_0^2}\, \bar{x} + \frac{1/\sigma_0^2}{N/\sigma^2 + 1/\sigma_0^2}\, \mu_0, \qquad \tilde{\sigma}^2 = \left(\frac{N}{\sigma^2} + \frac{1}{\sigma_0^2}\right)^{-1}$$
  and $\bar{x}$ is the sample mean

Bayesian estimation: unknown µ, known σ
  $$\mu_N = \frac{N/\sigma^2}{N/\sigma^2 + 1/\sigma_0^2}\, \bar{x} + \frac{1/\sigma_0^2}{N/\sigma^2 + 1/\sigma_0^2}\, \mu_0, \qquad \sigma_N^2 = \left(\frac{N}{\sigma^2} + \frac{1}{\sigma_0^2}\right)^{-1}$$
• The posterior mean is a convex combination of the prior mean and the MLE, with weights proportional to their relative precisions
• The precision of the posterior $1/\sigma_N^2$ is the precision of the prior $1/\sigma_0^2$ plus one contribution of data precision $1/\sigma^2$ for each observed data point
• Sequentially updating the mean
  - µ∗ = 0.8 (unknown), (σ²)∗ = 0.1 (known)
  - Effect of a single data point:
  $$\mu_1 = \mu_0 + (x - \mu_0) \frac{\sigma_0^2}{\sigma^2 + \sigma_0^2} = x - (x - \mu_0) \frac{\sigma^2}{\sigma^2 + \sigma_0^2}$$
• Uninformative (vague/flat) prior, $\sigma_0^2 \to \infty$:
  $$\mu_N \to \bar{x}$$
© Eric Xing @ CMU, 2006-2010
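
The closed-form posterior for a Gaussian mean with known variance, and the single-data-point update, can be checked against each other numerically. The sketch below is an illustration under stated assumptions rather than code from the lecture: the function names (`posterior_batch`, `posterior_sequential`), the sample values, and the prior settings `mu0` and `var0` are hypothetical. Only the known variance 0.1 is taken from the slide.

```python
def posterior_batch(xs, mu0, var0, var):
    """Closed-form posterior N(mu_N, var_N) for the mean, given known data variance `var`."""
    n = len(xs)
    xbar = sum(xs) / n
    precision = n / var + 1.0 / var0          # posterior precision 1 / var_N
    mu_n = (n / var) / precision * xbar + (1.0 / var0) / precision * mu0
    return mu_n, 1.0 / precision


def posterior_sequential(xs, mu0, var0, var):
    """Apply the single-data-point update once per sample, reusing the posterior as the next prior."""
    mu, v = mu0, var0
    for x in xs:
        mu = mu + (x - mu) * v / (var + v)    # mu_1 = mu_0 + (x - mu_0) * var0 / (var + var0)
        v = 1.0 / (1.0 / v + 1.0 / var)       # precisions add
    return mu, v


# Hypothetical samples; known variance 0.1 as in the slide's example.
data = [0.9, 0.7, 1.1, 0.6]
batch = posterior_batch(data, mu0=0.0, var0=1.0, var=0.1)
sequential = posterior_sequential(data, mu0=0.0, var0=1.0, var=0.1)
flat = posterior_batch(data, mu0=0.0, var0=1e12, var=0.1)
print(batch, sequential, flat)
```

The batch and sequential results should agree up to floating-point error, and with a very large `var0` the posterior mean should sit essentially on the sample mean, matching the flat-prior limit stated above.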
