Linear Classification
Ron Parr
CPS 271
With content adapted from Andrew Ng, Lise Getoor, and Tom Dietterich
Figures from textbook courtesy of Chris Bishop and © Chris Bishop

Classification
• Supervised learning framework
• Features can be anything
• Targets are discrete classes:
  – Safe mushrooms vs. poisonous
  – Malignant vs. benign
  – Good credit risk vs. bad
• Can we treat classes as numbers?
  – Single class?
  – Multi class?

Representing Classes
• Interpret $t^{(i)}$ as the probability that the ith element is in a particular class
• Classes usually disjoint
• For multiclass, $t^{(i)}$ is a vector
• $t^{(i)}[j] = t^{(i)}_j = 1$ if the ith element is in class j, 0 otherwise
• Notation: For convenience, we will sometimes refer to the "raw" variables x, rather than the inputs as seen through the lens of our features, φ

Decision Boundaries
• A classifier can be viewed as partitioning the input space or feature space X into decision regions

What is a Linear Discriminant?
• Simplest kind of classifier, a linear threshold unit (LTU):
$$y(x) = \begin{cases} 1 & \text{if } w_1 x_1 + \cdots + w_n x_n \ge w_0 \\ 0 & \text{otherwise} \end{cases}$$
• We sometimes assume a constant feature $x_0 = 1$, so the threshold is absorbed into the weights and the discriminant is $y(x) = w^T x$, thresholded at 0
• A linear discriminant is an n-1 dimensional hyperplane
• w is orthogonal to this
• We'll look at three algorithms, all of which learn linear decision boundaries:
  – Directly learn the LTU: Using Least Mean Square (LMS) algorithm
  – Learn the conditional distribution: Logistic regression
  – Learn the joint distribution: Linear discriminant analysis (LDA)

What can be expressed?
• Examples of things that can be expressed (assume n Boolean (0/1) features):
  – Conjunctions:
    • x1 ∧ x3 ∧ x4: 1⋅x1 + 0⋅x2 + 1⋅x3 + 1⋅x4 ≥ 3
    • x1 ∧ ¬x3 ∧ x4: 1⋅x1 + 0⋅x2 + (−1)⋅x3 + 1⋅x4 ≥ 2
  – at-least-m-of-n:
    • at-least-2-of(x1, x2, x4): 1⋅x1 + 1⋅x2 + 0⋅x3 + 1⋅x4 ≥ 2
• Examples of things that cannot be expressed:
  – Non-trivial disjunctions:
    • (x1 ∧ x2) ∨ (x3 ∧ x4)
[Figure: points labeled 0 and 1 in the (x1, x2) plane]
• A linear threshold unit always produces a linear decision boundary.
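
The expressible examples above can be checked by brute force. The following is a minimal sketch, not part of the original slides: the helper names `ltu` and `agrees` are hypothetical, and the weights and thresholds are copied from the bullets above. It enumerates all 16 settings of four Boolean features and confirms that each threshold unit matches the Boolean formula it is meant to represent.

```python
from itertools import product

def ltu(weights, threshold, x):
    """Linear threshold unit: 1 if w . x >= threshold, else 0."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) >= threshold else 0

# (weights, threshold) pairs taken from the examples above; features are x1..x4.
conj = ([1, 0, 1, 1], 3)         # x1 AND x3 AND x4
conj_neg = ([1, 0, -1, 1], 2)    # x1 AND (NOT x3) AND x4
at_least_2 = ([1, 1, 0, 1], 2)   # at least 2 of (x1, x2, x4)

def agrees(unit, target):
    """True if the LTU equals the Boolean target on every 0/1 input."""
    weights, threshold = unit
    return all(
        ltu(weights, threshold, x) == int(target(x))
        for x in product([0, 1], repeat=4)
    )

conj_ok = agrees(conj, lambda x: x[0] and x[2] and x[3])
conj_neg_ok = agrees(conj_neg, lambda x: x[0] and not x[2] and x[3])
at_least_2_ok = agrees(at_least_2, lambda x: x[0] + x[1] + x[3] >= 2)
print(conj_ok, conj_neg_ok, at_least_2_ok)
```

Exhaustive enumeration is only practical for a handful of Boolean features, but it makes the correspondence between weights, threshold, and formula concrete.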
• A set of points that can be separated by a linear decision boundary is linearly separable.
• Exclusive-or is not linearly separable, so a linear threshold unit cannot express it:
  – (x1 ∧ ¬x2) ∨ (¬x1 ∧ x2)

Non-linearly separable example
[Figure: points labeled 0 and 1 in the (x1, x2) plane that no single line separates]

Multiclass
• k classes
• O(k²) one vs. one classifiers
  – Expensive
  – May not be consistent
• k-1 one vs. rest classifiers
  – Less expensive
  – Still may not be consistent
• k linear functions
  – Assign x to class j if $w_j^T x > w_i^T x$ for all i ≠ j
  – Gives convex, singly connected decision regions
  – How to pick the linear functions?

Why not use regression?
• Regression minimizes sum of squared errors on target function
• Gives strong influence to outliers

The "Neural" Story (Part I)
• Nice to justify machine learning w/nature
• Naïve introspection works badly
• Neural model biologically plausible
• Single neuron, linear threshold unit = perceptron
• (Longer rant on this later…)

Perceptron
[Figure: inputs x_j, weighted by w_j, feed a node/neuron that applies f to produce the output Y]
• f is a simple step function (sgn)

Perceptron Learning
• We are given a set of inputs $x^{(1)} \ldots x^{(n)}$
• $t^{(1)} \ldots t^{(n)}$ is a set of target outputs (boolean) {-1,1}
• w is our set of weights
• Net input of the perceptron: $\text{net}(x, w) = w^T x$; the output applies f to this
• $\text{perceptron\_error}(x^{(i)}, w) = -\text{net}(x^{(i)}, w)\, t^{(i)}$
• Goal: Pick w to optimize:
$$\min_w \sum_{i \in \text{misclassified}} \text{perceptron\_error}(x^{(i)}, w)$$
• Update rule:
$$\forall j,\ \forall i \in \text{misclassified}: \quad w_j \leftarrow w_j + \alpha\, x_j^{(i)} t^{(i)}$$

Perceptron Learning Properties (LTU Properties)
• Good news:
  – If there exists a set of weights that will correctly classify every example, the perceptron learning rule will find it
  – Does not depend on step size
• Bad news:
  – Perceptrons can represent only a small class of functions: the "linearly separable" functions
  – May oscillate if not separable
  – No obvious generalization for multiclass

Why this form?
• One reason for the logistic form: it transforms a linear function with range (−∞, +∞) into values that are positive and sum to 1 over the classes, so that it can represent a probability
[Figure: the logistic function plotted for inputs from −10 to 10; its value rises from 0 to 1]

Logistic Regression
• In logistic regression, we learn the conditional distribution P(t|x)
• Let $p_t(x; w)$ be our estimate of P(t|x), where w is a vector of adjustable parameters.
• Assume there are two classes, t = 0 and t = 1, and
$$p_1(x; w) = \frac{e^{w^T x}}{1 + e^{w^T x}}, \qquad p_0(x; w) = 1 - p_1(x; w) = \frac{1}{1 + e^{w^T x}}$$
• This is equivalent to
$$\log \frac{p_1(x; w)}{p_0(x; w)} = w^T x$$
• In other words, the log odds of class 1 is a linear function of x

Constructing a Learning Algorithm
• Find the probability distribution h that is most likely, given the data.
$$\begin{aligned}
\arg\max_{h_w} P(h_w \mid X) &= \arg\max_{h_w} \frac{P(X \mid h_w)\, P(h_w)}{P(X)} && \text{by Bayes' Rule} \\
&= \arg\max_{h_w} P(X \mid h_w)\, P(h_w) && \text{because } P(X) \text{ doesn't depend on } h \\
&= \arg\max_{h_w} P(X \mid h_w) && \text{if we assume } P(h) \text{ is uniform} \\
&= \arg\max_{h_w} \log P(X \mid h_w) && \text{because log is monotone}
\end{aligned}$$
• The likelihood function views $P(X \mid h_w)$ as a function of the parameters in the model. In this case, our parameters are the weights, w: $L(w; X) = P(X \mid h_w)$
• The log likelihood is a commonly used objective function for learning algorithms. It is denoted $l(w; X)$
• The w that maximizes the likelihood of the training data is called the maximum likelihood estimator

Log Likelihood for Conditional Probability Estimators
• We can express the log likelihood in a compact form called the cross-entropy
• Take an example $(x^{(i)}, t^{(i)})$
  – if $t^{(i)} = 0$, the log likelihood is $\log(1 - p_1(x^{(i)}; w))$
  – if $t^{(i)} = 1$, the log likelihood is $\log p_1(x^{(i)}; w)$
• These two are mutually exclusive, so we can combine them to get:
$$l(w; x^{(i)}, t^{(i)}) = \log P(t^{(i)} \mid x^{(i)}, w) = (1 - t^{(i)}) \log\left[1 - p_1(x^{(i)}; w)\right] + t^{(i)} \log p_1(x^{(i)}; w)$$
• The goal of our learning algorithm will be to find w to maximize:
$$J(w) = l(w; X, t) = \sum_i l(w; x^{(i)}, t^{(i)})$$
Computing the Gradient
• The gradient of the objective is a sum over examples:
$$\frac{\partial J(w)}{\partial w_j} = \sum_i \frac{\partial}{\partial w_j}\, l(w; t^{(i)}, x^{(i)})$$
• For a single example, writing $p_1$ for $p_1(x^{(i)}; w)$:
$$\begin{aligned}
\frac{\partial}{\partial w_j}\, l(w; t^{(i)}, x^{(i)}) &= \frac{\partial}{\partial w_j} \left( (1 - t^{(i)}) \log\left[1 - p_1\right] + t^{(i)} \log p_1 \right) \\
&= (1 - t^{(i)}) \frac{1}{1 - p_1} \left( -\frac{\partial p_1}{\partial w_j} \right) + t^{(i)} \frac{1}{p_1} \frac{\partial p_1}{\partial w_j} \\
&= \left[ \frac{t^{(i)}}{p_1} - \frac{1 - t^{(i)}}{1 - p_1} \right] \frac{\partial p_1}{\partial w_j} \\
&= \left[ \frac{t^{(i)} (1 - p_1) - (1 - t^{(i)})\, p_1}{p_1 (1 - p_1)} \right] \frac{\partial p_1}{\partial w_j} \\
&= \left[ \frac{t^{(i)} - p_1}{p_1 (1 - p_1)} \right] \frac{\partial p_1}{\partial w_j}
\end{aligned}$$

Gradient cont.
• Another way of writing the logistic regression function is:
$$p_1(x^{(i)}; w) = \frac{1}{1 + e^{-w^T x^{(i)}}}$$
• So we get:
$$\begin{aligned}
\frac{\partial p_1(x^{(i)}; w)}{\partial w_j} &= -\frac{1}{(1 + e^{-w^T x^{(i)}})^2} \frac{\partial}{\partial w_j} \left( 1 + e^{-w^T x^{(i)}} \right) \\
&= -\frac{1}{(1 + e^{-w^T x^{(i)}})^2}\, e^{-w^T x^{(i)}} \frac{\partial}{\partial w_j} \left( -w^T x^{(i)} \right) \\
&= -\frac{1}{(1 + e^{-w^T x^{(i)}})^2}\, e^{-w^T x^{(i)}} \left( -x_j^{(i)} \right) \\
&= p_1(x^{(i)}; w) \left( 1 - p_1(x^{(i)}; w) \right) x_j^{(i)}
\end{aligned}$$

Gradient cont.
• The gradient of the log likelihood for a single point is:
$$\begin{aligned}
\frac{\partial}{\partial w_j}\, l(w; x^{(i)}, t^{(i)}) &= \left[ \frac{t^{(i)} - p_1(x^{(i)}; w)}{p_1(x^{(i)}; w)(1 - p_1(x^{(i)}; w))} \right] \frac{\partial p_1(x^{(i)}; w)}{\partial w_j} \\
&= \left[ \frac{t^{(i)} - p_1(x^{(i)}; w)}{p_1(x^{(i)}; w)(1 - p_1(x^{(i)}; w))} \right] p_1(x^{(i)}; w)(1 - p_1(x^{(i)}; w))\, x_j^{(i)} \\
&= \left( t^{(i)} - p_1(x^{(i)}; w) \right) x_j^{(i)}
\end{aligned}$$
• The overall gradient is:
$$\frac{\partial J(w)}{\partial w_j} = \sum_i \left( t^{(i)} - p_1(x^{(i)}; w) \right) x_j^{(i)}$$
• Compare w/perceptron rule!

Summary of Logistic Regression
• Learns the Conditional Probability Distribution P(t|x)
• No closed form solution
• Very simple expression for gradient

What We Already Know
• Linear Threshold Unit (LTU)
  – Tries to discover a linear function (in feature space) that separates positive and negative examples
• Logistic Regression
  – Solve by local search:
    • Begins with initial weight vector.
    • Gradient ascent to maximize objective function.
    • Objective function is the log likelihood of the data
    • Algorithm seeks the probability distribution P(t|x) that is most likely given the data.
  – May be done online or in batch
  – Can be used with acceleration methods (Newton-Raphson, etc.)
  – Uses regression to estimate the function $\log \frac{p_1(x; w)}{p_0(x; w)} = w^T x$

Density Estimation
• Basic unsupervised learning technique
• Discussed here in context of classification
• Idea: Estimate joint probability of features and class labels

Discrete Case
• Suppose we know $P(X_1 \ldots X_n, T)$
• How do we get this?
• Maximum likelihood estimate comes from counting (relative frequency)
• Bernoulli distribution

Betting on t
• We see a new $x_1 \ldots x_n$
• What is our guess for t?
• Assuming:
  – Binary loss function
  – Choices: $t_0$, $t_1$
• Favor $t_0$ when $P(t_0 \mid x_1 \ldots x_n) > P(t_1 \mid x_1 \ldots x_n)$
• Use definition of conditional probability:
$$P(t_0 \mid x_1 \ldots x_n) = \frac{P(t_0, x_1 \ldots x_n)}{P(x_1 \ldots x_n)}$$

So, are we done???
• How many parameters needed for joint?
• Is this practical?
• Simplification (Naïve Bayes):
$$P(X_1 \ldots X_n \mid t) = \prod_i P(X_i \mid t)$$
• Q: How is this more practical?

Is Naïve Bayes Reasonable?
• Are features correlated within classes?
• How would it hurt us if they were?

Naïve Bayes in Action
• Spam filtering
• $X_1 \ldots X_n$: Spam related features
• t: Spam label
• Combine Bayes Rule w/Naïve Bayes:
$$P(t \mid X_1 \ldots X_n) = \frac{P(X_1 \ldots X_n \mid t)\, P(t)}{P(X_1 \ldots X_n)} = \frac{\prod_i P(X_i \mid t)\, P(t)}{P(X_1 \ldots X_n)}$$
• Things to note: Do we worry about $P(X_1 \ldots X_n)$? Influence of P(t)?
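
The counting estimate and the Naïve Bayes posterior above can be sketched in a few lines. This is an illustrative example under stated assumptions, not the slides' own code: the toy data set, the two made-up binary "spam word" features, and the function names `fit` and `posterior` are all hypothetical. Probabilities are plain relative frequencies (maximum likelihood, no smoothing), so it assumes every class appears in the data.

```python
from collections import Counter

# Hypothetical toy data: (features, label); label 1 = spam, 0 = not spam.
data = [
    ((1, 1), 1), ((1, 0), 1), ((1, 1), 1), ((0, 1), 1),
    ((0, 0), 0), ((0, 1), 0), ((0, 0), 0), ((1, 0), 0),
]

def fit(data):
    """Maximum likelihood estimates by counting relative frequencies."""
    class_counts = Counter(t for _, t in data)
    n_features = len(data[0][0])
    prior = {t: c / len(data) for t, c in class_counts.items()}
    # cond[t][i] estimates P(X_i = 1 | t)
    cond = {
        t: [sum(x[i] for x, label in data if label == t) / class_counts[t]
            for i in range(n_features)]
        for t in class_counts
    }
    return prior, cond

def posterior(x, prior, cond):
    """P(t | x) from Bayes rule with the Naive Bayes factorization."""
    joint = {}
    for t in prior:
        p = prior[t]
        for xi, p_one in zip(x, cond[t]):
            p *= p_one if xi == 1 else 1.0 - p_one
        joint[t] = p                    # prod_i P(x_i | t) * P(t)
    evidence = sum(joint.values())      # P(x): the same for every class
    return {t: joint[t] / evidence for t in joint}

prior, cond = fit(data)
post = posterior((1, 1), prior, cond)
guess = max(post, key=post.get)         # class with the largest posterior
print(post, guess)
```

Because the evidence term is the same for every class, it only rescales the posterior and never changes which class wins; the prior P(t) does shift the decision. On the practicality question: with n binary features and a binary label, the full joint needs 2^(n+1) − 1 free parameters, while the Naïve Bayes factorization needs 2n + 1.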
Linear Discriminant Analysis
• In LDA, we learn the distribution P(x|t)
• We assume that x is continuous
• We assume P(x|t) is distributed according to a multivariate normal distribution and P(t) is a discrete distribution
• More on this when we discuss Bayesian networks

Estimating the MVG parameters
• Given a set of data points $\{x^{(1)}, \ldots, x^{(N)}\}$, the maximum likelihood estimates for the parameters of the MVG are:
$$\hat{\mu} = \frac{1}{N} \sum_i x^{(i)}, \qquad \hat{\Sigma} = \frac{1}{N} \sum_i (x^{(i)} - \hat{\mu})(x^{(i)} - \hat{\mu})^T$$

Putting it all together in LDA
• Also called Gaussian Discriminant Analysis
• Here:
  – $t \sim \text{Bernoulli}(w)$
  – $x \mid t = 0 \sim N(\mu_0, \Sigma)$
  – $x \mid t = 1 \sim N(\mu_1, \Sigma)$
• Writing this out, we get:
$$p(x \mid t = 0) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_0)^T \Sigma^{-1} (x - \mu_0) \right)$$
$$p(x \mid t = 1) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_1)^T \Sigma^{-1} (x - \mu_1) \right)$$

Picking A Class
• Recall we assumed Σ same for all classes
• When is $P(t_0 \mid x) > P(t_1 \mid x)$???
• We again use Bayes rule:
$$P(t \mid X) = \frac{P(X \mid t)\, P(t)}{P(X)}$$
  – $P(t \mid X)$: posterior label probability
  – $P(X \mid t)$: MVG conditional feature probability
  – $P(t)$: prior class probability
  – $P(X)$: prior feature probability (ignored)

The Beauty of Homoscedasticity
• $P(t_0 \mid x) > P(t_1 \mid x)$ when
$$\frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_0)^T \Sigma^{-1} (x - \mu_0) \right) p(t_0) > \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_1)^T \Sigma^{-1} (x - \mu_1) \right) p(t_1)$$
• Cancelling the shared normalizer, taking logs, and multiplying by −2 (which flips the inequality) gives
$$(x - \mu_0)^T \Sigma^{-1} (x - \mu_0) < (x - \mu_1)^T \Sigma^{-1} (x - \mu_1) + k, \qquad k = 2 \log \frac{p(t_0)}{p(t_1)}$$
• The quadratic term $x^T \Sigma^{-1} x$ appears on both sides and cancels, so the condition is linear in x
• Linear!!!

Example
[Figure: two-class example in two dimensions; horizontal axis from −2 to 7, vertical axis from −7 to 1]
• The decision boundary is at p(t=1|x) = 0.5

Homoscedastic LDA Discussion
• For multiclass, this gives convex decision regions
• Satisfies desiderata for multiclass decision boundaries
• How realistic is this?
• What do we give up?

Heteroscedastic Distributions

Comparing LTU, LR, LDA
• Big debate about the relative merits of
  – direct classifiers (like LTU) versus
  – conditional models (like LR) versus
  – generative models (like LDA)

LDA vs LR Issues
• What is the relationship?
  – In LDA, it turns out that p(t|x) can be expressed as a logistic function where the weights are some function of µ0, µ1, and Σ!
  – But, the converse is NOT true. If p(t|x) is a logistic function, that does not imply p(x|t) is MVG
• Statistical efficiency: if the generative model is correct, then it usually gives better accuracy, especially for small training sets.
• Computational efficiency: generative models typically are the easiest to compute. In LDA, we estimated the parameters directly, no need for gradient ascent
• Robustness to changing loss function: Both generative and conditional models allow the loss function to change without reestimating the model. This is not true for direct LTU methods
• Robustness to model assumptions: The generative model usually performs poorly when the assumptions are violated.
• Robustness to missing values and noise: In many applications, some of the features $x^{(i)}_j$ may be missing or corrupted for some training examples. Generative models provide better ways of handling this than non-generative models.
• LDA makes stronger modeling assumptions than LR
  – when these modeling assumptions are correct, LDA will perform better
    • LDA is asymptotically efficient: in the limit of very large training sets, there is no algorithm that is strictly better than LDA
  – however, when these assumptions are incorrect, LR is more robust
    • weaker assumptions, more robust to deviations from modeling assumptions
    • if the data are non-Gaussian, then in the limit, logistic outperforms LDA
• For this reason, LR is a more commonly used algorithm
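
The relationship between the two models can be illustrated with a small experiment. This is a hedged sketch under several assumptions that are not in the slides: the synthetic 2-D data, the pooled covariance estimate (residuals taken around each point's own class mean), a separate bias term b instead of a constant feature, the step size, and all function names. It fits LDA in closed form, converts it to logistic weights w = Σ⁻¹(µ1 − µ0), and separately fits logistic regression by batch gradient ascent with the gradient Σ_i (t_i − p1(x_i)) x_i.

```python
import math
import random

def sigmoid(z):
    """Numerically stable logistic function 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

def matvec(m, v):
    return (dot(m[0], v), dot(m[1], v))

def inv2(m):
    """Inverse of a 2x2 matrix given as ((a, b), (c, d))."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return ((d / det, -b / det), (-c / det, a / det))

# Hypothetical synthetic data: two 2-D Gaussian classes with a shared covariance.
rng = random.Random(0)
def sample(mean, n):
    return [(mean[0] + rng.gauss(0, 1), mean[1] + rng.gauss(0, 1)) for _ in range(n)]
X = sample((0.0, 0.0), 100) + sample((2.0, 1.0), 100)
T = [0] * 100 + [1] * 100

def fit_lda(X, T):
    """Closed-form ML estimates: class means, pooled covariance, P(t = 1)."""
    n = len(X)
    mus = []
    for c in (0, 1):
        pts = [x for x, t in zip(X, T) if t == c]
        mus.append((sum(p[0] for p in pts) / len(pts),
                    sum(p[1] for p in pts) / len(pts)))
    sxx = sxy = syy = 0.0
    for x, t in zip(X, T):
        dx, dy = x[0] - mus[t][0], x[1] - mus[t][1]
        sxx += dx * dx
        sxy += dx * dy
        syy += dy * dy
    sigma = ((sxx / n, sxy / n), (sxy / n, syy / n))
    return mus[0], mus[1], sigma, sum(T) / n

def lda_as_logistic(mu0, mu1, sigma, prior1):
    """(w, b) such that P(t=1 | x) = sigmoid(w . x + b) under the LDA model."""
    s_inv = inv2(sigma)
    w = matvec(s_inv, (mu1[0] - mu0[0], mu1[1] - mu0[1]))
    b = (0.5 * dot(mu0, matvec(s_inv, mu0)) - 0.5 * dot(mu1, matvec(s_inv, mu1))
         + math.log(prior1 / (1.0 - prior1)))
    return w, b

def lda_posterior(x, mu0, mu1, sigma, prior1):
    """P(t=1 | x) straight from Bayes rule; the shared Gaussian normalizer cancels."""
    s_inv = inv2(sigma)
    def weighted_density(mu, prior):
        d = (x[0] - mu[0], x[1] - mu[1])
        return prior * math.exp(-0.5 * dot(d, matvec(s_inv, d)))
    p0 = weighted_density(mu0, 1.0 - prior1)
    p1 = weighted_density(mu1, prior1)
    return p1 / (p0 + p1)

def log_likelihood(w, b, X, T):
    total = 0.0
    for x, t in zip(X, T):
        p1 = sigmoid(dot(w, x) + b)
        total += math.log(p1) if t == 1 else math.log(1.0 - p1)
    return total

def fit_logistic(X, T, alpha=0.001, steps=2000):
    """Batch gradient ascent using the gradient sum_i (t_i - p1(x_i)) x_i."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(steps):
        g0 = g1 = gb = 0.0
        for x, t in zip(X, T):
            err = t - sigmoid(dot(w, x) + b)
            g0 += err * x[0]
            g1 += err * x[1]
            gb += err
        w[0] += alpha * g0
        w[1] += alpha * g1
        b += alpha * gb
    return tuple(w), b

mu0, mu1, sigma, prior1 = fit_lda(X, T)
w_lda, b_lda = lda_as_logistic(mu0, mu1, sigma, prior1)
w_lr, b_lr = fit_logistic(X, T)
print("LDA-derived weights:", w_lda, b_lda)
print("LR weights:         ", w_lr, b_lr)
```

The LDA weights come from a single pass over the data, while logistic regression needs an iterative search. On data that really is Gaussian with a shared covariance, the two weight vectors would be expected to land close to each other, though the sketch makes no claim about how close.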