CS490D — Chris Clifton (1/27/2012)

Bayesian Classification: Why?

* Probabilistic learning: calculate explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems.
* Incremental: each training example can incrementally increase or decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data.
* Probabilistic prediction: predict multiple hypotheses, weighted by their probabilities.
* Standard: even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured.

Bayesian Theorem: Basics

* Let X be a data sample whose class label is unknown.
* Let H be a hypothesis that X belongs to class C.
* For classification problems, determine P(H|X): the probability that the hypothesis holds given the observed data sample X.
* P(H): prior probability of hypothesis H (the initial probability before we observe any data; reflects the background knowledge).
* P(X): probability that the sample data is observed.
* P(X|H): probability of observing the sample X, given that the hypothesis holds.

Bayes' Theorem

Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes' theorem:

    P(H|X) = P(X|H) P(H) / P(X)

Informally: posterior = likelihood x prior / evidence.

The MAP (maximum a posteriori) hypothesis is

    h_MAP = argmax_{h \in H} P(h|D) = argmax_{h \in H} P(D|h) P(h)

Practical difficulty: this requires initial knowledge of many probabilities, at significant computational cost.

Naïve Bayes Classifier

A simplifying assumption: attributes are conditionally independent given the class,

    P(X|C_i) = \prod_{k=1}^{n} P(x_k | C_i)

* The probability of observing, say, the two attribute values y1 and y2 together, given the current class C, is the product of the probabilities of each value taken separately given that class: P([y1, y2] | C) = P(y1 | C) * P(y2 | C).
* No dependence relation between attributes is modeled.
* This greatly reduces the computation cost: only the class distributions need to be counted.
* Once P(X|Ci) is known, assign X to the class with maximum P(X|Ci) * P(Ci), as in the sketch below.
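A minimal sketch of this decision rule in Python. The probability tables passed in are hypothetical placeholders that a real implementation would estimate from training counts (with smoothing for unseen values):

    # Naive Bayes decision rule: choose the class Ci that maximizes
    # P(Ci) * prod_k P(x_k | Ci). A minimal sketch; the tables would normally
    # be estimated from training counts, and no smoothing is applied here.

    def classify(sample, priors, cond_probs):
        """sample: {attribute: value}; priors: {class: P(Ci)};
        cond_probs: {(class, attribute, value): P(value | Ci)}."""
        best_class, best_score = None, -1.0
        for c, prior in priors.items():
            score = prior
            for attr, value in sample.items():
                # conditional independence: multiply per-attribute probabilities
                score *= cond_probs.get((c, attr, value), 0.0)
            if score > best_score:
                best_class, best_score = c, score
        return best_class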
Training Dataset

* Classes: C1: buys_computer = 'yes'; C2: buys_computer = 'no'.
* Data sample X = (age <= 30, income = medium, student = yes, credit_rating = fair).

    age     income  student  credit_rating  buys_computer
    <=30    high    no       fair           no
    <=30    high    no       excellent      no
    31…40   high    no       fair           yes
    >40     medium  no       fair           yes
    >40     low     yes      fair           yes
    >40     low     yes      excellent      no
    31…40   low     yes      excellent      yes
    <=30    medium  no       fair           no
    <=30    low     yes      fair           yes
    >40     medium  yes      fair           yes
    <=30    medium  yes      excellent      yes
    31…40   medium  no       excellent      yes
    31…40   high    yes      fair           yes
    >40     medium  no       excellent      no

Naïve Bayesian Classifier: Example

Priors: P(buys_computer = "yes") = 9/14 = 0.643; P(buys_computer = "no") = 5/14 = 0.357.

Compute P(X|Ci) for each class:
* P(age = "<=30" | buys_computer = "yes") = 2/9 = 0.222
* P(age = "<=30" | buys_computer = "no") = 3/5 = 0.6
* P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
* P(income = "medium" | buys_computer = "no") = 2/5 = 0.4
* P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
* P(student = "yes" | buys_computer = "no") = 1/5 = 0.2
* P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
* P(credit_rating = "fair" | buys_computer = "no") = 2/5 = 0.4

For X = (age <= 30, income = medium, student = yes, credit_rating = fair):

P(X|Ci):
* P(X | buys_computer = "yes") = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
* P(X | buys_computer = "no") = 0.6 x 0.4 x 0.2 x 0.4 = 0.019

P(X|Ci) * P(Ci):
* P(X | buys_computer = "yes") * P(buys_computer = "yes") = 0.044 x 0.643 = 0.028
* P(X | buys_computer = "no") * P(buys_computer = "no") = 0.019 x 0.357 = 0.007

X belongs to class "buys_computer = yes".

Naïve Bayesian Classifier: Comments

* Advantages: easy to implement; good results obtained in most of the cases.
* Disadvantages: the assumption of class-conditional independence causes a loss of accuracy, because in practice dependencies exist among variables. E.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.). Dependencies among these cannot be modeled by the naïve Bayesian classifier.
* How to deal with these dependencies? Bayesian belief networks. (See D. Heckerman, Bayesian networks for data mining.)

Bayesian Belief Networks

* A Bayesian belief network allows a subset of the variables to be conditionally independent.
* It is a graphical model of causal relationships: it represents dependencies among the variables and gives a specification of the joint probability distribution.
* Nodes are random variables; links express dependencies. The graph has no loops or cycles.
* Example: X and Y are the parents of Z, and Y is the parent of P; there is no dependency between Z and P.

Bayesian Belief Network: An Example

Network structure: FamilyHistory and Smoker are parents of LungCancer and Emphysema; LungCancer is a parent of PositiveXRay and Dyspnea.

The conditional probability table (CPT) for the variable LungCancer shows the conditional probability for each possible combination of values of its parents:

            (FH, S)  (FH, ~S)  (~FH, S)  (~FH, ~S)
    LC       0.8      0.5       0.7       0.1
    ~LC      0.2      0.5       0.3       0.9

The joint probability of any assignment factorizes over the network:

    P(z_1, ..., z_n) = \prod_{i=1}^{n} P(z_i | Parents(Z_i))

Learning Bayesian Networks

Several cases:
* Given both the network structure and all variables observable: learn only the CPTs.
* Network structure known, some hidden variables: method of gradient descent, analogous to neural network learning.
* Network structure unknown, all variables observable: search through the model space to reconstruct the graph topology.
* Unknown structure, all hidden variables: no good algorithms are known for this purpose.
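A minimal sketch of this factorization in Python, restricted to the FamilyHistory, Smoker, and LungCancer nodes. The LungCancer CPT is the one shown above; the priors for FamilyHistory and Smoker are not given on the slides and are made-up values for illustration only:

    # Joint probability in a Bayesian belief network, using the factorization
    # P(z1,...,zn) = prod_i P(zi | Parents(Zi)).
    # The LungCancer CPT is from the example above; the FamilyHistory and
    # Smoker priors are assumed values, not from the slides.

    P_FH = {True: 0.1, False: 0.9}   # assumed prior P(FamilyHistory)
    P_S  = {True: 0.3, False: 0.7}   # assumed prior P(Smoker)
    P_LC = {(True, True): 0.8, (True, False): 0.5,    # P(LC=yes | FH, S)
            (False, True): 0.7, (False, False): 0.1}  # from the CPT above

    def joint(fh, s, lc):
        """P(FamilyHistory=fh, Smoker=s, LungCancer=lc) via the chain rule."""
        p_lc = P_LC[(fh, s)] if lc else 1.0 - P_LC[(fh, s)]
        return P_FH[fh] * P_S[s] * p_lc

    # e.g. P(FH=yes, S=yes, LC=yes) = 0.1 * 0.3 * 0.8 = 0.024
    print(joint(True, True, True))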
Neural Networks

* Analogy to biological systems (indeed, a great example of a good learning system).
* Massive parallelism, allowing for computational efficiency.
* The first learning algorithm came in 1959 (Rosenblatt), who suggested that if a target output value is provided for a single neuron with fixed inputs, one can incrementally change the weights to learn to produce these outputs using the perceptron learning rule.

A Neuron

* An n-dimensional input vector x = (x_0, ..., x_n), a weight vector w = (w_0, ..., w_n), a bias -mu_k, a weighted sum, and an activation function f together produce the output y.
* The input vector x is mapped to the variable y by means of the scalar product and a nonlinear function mapping. For example:

    y = sign( \sum_{i=0}^{n} w_i x_i - mu_k )

Network Training

* The ultimate objective of training: obtain a set of weights that makes almost all the tuples in the training data classified correctly.
* Steps:
  * Initialize the weights with random values.
  * Feed the input tuples into the network one by one.
  * For each unit: compute the net input to the unit as a linear combination of all its inputs, then compute the output value using the activation function.
  * Compute the error.
  * Update the weights and the bias.

Multi-Layer Perceptron

Structure: an input vector x_i feeds the input nodes, which feed the hidden nodes, which feed the output nodes producing the output vector. Each unit j computes

    I_j = \sum_i w_{ij} O_i + theta_j,    O_j = 1 / (1 + e^{-I_j})

Errors are propagated backwards:

    Err_j = O_j (1 - O_j) (T_j - O_j)            (output unit j, with target T_j)
    Err_j = O_j (1 - O_j) \sum_k Err_k w_{jk}    (hidden unit j)

and the weights and biases are updated with learning rate l:

    w_{ij} := w_{ij} + (l) Err_j O_i
    theta_j := theta_j + (l) Err_j

(The remaining slides are based on I. H. Witten, E. Frank and M. A. Hall, Data Mining: Practical Machine Learning Tools and Techniques, Chapter 6.)

Multilayer Perceptrons

* Using kernels is only one way to build a nonlinear classifier based on perceptrons.
* Can create a network of perceptrons to approximate arbitrary target concepts.
* A multilayer perceptron is an example of an artificial neural network. It consists of an input layer, hidden layer(s), and an output layer.
* The structure of an MLP is usually found by experimentation.
* The parameters can be found using backpropagation.

Backpropagation

* How to learn the weights given the network structure? Cannot simply use the perceptron learning rule, because we have hidden layer(s).
* The function we are trying to minimize: the error. Can use a general function minimization technique called gradient descent.
* Need a differentiable activation function: use the sigmoid function instead of the threshold function.
* Need a differentiable error function: can't use zero-one loss, but can use the squared error.

Gradient Descent Example

* Function: x^2 + 1. Derivative: 2x. Learning rate: 0.1. Start value: 4.
* Gradient descent can only find a local minimum! (See the sketch below.)
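The example above is small enough to run directly. A minimal Python sketch (the step count of 50 is an arbitrary choice):

    # Gradient descent on f(x) = x**2 + 1, exactly as in the example above:
    # derivative f'(x) = 2x, learning rate 0.1, start value 4.

    def gradient_descent(start=4.0, learning_rate=0.1, steps=50):
        x = start
        for _ in range(steps):
            x -= learning_rate * 2 * x   # x <- x - eta * f'(x)
        return x

    print(gradient_descent())  # converges toward the minimum at x = 0, f(0) = 1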
Minimizing the Error I

Need to find the partial derivative of the error function with respect to each parameter (i.e., weight). With the squared error

    E = 1/2 (y - f(x))^2

and the net input into the output unit x = \sum_i w_i f(x_i) (a weighted sum of the hidden-unit outputs f(x_i)):

    df(x)/dw_i = f'(x) f(x_i)
    dE/dw_i = -(y - f(x)) f'(x) f(x_i)

The sigmoid f(x) = 1 / (1 + exp(-x)) is the differentiable activation function used here; its derivative is f'(x) = f(x) (1 - f(x)).

The Two Activation Functions

(Figure: the threshold (step) activation function and the sigmoid activation function.)

Minimizing the Error II

What about the weights of the connections from the input layer to the hidden layer? Apply the chain rule, where w_ij is the weight of the connection from input a_j into hidden unit i with net input x_i:

    dE/dw_ij = (dE/dx) (dx/dw_ij)
    dE/dx = -(y - f(x)) f'(x)
    dx/dw_ij = w_i df(x_i)/dw_ij = w_i f'(x_i) a_j

Hence

    dE/dw_ij = -(y - f(x)) f'(x) w_i f'(x_i) a_j

Remarks

* The same process works for multiple hidden layers and multiple output units (e.g., for multiple classes).
* Weights can be updated after all training instances have been processed, or incrementally: batch learning vs. stochastic backpropagation.
* Weights are initialized to small random values.
* How to avoid overfitting? Early stopping: use a validation set to check when to stop. Weight decay: add a penalty term to the error function.
* How to speed up learning? Momentum: re-use a proportion of the old weight change. Or use an optimization method that employs the second derivative.

Stochastic Gradient Descent

* Have seen gradient descent + stochastic backpropagation for learning the weights in a neural network.
* Gradient descent is a general-purpose optimization technique: it can be applied whenever the objective function is differentiable. Actually, it can be used even when the objective function is not completely differentiable (via subgradients).
* One application: learning linear models, e.g. linear SVMs or logistic regression. Learning linear models using gradient descent is easier than optimizing a nonlinear neural network, because the objective function has a global minimum rather than many local minima.
* Stochastic gradient descent is fast, uses little memory, and is suitable for incremental online learning.

Radial Basis Function Networks

* Another type of feedforward network, with two layers (plus the input layer).
* Hidden units represent points in instance space, and their activation depends on the distance to the input. To this end, distance is converted into similarity by a Gaussian activation function; the width may be different for each hidden unit.
* Points of equal activation form a hypersphere (or hyperellipsoid), as opposed to a hyperplane.
* The output layer is the same as in an MLP.

Learning RBF Networks

* Parameters: the centers and widths of the RBFs, plus the weights in the output layer.
* The two sets of parameters can be learned independently, and the models are still accurate: e.g., the clusters from k-means can be used to form the basis functions, and a linear model can then be built on the fixed RBFs. This makes learning RBF networks very efficient (see the sketch below).
* Disadvantage: no built-in attribute weighting based on relevance.
* RBF networks are related to RBF SVMs.
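A minimal Python/NumPy sketch of this two-stage recipe. The centers here are sampled training points standing in for the k-means step, and the width, toy data, and missing bias term are simplifying assumptions:

    import numpy as np

    # RBF network sketch: fixed centers (in practice found by k-means) plus a
    # linear output layer fit on the resulting basis functions.

    def rbf_features(X, centers, width=1.0):
        """Gaussian activation: distance to each center converted to similarity."""
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-d2 / (2 * width ** 2))

    rng = np.random.default_rng(0)
    X = rng.uniform(-1, 1, size=(100, 2))
    y = np.sin(X[:, 0]) + X[:, 1] ** 2                  # toy regression target

    centers = X[rng.choice(len(X), 5, replace=False)]   # stand-in for k-means
    Phi = rbf_features(X, centers)
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)         # linear output layer

    print(rbf_features(X[:3], centers) @ w, y[:3])      # predictions vs. targets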
Stochastic Gradient Descent (cont.)

* For SVMs, the error function to be minimized is called the hinge loss, max(0, 1 - z) with z = y f(x).
* In the linearly separable case, the hinge loss is 0 for a function that successfully separates the data. The maximum margin hyperplane is given by the smallest weight vector that achieves 0 hinge loss.
* The hinge loss is not differentiable at z = 1, so we can't compute a gradient there. Remedy: use a subgradient, something that resembles a gradient; here, simply use 0 at z = 1. In fact, the loss is 0 for all z >= 1, so we can focus on z < 1 and proceed as usual.

Instance-Based Learning

Practical problems of the 1-NN scheme:
* Slow (but fast tree-based approaches exist). Remedy: remove irrelevant data.
* Noise (but k-NN copes quite well with noise). Remedy: remove noisy instances.
* All attributes are deemed equally important. Remedy: weight the attributes (or simply select some).
* Doesn't perform explicit generalization. Remedy: a rule-based NN approach, e.g. rectangular generalizations.

IB2: Save Memory, Speed Up Classification

* Works incrementally, and only incorporates misclassified instances.
* Problem: noisy data gets incorporated.

IB3: Deal with Noise

* Discard instances that don't perform well.
* Compute confidence intervals for (1) each instance's success rate and (2) the default accuracy of its class.
* Accept an instance if the lower limit of (1) exceeds the upper limit of (2); reject it if the upper limit of (1) is below the lower limit of (2).

IB4: Weight Each Attribute

* Weights can be class-specific. Use the weighted Euclidean distance:

    \sqrt{ w_1^2 (x_1 - y_1)^2 + ... + w_n^2 (x_n - y_n)^2 }

* Update the weights based on the nearest neighbor: if its class is correct, increase the weights; if incorrect, decrease them. The amount of change for the i-th attribute depends on |x_i - y_i|.

Learning Prototypes

* Only those instances involved in a decision need to be stored, and noisy instances should be filtered out.
* Idea: only use prototypical examples.

Rectangular Generalizations

* The nearest-neighbor rule is used outside the rectangles.
* Rectangles are rules! (But they can be more conservative than "normal" rules.)
* Nested rectangles are rules with exceptions.

Generalized Exemplars

* Generalize instances into hyperrectangles. Online: incrementally modify the rectangles. Offline version: seek a small set of rectangles that cover the instances.
* Important design decisions: Allow overlapping rectangles? (Requires conflict resolution.) Allow nested rectangles? How to deal with uncovered instances? (A classification sketch follows below.)
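A minimal sketch of classification with hyperrectangles under these assumptions: a point inside a rectangle takes that rectangle's class, otherwise the nearest rectangle wins (the nearest-neighbor rule outside the rectangles); overlap resolution and nesting are deliberately ignored:

    # Generalized-exemplar classification sketch: each exemplar is a
    # hyperrectangle (per-attribute [lo, hi] bounds) with a class label.
    import math

    def distance_to_rectangle(x, lows, highs):
        """Euclidean distance from point x to the rectangle; 0 if x is inside."""
        d2 = 0.0
        for xi, lo, hi in zip(x, lows, highs):
            if xi < lo:
                d2 += (lo - xi) ** 2
            elif xi > hi:
                d2 += (xi - hi) ** 2
        return math.sqrt(d2)

    def classify(x, rectangles):
        """rectangles: list of (lows, highs, label) triples; nearest wins."""
        return min(rectangles,
                   key=lambda r: distance_to_rectangle(x, r[0], r[1]))[2]

    rects = [([0, 0], [2, 2], "class 1"), ([3, 3], [5, 5], "class 2")]  # toy data
    print(classify([1, 1], rects))      # inside the first rectangle -> class 1
    print(classify([2.9, 2.9], rects))  # outside both; nearest is class 2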
Separating Generalized Exemplars

(Figure: exemplars of Class 1 and Class 2 generalized into hyperrectangles, with a separation line between the two groups.)

Generalized Distance Functions

* Given: some transformation operations on attributes.
* K*: similarity is the probability of transforming instance A into instance B "by chance".
* Average over all transformation paths, weighting the paths according to their probability (this needs a way of measuring path probabilities).
* This gives a uniform way of dealing with different attribute types, and it is easily generalized to give a distance between sets of instances.

Numeric Prediction

* Counterparts exist for all the schemes previously discussed: decision trees, rule learners, SVMs, etc.
* (Almost) all classification schemes can be applied to regression problems using discretization: discretize the class into intervals, predict the weighted average of the interval midpoints, and weight according to the class probabilities (see the sketch below).
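A minimal sketch of this discretization recipe. The interval-probability vector stands in for the output of a hypothetical trained classifier:

    # Regression by discretization: split the numeric class into intervals,
    # train any classifier on the interval labels, then predict the weighted
    # average of the interval midpoints using the class probabilities.

    def make_intervals(lo, hi, k):
        """Split [lo, hi] into k equal-width intervals; return their midpoints."""
        width = (hi - lo) / k
        return [lo + width * (i + 0.5) for i in range(k)]

    def predict_numeric(interval_probs, midpoints):
        """Weighted average of midpoints, weighted by class probabilities."""
        return sum(p * m for p, m in zip(interval_probs, midpoints))

    midpoints = make_intervals(0.0, 10.0, 5)    # [1.0, 3.0, 5.0, 7.0, 9.0]
    probs = [0.05, 0.10, 0.60, 0.20, 0.05]      # hypothetical classifier output
    print(predict_numeric(probs, midpoints))    # 5.2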