 Bayesian Classification
 Other Classification Methods
 Classification Accuracy
 Summary
 Probabilistic learning: Calculate explicit probabilities for a hypothesis; among the most practical approaches to certain types of learning problems
 Incremental: Each training example can incrementally increase/decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data.
 Probabilistic prediction: Predict multiple hypotheses, weighted by their probabilities
 Standard: Even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured

 Given training data D, the posterior probability of a hypothesis h, P(h|D), follows Bayes theorem:
P(h|D) = P(D|h)·P(h) / P(D)
 MAP (maximum a posteriori) hypothesis:
h_MAP = arg max_{h∈H} P(h|D) = arg max_{h∈H} P(D|h)·P(h)
 Practical difficulty: requires initial knowledge of many probabilities, significant computational cost
 The classification problem may be formalized using a-posteriori probabilities:
 P(C|X) = probability that the sample tuple X = <x1, …, xk> is of class C
e.g. P(class=N | outlook=sunny, windy=true, …)
 Idea: assign to sample X the class label C such that P(C|X) is maximal
 Bayes theorem:
P(C|X) = P(X|C)·P(C) / P(X)
 P(X) is constant for all classes
 P(C) = relative freq. of class C samples
 Choose C such that P(C|X) is maximum = choose C such that P(X|C)·P(C) is maximum
 Problem: computing P(X|C) directly is infeasible, since it would require estimating the joint distribution of all attribute values within each class
 A simplified assumption: attributes are conditionally independent:
P(Cj|V) ∝ P(Cj) · ∏_{i=1..n} P(vi|Cj)
 Greatly reduces the computation cost; only count the class distribution.
 Naïve assumption: attribute independence
P(x1,…,xk|C) = P(x1|C)·…·P(xk|C)
 If the i-th attribute is categorical:
P(xi|C) is estimated as the relative frequency of samples having value xi as the i-th attribute in class C
 If the i-th attribute is continuous:
P(xi|C) is estimated through a Gaussian density function
 Computationally easy in both cases (see the sketch below)
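As a concrete illustration of these two estimates, here is a minimal Python sketch (not from the slides); the function names and the data layout are illustrative assumptions:

    # Minimal sketch of estimating P(xi|C) for naive Bayes:
    # relative frequencies for categorical attributes, Gaussian density for continuous ones.
    import math
    from collections import Counter, defaultdict

    def fit_categorical(rows, attr_index, labels):
        """P(xi|C) as the relative frequency of value xi among samples of class C."""
        counts = defaultdict(Counter)          # class -> attribute value -> count
        class_totals = Counter(labels)         # class -> number of samples
        for row, label in zip(rows, labels):
            counts[label][row[attr_index]] += 1
        return {c: {v: n / class_totals[c] for v, n in vals.items()}
                for c, vals in counts.items()}

    def gaussian_density(x, mean, std):
        """P(xi|C) for a continuous attribute via a Gaussian density."""
        return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (math.sqrt(2 * math.pi) * std)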
Outlook   Temperature  Humidity  Windy  Class
sunny     hot          high      false  N
sunny     hot          high      true   N
overcast  hot          high      false  P
rain      mild         high      false  P
rain      cool         normal    false  P
rain      cool         normal    true   N
overcast  cool         normal    true   P
sunny     mild         high      false  N
sunny     cool         normal    false  P
rain      mild         normal    false  P
sunny     mild         normal    true   P
overcast  mild         high      true   P
overcast  hot          normal    false  P
rain      mild         high      true   N

P(p) = 9/14
P(n) = 5/14
outlook
P(sunny|p) = 2/9        P(sunny|n) = 3/5
P(overcast|p) = 4/9     P(overcast|n) = 0
P(rain|p) = 3/9         P(rain|n) = 2/5
temperature
P(hot|p) = 2/9          P(hot|n) = 2/5
P(mild|p) = 4/9         P(mild|n) = 2/5
P(cool|p) = 3/9         P(cool|n) = 1/5
humidity
P(high|p) = 3/9         P(high|n) = 4/5
P(normal|p) = 6/9       P(normal|n) = 1/5
windy
P(true|p) = 3/9         P(true|n) = 3/5
P(false|p) = 6/9        P(false|n) = 2/5
 An unseen sample X = <rain, hot, high, false>
 P(X|p)·P(p) = P(rain|p)·P(hot|p)·P(high|p)·P(false|p)·P(p) = 3/9·2/9·3/9·6/9·9/14 = 0.010582
 P(X|n)·P(n) = P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n) = 2/5·2/5·4/5·2/5·5/14 = 0.018286
 Sample X is classified as class n (don't play tennis), since 0.018286 > 0.010582
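The arithmetic above can be checked with a few lines of Python that hard-code the probabilities read off the tables (a minimal sketch, not a general classifier):

    # Reproduce the play-tennis calculation for X = <rain, hot, high, false>
    # using the class priors and conditional probabilities estimated above.
    p_p = (3/9) * (2/9) * (3/9) * (6/9) * (9/14)   # P(X|p)*P(p) ~= 0.010582
    p_n = (2/5) * (2/5) * (4/5) * (2/5) * (5/14)   # P(X|n)*P(n) ~= 0.018286
    print("predicted class:", "p" if p_p > p_n else "n")   # -> n (don't play tennis)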
 Bayesian Classification
 Other Classification Methods
 Classification Accuracy
 Summary
 Advantages (of neural network classification)
 prediction accuracy is generally high
 robust, works when training examples contain errors
 output may be discrete, real-valued, or a vector of several discrete or real-valued attributes
 fast evaluation of the learned target function
 Criticisms
 long training time (model construction time)
 difficult to understand the learned function (weights)
 not easy to incorporate domain knowledge
X1  X2  X3  Y
1   0   0   0
1   0   1   1
1   1   0   1
1   1   1   1
0   0   1   0
0   1   0   0
0   1   1   1
0   0   0   0

Output Y is 1 if at least two of the three inputs are equal to 1.
The same input-output table can be reproduced by a simple linear threshold model:
Y = I(0.3·X1 + 0.3·X2 + 0.3·X3 - 0.4 > 0)
where I(z) = 1 if the condition z is true, and 0 otherwise.
 Model is an assembly of inter-connected nodes and weighted links
 Output node sums up each of its input values according to the weights of its links
 Compare the output node against some threshold t
Perceptron Model:
Y = I(Σi wi·Xi - t > 0)   or   Y = sign(Σi wi·Xi - t)
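To make the decision rule concrete, here is a minimal Python sketch of the perceptron output using the example weights (0.3, 0.3, 0.3) and threshold 0.4 from the earlier slide; the function name is an illustrative assumption:

    # Perceptron decision rule Y = I(sum_i w_i*X_i - t > 0),
    # with the example weights w = (0.3, 0.3, 0.3) and threshold t = 0.4.
    def perceptron_predict(x, w=(0.3, 0.3, 0.3), t=0.4):
        return 1 if sum(wi * xi for wi, xi in zip(w, x)) - t > 0 else 0

    # Reproduces "Y is 1 if at least two of the three inputs are equal to 1":
    for x in [(1, 0, 0), (1, 1, 0), (1, 1, 1), (0, 1, 1)]:
        print(x, "->", perceptron_predict(x))    # 0, 1, 1, 1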
 Initialize the weights (w0, w1, …, wk)
 Adjust the weights in such a way that the output of the NN is consistent with the class labels of the training examples
 This can be done by optimizing the objective function
E = Σ_{i∈D} (Yi - Ŷi)²
where Yi and Ŷi are the actual class label and the predicted class value, respectively, of the i-th training example in data set D
 The most well-known algorithm is the backpropagation algorithm (a gradient-descent sketch follows)
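As an illustration of minimizing E by gradient descent, here is a minimal sketch for a single sigmoid unit; this is not the full backpropagation algorithm, and the learning rate, epoch count, and sigmoid output are illustrative assumptions:

    # Gradient-descent weight updates for one sigmoid unit,
    # minimizing E = sum_i (Y_i - Yhat_i)^2 over the training set D.
    import math

    def train(D, k, lr=0.5, epochs=1000):
        w = [0.0] * (k + 1)                        # w[0] is the bias weight
        for _ in range(epochs):
            for x, y in D:                         # x is a k-tuple, y is 0 or 1
                z = w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))
                yhat = 1.0 / (1.0 + math.exp(-z))  # sigmoid output (predicted class value)
                grad = -2.0 * (y - yhat) * yhat * (1.0 - yhat)   # dE_i/dz
                w[0] -= lr * grad
                for j, xj in enumerate(x):
                    w[j + 1] -= lr * grad * xj
        return w

For instance, it could be applied to the eight-row example above with k=3.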
 Several methods for association-based classification
 ARCS: Quantitative association mining and clustering of association rules (Lent et al'97)
 It beats C4.5 in (mainly) scalability and also accuracy
 Associative classification (Liu et al'98)
 It mines high support and high confidence rules in the form of "cond_set => y", where y is a class label (see the sketch after this list)
 CAEP: Classification by aggregating emerging patterns (Dong et al'99)
 Emerging patterns (EPs): itemsets whose support increases significantly from one class to another
 Mine EPs based on minimum support and growth rate
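As an illustration of how "cond_set => y" rules can be used for prediction, here is a minimal sketch; the rule representation, the highest-confidence tie-breaking, and the example rules with their confidences are illustrative assumptions, not the actual ARCS or associative-classification algorithms:

    # Classify a tuple with association rules of the form "cond_set => y":
    # pick the highest-confidence rule whose condition set is satisfied by the tuple.
    def classify_with_rules(rules, tuple_items, default_class):
        """rules: list of (cond_set, class_label, confidence); tuple_items: set of items."""
        matching = [r for r in rules if r[0] <= tuple_items]      # cond_set is a subset
        if not matching:
            return default_class
        return max(matching, key=lambda r: r[2])[1]               # highest confidence wins

    # Hypothetical rules in the style of the weather data above:
    rules = [({"outlook=overcast"}, "P", 1.0),
             ({"outlook=sunny", "humidity=high"}, "N", 0.8)]
    print(classify_with_rules(rules, {"outlook=sunny", "humidity=high", "windy=true"}, "P"))  # -> N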
 Instance-based classification:
 Store training examples and delay the processing ("lazy evaluation") until a new instance must be classified
 Typical approaches:
 k-nearest neighbor approach: instances represented as points in a Euclidean space
 Case-based reasoning: uses symbolic representations and knowledge-based inference
[Figure: a set of stored cases with attributes Atr1 … AtrN and a Class label, and an unseen case to be classified]
• Store the training records
• Use training records to predict the class label of unseen cases
 Basic idea: If it walks like a duck, quacks like a duck, then it's probably a duck
[Figure: compute the distance from the test record to the training records, then choose the k "nearest" records]
[Figure: an unknown record plotted among the training records]
 Requires three things
– The set of stored records
– A distance metric to compute the distance between records
– The value of k, the number of nearest neighbors to retrieve
 To classify an unknown record (see the sketch below):
– Compute the distance to the other training records
– Identify the k nearest neighbors
– Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote)
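A minimal sketch of this procedure, assuming Euclidean distance and a simple majority vote (the function names and data layout are illustrative):

    # k-NN: Euclidean distance, pick the k nearest training records,
    # and take a majority vote over their class labels.
    import math
    from collections import Counter

    def euclidean(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

    def knn_classify(training, unknown, k=3):
        """training is a list of (attribute_tuple, class_label) pairs."""
        neighbors = sorted(training, key=lambda rec: euclidean(rec[0], unknown))[:k]
        votes = Counter(label for _, label in neighbors)
        return votes.most_common(1)[0][0]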
 Instance-based learning: lazy evaluation
 Decision-tree: eager evaluation
 Key differences
 Lazy methods may consider the query instance xq when deciding how to generalize beyond the training data D
 Eager methods cannot, since they have already chosen a global approximation by the time they see the query
 Efficiency: lazy - less time training but more time predicting
 Accuracy
 Lazy methods effectively use a richer hypothesis space, since they use many local linear functions to form an implicit global approximation to the target function
 Eager: must commit to a single hypothesis that covers the entire instance space
 Bayesian Classification
 Other Classification Methods
 Classification Accuracy
 Summary
 Partition: training-and-testing
 use two independent data sets, e.g., training set (2/3), test set (1/3)
 used for data sets with a large number of samples
 Cross-validation
 divide the data set into k subsamples
 use k-1 subsamples as training data and one subsample as test data --- k-fold cross-validation (see the sketch below)
 for data sets of moderate size
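A minimal sketch of k-fold cross-validation, assuming a generic train/predict interface (the function names are illustrative):

    # k-fold cross-validation: split the data into k subsamples, train on k-1 of them,
    # test on the remaining one, and average the accuracy. Assumes len(data) >= k.
    def k_fold_cross_validation(data, train_fn, predict_fn, k=10):
        """data: list of (attributes, label) pairs; train_fn(train) returns a model,
        predict_fn(model, attributes) returns a predicted label."""
        folds = [data[i::k] for i in range(k)]          # k roughly equal subsamples
        accuracies = []
        for i in range(k):
            test = folds[i]
            train = [rec for j, fold in enumerate(folds) if j != i for rec in fold]
            model = train_fn(train)
            correct = sum(1 for x, y in test if predict_fn(model, x) == y)
            accuracies.append(correct / len(test))
        return sum(accuracies) / k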
 Classification is an extensively studied problem (mainly in statistics, machine learning & neural networks)
 Classification is probably one of the most widely used data mining techniques, with a lot of extensions
 Scalability is still an important issue for database applications: thus combining classification with database techniques should be a promising topic
 Research directions: classification of non-relational data, e.g., text, spatial, multimedia, etc.