Midterm Review/Coverage
- Data mining concepts: what it is, the process, techniques, the problems it can solve (functionality), evaluation
- DW and OLAP: concept and functionality, cube and cuboid, aggregation, OLAP operators (drill down, …), dimension and hierarchy
- Data Preprocessing: concept, missing data, noisy data, data normalization, data reduction, sampling, discretization
- Data mining primitives: concept and application

Midterm Review/Coverage
- Decision Tree Methods: concept, cross-validation, tree construction method (gain and gain ratio), continuous variables, missing values, bias, pruning method (concept), rules
If you need help…
- My office hour (this week only): Wed 3-5 pm
- TA office hours:
  – Huajie Zhang: MC 21, Tuesday 2-4 pm
  – Wenxia Jiang: ???, Tuesday 2-4 pm
Data Mining and Data Warehousing
- Introduction
- Data warehousing and OLAP
- Data preprocessing for mining and warehousing
- Concept description: characterization and discrimination
- Classification and prediction
- Association analysis
- Clustering analysis
- Mining complex data and advanced mining techniques
- Trends and research issues
Data Mining and Data Warehousing: Session 5
Classification and Prediction
Session 5. Classification and Prediction
- Introduction
- Decision Tree Induction
- Bayesian Classification
- Neural Networks
- Other Classification Methods
- Prediction Methods
Mining Classification and Prediction
- Classification:
  – independent variables (description, features) => dependent variables (target, class label)
  – training set
- Prediction
- Typical applications:
  – credit approval
  – target marketing
  – medical diagnosis
  – treatment effectiveness analysis
  – …
Classification Process (I)
The training data below are fed to classification algorithms to produce a classifier (model).

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

Learned classifier (model):
IF rank = ‘professor’ OR years > 6
THEN tenured = ‘yes’
Classification Process (II)
The classifier is evaluated on the testing data below and then applied to unseen data, e.g. (Jeff, Professor, 4) → Tenured? (a sketch applying the learned rule follows below)

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes
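To make the two phases concrete, here is a minimal Python sketch (not from the original slides; names are illustrative) that applies the learned rule to the testing data and to the unseen tuple:

```python
def predict_tenured(rank: str, years: int) -> str:
    """Classifier (model) learned in Process (I): professor OR years > 6 -> tenured."""
    return "yes" if rank.lower() == "professor" or years > 6 else "no"

test_data = [            # (NAME, RANK, YEARS, actual TENURED)
    ("Tom",     "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George",  "Professor",      5, "yes"),
    ("Joseph",  "Assistant Prof", 7, "yes"),
]

correct = sum(predict_tenured(r, y) == actual for _, r, y, actual in test_data)
print(f"test accuracy: {correct}/{len(test_data)}")          # 3/4: Merlisa is misclassified
print("Jeff, Professor, 4 ->", predict_tenured("Professor", 4))  # 'yes'
```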
Supervised vs. Unsupervised Learning
- Supervised learning (classification)
  – Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  – New data are classified based on the training set
- Unsupervised learning (clustering)
  – We are given a set of measurements, observations, etc., with the aim of establishing the existence of classes or clusters in the data
  – No training data, or the “training data” are not accompanied by class labels
Evaluating Classification Methods
- Predictive accuracy (and more)
- Speed and scalability
  – time to learn
  – speed of the classifier
- Robustness
  – noise
  – missing values
- Explainability: e.g., decision trees vs. neural networks
- Goodness of rules
  – decision tree size
  – the compactness of classification rules
Classification Accuracy: Estimating Error Rates
- Partition: training-and-testing
  – use two independent data sets, e.g., training set (2/3) and test set (1/3)
  – used for data sets with a large number of samples
- Cross-validation (see the sketch below)
  – divide the data set into k subsamples
  – use k-1 subsamples as training data and one subsample as test data; do this k times: k-fold cross-validation
  – for data sets of moderate size
- Bootstrapping (leave-one-out: k = sample size)
  – for small data sets
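A rough illustration of k-fold cross-validation as described above; the `train_fn` and `test_fn` callables are an assumed interface (any learner that can be fit on a list of examples and report a misclassification), not something given in the slides:

```python
import random

def k_fold_error_rate(examples, train_fn, test_fn, k=10, seed=0):
    """Estimate the error rate by k-fold cross-validation: split the data into k
    subsamples, train on k-1 of them, and test on the held-out one, k times."""
    data = examples[:]
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]              # k roughly equal subsamples
    errors = 0
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = train_fn(train)                         # e.g. build a decision tree
        errors += sum(test_fn(model, x) for x in test)  # test_fn returns 1 per misclassified case
    return errors / len(data)
```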
Session 5. Classification and Prediction
- Introduction
- Decision Tree Induction
- Bayesian Classification
- Neural Networks
- Other Classification Methods
- Prediction Methods
Training Dataset
An example from Quinlan’s ID3:

Outlook   Temperature  Humidity  Windy  Class
sunny     hot          high      false  N
sunny     hot          high      true   N
overcast  hot          high      false  P
rain      mild         high      false  P
rain      cool         normal    false  P
rain      cool         normal    true   N
overcast  cool         normal    true   P
sunny     mild         high      false  N
sunny     cool         normal    false  P
rain      mild         normal    false  P
sunny     mild         normal    true   P
overcast  mild         high      true   P
overcast  hot          normal    false  P
rain      mild         high      true   N
A Sample Decision Tree

Outlook?
├─ sunny    → Humidity?  high → N,  normal → P
├─ overcast → P
└─ rain     → Windy?     true → N,  false → P
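The same tree can be written directly as code. A small sketch, treating windy as a boolean (this representation is assumed, not from the slides):

```python
def classify(outlook, humidity, windy):
    """Classify a weather example with the sample decision tree above."""
    if outlook == "sunny":
        return "N" if humidity == "high" else "P"
    if outlook == "overcast":
        return "P"
    return "N" if windy else "P"   # outlook == "rain"

print(classify("sunny", "high", False))   # 'N'
print(classify("rain", "high", False))    # 'P'
```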
Decision-Tree Classification Methods
- The basic top-down decision tree generation approach usually consists of two phases:
  – Tree construction
    - At the start, all the training examples are at the root.
    - Partition the examples recursively based on selected attributes.
  – Tree pruning
    - Aims at removing tree branches that may lead to errors when classifying test data (training data may contain noise, statistical fluctuations, …).
Primary Issues in Tree Construction
- Split criterion: goodness function
  – used to select the attribute to be split at a tree node during the tree generation phase
  – different algorithms may use different goodness functions: information gain, gini index
- Branching scheme:
  – determining the tree branch to which a sample belongs
  – binary versus k-ary splitting
- When to stop further splitting of a node, e.g. an impurity measure
- Labeling rule: a node is labeled as the class to which most samples at the node belong
Information Gain (ID3/C4.5)
- Assume that there are two classes, P and N.
- Let the set of examples S contain p elements of class P and n elements of class N.
- The amount of information needed to decide whether an arbitrary example in S belongs to P or N is defined as

  $$I(p, n) = -\frac{p}{p+n}\log_2\frac{p}{p+n} - \frac{n}{p+n}\log_2\frac{n}{p+n}$$

- Assume that using attribute A as the root of the tree will partition S into sets {S1, S2, …, Sv}.
- If Si contains pi examples of P and ni examples of N, the expected information needed to classify objects in all subtrees Si is

  $$E(A) = \sum_{i=1}^{v}\frac{p_i + n_i}{p + n}\, I(p_i, n_i)$$
Information Gain -- Example
- The attribute A is selected such that the information gain
  gain(A) = I(p, n) - E(A)
  is maximal, that is, E(A) is minimal, since I(p, n) is the same for all attributes at a node.
- In the given sample data, attribute outlook is chosen to split at the root (the sketch below reproduces these numbers):
  gain(outlook) = 0.246
  gain(temperature) = 0.029
  gain(humidity) = 0.151
  gain(windy) = 0.048
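A small Python sketch of the gain computation, assuming the 14 weather examples are available as a list of dicts; the helper names `info` and `gain` are illustrative, not from the slides:

```python
from math import log2
from collections import defaultdict

def info(p, n):
    """I(p, n): information needed to decide P vs. N in a set with p P-examples and n N-examples."""
    total = p + n
    return sum(-c / total * log2(c / total) for c in (p, n) if c)

def gain(examples, attr, class_attr="Class"):
    """gain(A) = I(p, n) - E(A), for examples given as dicts like
    {"Outlook": "sunny", "Temperature": "hot", ..., "Class": "N"}."""
    p = sum(1 for e in examples if e[class_attr] == "P")
    n = len(examples) - p
    subsets = defaultdict(list)
    for e in examples:
        subsets[e[attr]].append(e)
    expected = sum(
        (len(s) / len(examples)) * info(sum(1 for e in s if e[class_attr] == "P"),
                                        sum(1 for e in s if e[class_attr] == "N"))
        for s in subsets.values())
    return info(p, n) - expected

# With the 14 weather examples above: gain(examples, "Outlook") ≈ 0.246, gain(examples, "Windy") ≈ 0.048.
```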
Improved Measures for Selecting Attributes
- Information gain naturally favors attributes with many values.
- One alternative measure: the gain ratio (Quinlan’86), which penalizes attributes with many values:

  $$SplitInfo(S, A) = -\sum_i \frac{|S_i|}{|S|}\log_2\frac{|S_i|}{|S|}, \qquad GainRatio(S, A) = \frac{Gain(S, A)}{SplitInfo(S, A)}$$

- Problem: the denominator can be 0 or close to 0, which makes GainRatio very large (see the sketch below).
- There are many other measures. Mingers’91 provides an experimental analysis of the effectiveness of several selection measures over a variety of problems.
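A companion sketch of the gain ratio, reusing the hypothetical `gain()` helper and example format from the previous sketch, with a guard for the near-zero denominator noted above:

```python
from math import log2
from collections import Counter

def gain_ratio(examples, attr):
    """GainRatio(S, A) = Gain(S, A) / SplitInfo(S, A)."""
    counts = Counter(e[attr] for e in examples)
    split_info = -sum((c / len(examples)) * log2(c / len(examples)) for c in counts.values())
    # Guard: when SplitInfo is (near) zero the ratio blows up, as the slide warns.
    return gain(examples, attr) / split_info if split_info > 1e-9 else float("inf")
```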
Gini Index
- If a data set T contains examples from n classes, the gini index gini(T) is defined as

  $$gini(T) = 1 - \sum_{j=1}^{n} p_j^2$$

  where pj is the relative frequency of class j in T.
- If T is split into two subsets T1 and T2 with sizes N1 and N2 respectively, the gini index of the split data is defined as

  $$gini_{split}(T) = \frac{N_1}{N}\,gini(T_1) + \frac{N_2}{N}\,gini(T_2)$$

- The attribute that provides the smallest gini_split(T) is chosen to split the node (need to enumerate all possible splitting points for each attribute). A sketch follows below.
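A minimal sketch of the two formulas above, with class labels passed as plain lists (helper names assumed, not from the slides):

```python
from collections import Counter

def gini(labels):
    """gini(T) = 1 - sum_j p_j^2 over the class frequencies in the label list."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(left_labels, right_labels):
    """Size-weighted gini of a binary split T1/T2."""
    n = len(left_labels) + len(right_labels)
    return (len(left_labels) / n * gini(left_labels)
            + len(right_labels) / n * gini(right_labels))
```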
Continuous Attributes in Decision-Tree Induction
- Dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals.

  Temperature   40  48  60   72   80   90
  play tennis   No  No  Yes  Yes  Yes  No

- Sort the examples according to the continuous attribute A, then identify adjacent examples that differ in their target classification, generate a set of candidate thresholds midway between them, and select the one with the maximum gain (see the sketch below).
- Compare the best numerical split with the discrete-variable splits.
- Extensible to splitting continuous attributes into multiple intervals.
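A sketch of the candidate-threshold step described above, applied to the temperature example from this slide (function name assumed):

```python
def candidate_thresholds(values, labels):
    """Sort by the continuous attribute and place a candidate split midway
    between adjacent examples whose class labels differ."""
    pairs = sorted(zip(values, labels))
    return [(a + b) / 2
            for (a, la), (b, lb) in zip(pairs, pairs[1:])
            if la != lb]

print(candidate_thresholds([40, 48, 60, 72, 80, 90],
                           ["No", "No", "Yes", "Yes", "Yes", "No"]))  # [54.0, 85.0]
```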
Dealing with Missing Values in C4.5
- During tree construction:
  – gain = gain computed on known cases × proportion of known cases
  – partitioning of training cases: weight according to the partition
- During tree testing:
  – weight according to the partition
  – take a weighted sum of the different predictions and choose the most likely one
  – better than assigning the most common value
Search and Bias in Decision-Tree Induction
- Inductive bias (preference bias): the induction prefers
  – smaller trees, and
  – trees that place high-information-gain attributes close to the root
  – no representation bias
- Search strategy: hill-climbing without backtracking; can get stuck in a local optimum
- Why prefer small hypotheses? Occam’s razor: prefer the simplest hypothesis
- Decision separators formed by decision trees
Avoid Overfitting in Classification
- A tree that is generated may overfit the training examples due to noise or a too-small set of training data.
- Two approaches to avoid overfitting:
  – (Stop earlier): stop growing the tree earlier.
  – (Post-prune): allow overfitting and then post-prune the tree.
- Approaches to determine the correct final tree size:
  – Separate training and testing sets, or use cross-validation.
  – Use all the data for training, but apply a statistical test to estimate whether expanding or pruning a node is likely to improve performance over the entire distribution.
  – Use the minimum description length (MDL) principle: halt growth of the tree when the encoding is minimized.
- Rule post-pruning (C4.5): convert the tree to rules, then prune.
Tree Pruning
- A decision tree constructed using the training data may have too many branches/leaf nodes.
  – caused by noise, overfitting
  – may result in poor accuracy for unseen samples
- Prune the tree: merge a subtree into a leaf node.
  – Use a set of data different from the training data.
  – At a tree node, if the accuracy without splitting is higher than the accuracy with splitting, replace the subtree with a leaf node and label it using the majority class.
- Issues:
  – obtaining the testing data
  – criteria other than accuracy (e.g., minimum description length)
Pruning Criterion
- Use a separate set of examples to evaluate the utility of post-pruning nodes from the tree
  – CART uses cost-complexity pruning
- Apply a statistical test to estimate whether expanding (or pruning) a particular node helps
  – C4.5 uses pessimistic pruning (error rate: upper limit of the binomial distribution)
- Minimum description length
  – SLIQ and SPRINT use MDL pruning
C4.5: Popular Decision Tree Learner
- Ross Quinlan, a machine learning researcher
- Improved (commercial) version: See5/C5
  – free demo version (max 200 cases): http://www.rulequest.com/
  – try it out!
- Also: Cognos Scenario demo
C4.5rules: Rules from the Tree, and Pruning
- For each leaf, form a rule by collecting all attribute-value pairs on the path from the root
- The rule set is equivalent to the tree
- Pruning:
  – for each condition in a rule, delete it if the rule is better without it, judged on a cross-validation set or by the estimated error on the same training set
  – this may greatly simplify the rule set
- Differences between the rule set and the tree:
  – format of the rules vs. the tree
  – unique decision vs. possible conflicts that need resolution
More to Think About…
- Different costs for errors
  – classification errors
  – a tree that minimizes the total loss
- Different costs in testing attributes
  – a tree that minimizes the cost of testing/using the tree
- Nodes that take combinations of attributes
  – logical combinations
  – linear combinations
Boosting Techniques (I)
- Boosting increases classification accuracy.
- It can be applied to decision trees or Bayesian classifiers.
- Learn a series of classifiers.
- Combine the classifiers by (weighted) voting.
- Boosting requires only a linear time and constant space increase.
Bagging
- Randomly sample with replacement, sample size N
- Build a decision tree from the sample
- Do this 10-20 times
- To predict: all trees predict; take the majority-voted prediction (see the sketch below)
- Improves unstable classifiers
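A bagging sketch following these steps; the `learn_tree` callable (which returns a tree that can itself be called on an example) is an assumed interface, not something given in the slides:

```python
import random
from collections import Counter

def bagging(examples, learn_tree, rounds=15, seed=0):
    """Build `rounds` trees on bootstrap samples of size N and vote their predictions."""
    rng = random.Random(seed)
    trees = [learn_tree([rng.choice(examples) for _ in range(len(examples))])  # sample with replacement
             for _ in range(rounds)]

    def predict(x):
        return Counter(tree(x) for tree in trees).most_common(1)[0][0]  # majority vote
    return predict
```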
Adaptive Boosting: AdaBoost
- Assign every example an equal weight 1/N
- For t = 1, 2, …, T:
  – obtain a hypothesis (classifier) h(t) under the weights w(t)
  – calculate the error of h(t) and re-weight the examples based on the error
  – normalize w(t+1) to sum to 1
- Output a weighted sum of all the hypotheses, with each hypothesis weighted according to its accuracy on the training set (a sketch follows below)
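A simplified AdaBoost sketch following the steps above, for labels in {-1, +1}; the `weak_learner` interface (returning a hypothesis callable) is an assumption, not from the slides:

```python
import math

def adaboost(examples, labels, weak_learner, T=10):
    """weak_learner(examples, labels, weights) -> hypothesis h(x) in {-1, +1}."""
    N = len(examples)
    w = [1.0 / N] * N                                   # equal initial weights
    ensemble = []                                       # (alpha, hypothesis) pairs
    for _ in range(T):
        h = weak_learner(examples, labels, w)
        err = sum(wi for wi, x, y in zip(w, examples, labels) if h(x) != y)
        err = max(err, 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)         # hypothesis weight from its error
        w = [wi * math.exp(-alpha * y * h(x)) for wi, x, y in zip(w, examples, labels)]
        total = sum(w)
        w = [wi / total for wi in w]                    # normalize w(t+1) to sum to 1
        ensemble.append((alpha, h))
    return lambda x: 1 if sum(a * h(x) for a, h in ensemble) >= 0 else -1
```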
Classification and Databases
- Classification is a classical problem extensively studied by
  – statisticians
  – AI, especially machine learning, researchers
- Database researchers re-examined the problem in the context of large databases
  – most previous studies used small data sets, and most algorithms are memory resident
- Recent data mining research contributes to
  – scalability
  – generalization-based classification
  – parallel and distributed processing
Classifying Large Datasets
- Decision trees seem to be a good choice
  – relatively faster learning speed than other classification methods
  – can be converted into simple and easy-to-understand classification rules
  – can be used to generate SQL queries for accessing databases
  – comparable classification accuracy with other methods
- Objective
  – classifying data sets with millions of examples and a few hundred, or even thousands of, attributes with reasonable speed
Scalable Decision Tree Methods
- Most algorithms assume the data can fit in memory.
- Data mining research contributes to the scalability issue, especially for decision trees.
- Successful examples:
  – SLIQ (EDBT’96 -- Mehta et al.’96)
  – SPRINT (VLDB’96 -- J. Shafer et al.’96)
  – PUBLIC (VLDB’98 -- Rastogi & Shim’98)
  – RainForest (VLDB’98 -- Gehrke et al.’98)
RainForest
- Gehrke, Ramakrishnan, and Ganti (VLDB’98)
- A generic algorithm that separates the scalability aspects from the criteria that determine the quality of the tree
- Based on two observations:
  – tree classifiers follow a greedy top-down induction schema
  – when evaluating each attribute, the information about the class label distribution is enough
- AVC-list (attribute, value, class label) data structure
Data Cube-Based Decision-Tree Induction
- Integration of generalization with decision-tree induction (Kamber et al.’97)
- Classification at primitive concept levels
  – e.g., precise temperature, humidity, outlook, etc.
  – low-level concepts, scattered classes, bushy classification trees
  – semantic interpretation problems
- Cube-based multi-level classification
  – relevance analysis at multiple levels
  – information-gain analysis with dimension + level
Presentation of Classification Rules
Session 5. Classification and Prediction
- Introduction
- Decision Tree Induction
- Bayesian Classification
- Neural Networks
- Other Classification Methods
- Prediction Methods
Bayesian Classification: Why?
- Probabilistic learning: calculate explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems.
- Incremental: each training example can incrementally increase or decrease the probability that a hypothesis is correct. Prior knowledge can be combined with observed data.
- Probabilistic prediction: predict multiple hypotheses, weighted by their probabilities.
- Standard: even in cases where Bayesian methods prove computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured.
Bayesian Theorem
- Given training data D, the posterior probability of a hypothesis h, P(h|D), follows Bayes’ theorem:

  $$P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}$$

- MAP (maximum a posteriori) hypothesis:

  $$h_{MAP} = \arg\max_{h \in H} P(h \mid D) = \arg\max_{h \in H} P(D \mid h)\,P(h)$$

- Practical difficulty: requires initial knowledge of many probabilities; significant computational cost.
More on the Bayesian Theorem…
- p(h|D) = p(D|h) p(h) / p(D)
  – p(h) = p(h|_): prior probability before the data (evidence)
  – D: data, observation, evidence, facts, …
  – p(h|D): posterior probability after seeing (only) the evidence D
- Hard to calculate… but to compare two hypotheses it suffices to take the ratio:
  p(h1|D) / p(h2|D) = p(D|h1) p(h1) / (p(D|h2) p(h2))
A true equation underlying all causal modeling
- Diagnosis: medical, car, …
  p(+|cancer) = 0.98, so p(-|cancer) = 0.02
  p(+|no_cancer) = 0.03, so p(-|no_cancer) = 0.97
- If I test positive, how likely is it that I have cancer?
  p(cancer|+) / p(no_cancer|+) = p(+|cancer) p(cancer) / (p(+|no_cancer) p(no_cancer))
  = 0.98 p(cancer) / (0.03 p(no_cancer))
- If p(cancer) = 0.05, the ratio = 0.98*0.05 / (0.03*0.95) ≈ 1.72, so p(cancer|+) = 1.72/(1+1.72) ≈ 63%
- If p(cancer) = 0.001, the ratio = 0.98*0.001 / (0.03*0.999) ≈ 0.033, so p(cancer|+) = 0.033/1.033 ≈ 3.2%
- The prior probability matters a lot (a sketch of this calculation follows below)
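The same posterior-odds arithmetic as a tiny Python sketch (the function name is assumed; the test-accuracy numbers are the ones used on the slide):

```python
def posterior_cancer(prior, sensitivity=0.98, false_positive=0.03):
    """P(cancer | +) via the odds form of Bayes' theorem."""
    odds = (sensitivity * prior) / (false_positive * (1 - prior))
    return odds / (1 + odds)

print(posterior_cancer(0.05))    # ≈ 0.63
print(posterior_cancer(0.001))   # ≈ 0.032
```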
A true equation underlying all causal modeling
- Fortune tellers: are they real?
  p(long_life | born_in_may) / p(short_life | born_in_may)
  = p(born_in_may | long_life) p(long_life) / (p(born_in_may | short_life) p(short_life))
  = p(long_life) / p(short_life), since the birth month is independent of life span
- If fortune tellers want to maximize predictive accuracy, they just predict the most likely events given your age (and look, …)
More on the Bayesian Theorem…
- D is a set of “observations”: D = d1, d2, …

  $$P(h \mid d_1, d_2, \ldots) = \frac{P(d_1, d_2, \ldots \mid h)\,P(h)}{P(d_1, d_2, \ldots)}$$

- di not in D means not observed (unknown), not false
- D is continuously updated, and so is p(h|D)
- To calculate p(h|D), use p(D|h). Why is p(D|h) easier?
  – p(h|D) is the “diagnostic model”
  – p(D|h) is the “causal model”: easier to think about and to obtain
Naïve Bayes Classifier (I)
- A simplifying assumption: the attributes are conditionally independent given the class:

  $$P(C_j \mid V) \propto P(C_j)\prod_{i=1}^{n} P(v_i \mid C_j)$$

- Greatly reduces the computation cost: only count the class distribution.
- Example: p(play | Outlook=sunny, T=mild, H=high, W=true) ∝
  p(play) p(O=s|play) p(T=m|play) p(H=h|play) p(W=t|play)
  = 9/14 * 2/9 * … = ...
Project
Go directly to: www.csd.uwo.ca/faculty/ling/cs411a/411proj.html
Training Dataset
An example from Quinlan’s ID3 (repeated from earlier):

Outlook   Temperature  Humidity  Windy  Class
sunny     hot          high      false  N
sunny     hot          high      true   N
overcast  hot          high      false  P
rain      mild         high      false  P
rain      cool         normal    false  P
rain      cool         normal    true   N
overcast  cool         normal    true   P
sunny     mild         high      false  N
sunny     cool         normal    false  P
rain      mild         normal    false  P
sunny     mild         normal    true   P
overcast  mild         high      true   P
overcast  hot          normal    false  P
rain      mild         high      true   N
Naive Bayesian Classifier (II)
- Given a training set, we can compute the probabilities.
- Extremely efficient; the training data need not be kept in memory.

Attribute     Value     P    N
Outlook       sunny     2/9  3/5
              overcast  4/9  0
              rain      3/9  2/5
Temperature   hot       2/9  2/5
              mild      4/9  2/5
              cool      3/9  1/5
Humidity      high      3/9  4/5
              normal    6/9  1/5
Windy         true      3/9  3/5
              false     6/9  2/5
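A sketch that reproduces the earlier example, p(class | Outlook=sunny, T=mild, H=high, W=true), directly from the counts in this table; the dictionaries are simply a transcription of the table, and the variable names are illustrative:

```python
priors = {"P": 9/14, "N": 5/14}
cond = {  # P(value | class), read off the table above
    "P": {"sunny": 2/9, "mild": 4/9, "high": 3/9, "true": 3/9},
    "N": {"sunny": 3/5, "mild": 2/5, "high": 4/5, "true": 3/5},
}
evidence = ["sunny", "mild", "high", "true"]   # Outlook, Temperature, Humidity, Windy

scores = {}
for c in priors:
    score = priors[c]
    for v in evidence:
        score *= cond[c][v]          # naive Bayes: multiply the conditional probabilities
    scores[c] = score

print(scores)   # N gets the higher score, so this example is classified as N (don't play)
```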
Bayesian Belief Networks (I)
[Figure: a Bayesian belief network over the variables FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, and Dyspnea; FamilyHistory and Smoker are the parents of LungCancer.]

The conditional probability table for the variable LungCancer:

       (FH, S)  (FH, ~S)  (~FH, S)  (~FH, ~S)
LC     0.8      0.5       0.7       0.1
~LC    0.2      0.5       0.3       0.9
Bayesian Belief Networks (II)
- A Bayesian belief network allows a subset of the variables to be conditionally independent
- A graphical model of causal relationships
- Several cases of learning Bayesian belief networks:
  – given both the network structure and all the variables: easy
  – given the network structure but only some variables
  – when the network structure is not known in advance
Session 5. Classification and Prediction
- Introduction
- Decision Tree Induction
- Bayesian Classification
- Neural Networks
- Other Classification Methods
- Prediction Methods
Neural Networks
- Advantages
  – prediction accuracy is generally high
  – robust: works when training examples contain errors
  – output may be discrete, real-valued, or a vector of several discrete or real-valued attributes
  – fast evaluation of the learned target function
- Criticism
  – long training time
  – difficult to understand the learned function (weights)
  – not easy to incorporate domain knowledge
- See also: ftp://ftp.sas.com/pub/neural/FAQ.html
A Neuron
[Figure: an n-dimensional input vector x = (x0, …, xn) with weight vector w = (w0, …, wn) feeds a weighted sum with bias -m_k, followed by an activation function f, producing the output y.]

- The n-dimensional input vector x is mapped into the variable y by means of the scalar product and a nonlinear function mapping: y = f(Σ_i w_i x_i - m_k).
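A one-function sketch of this neuron, assuming a sigmoid activation (the slide leaves the activation f unspecified; all names here are illustrative):

```python
import math

def neuron(x, w, bias):
    """Single neuron: weighted sum of the inputs minus the bias, passed through a sigmoid."""
    s = sum(wi * xi for wi, xi in zip(w, x)) - bias
    return 1 / (1 + math.exp(-s))

print(neuron([1.0, 0.5], [0.2, -0.4], bias=0.1))  # output y in (0, 1)
```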
Multi-Layer Perceptron
[Figure: input vector (input nodes x_l) → hidden nodes (weights w_lm) → output nodes (weights v_m) → output vector.]

- Hidden node m computes $f\!\Big(\sum_{l=1}^{n} x_l\, w_{lm} - r_m\Big)$ from the input nodes.
- An output node applies $\sigma$ to the weighted sum $\sum_{m=1}^{h} v_m \cdot (\text{output of hidden node } m)$ of the hidden-node outputs.
- Activation functions: $\sigma(x) = \dfrac{1}{1+e^{-x}}$ and $f(x) := \dfrac{e^{x}-e^{-x}}{e^{x}+e^{-x}}$.
Models and Architectures
- Learning paradigms
  – classification
  – clustering
  – reinforcement
- Network topologies
  – feed-forward
  – limited recurrent
  – fully recurrent
- Learning algorithm
  – cross-validation may minimize the risk of overfitting
Learning Paradigms (I)
[Figure: a network maps inputs to an actual output.]
(1) Classification: adjust the weights using Error = Desired - Actual
(2) Reinforcement: adjust the weights using a reinforcement signal
Learning Paradigms (II): Clustering
(0) Inputs are presented
(1) Outputs compete to be the winner
(2) Adjust the weights of the winner toward the input pattern
Learning Algorithms
- Back propagation for classification
- Kohonen feature maps for clustering
- Recurrent back propagation for classification
- Radial basis functions for classification
- Adaptive resonance theory
- Probabilistic neural networks
Major Steps
- Constructing a network
  – input data representation
  – selection of the number of layers and the number of nodes in each layer
- Training the network using training data
- Pruning the network
- Interpreting the results
Constructing the Network (I): Input Data Representation
[Figure: how raw inputs become network inputs.]
- Categorical values (e.g., “doctor”): hashed or looked up to a number (e.g., 3), or coded -- 1-of-N (100), thermometer (111), binary (011)
- Discrete numeric values: normalized (e.g., 0.3)
- Continuous numeric values: thresholded or discretized
Constructing the Network (II)
- The number of input nodes corresponds to the dimensionality of the input tuples
- Thermometer coding (see the sketch below):
  – age 20-80: 6 intervals
  – [20, 30) → 000001, [30, 40) → 000011, …, [70, 80) → 111111
- Number of hidden nodes: adjusted during training
- Number of output nodes: the number of classes
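A sketch of the thermometer coding for the age example above (the function name and the interval arithmetic are assumptions that reproduce the listed codes):

```python
def thermometer_code(age, low=20, high=80, bits=6):
    """Encode age 20-80 into 6 intervals; interval k sets the k low-order bits."""
    width = (high - low) // bits                 # 10-year intervals
    k = min(bits, max(1, (age - low) // width + 1))
    return format((1 << k) - 1, f"0{bits}b")

print(thermometer_code(25))  # '000001'
print(thermometer_code(35))  # '000011'
print(thermometer_code(79))  # '111111'
```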
Session 5. Classification and Prediction
- Introduction
- Decision Tree Induction
- Bayesian Classification
- Neural Networks
- Other Classification Methods
- Prediction Methods
Other Classification Methods
- Genetic algorithms
- Instance-based methods
  – k-nearest neighbor classifier
  – case-based reasoning
- Fuzzy logic
Genetic Algorithm (I)
- GA: based on an analogy to biological evolution.
- Encode the problem/solution as string(s) of “genes”.
- A diverse population (pool) of competing hypotheses is maintained.
- At each iteration (generation), the strings are evaluated for how fit they are. The most fit members are selected to produce new offspring that replace the least fit ones.
- Hypotheses are encoded as strings that are combined by crossover operations, and subjected to random mutation, to produce offspring.
- Learning is viewed as a special case of optimization.
Genetic Algorithm (II)
- Rule: IF (level = doctor) AND (GPA = 3.6) THEN result = approval
- Encoding: level → 001, GPA → 111, result → 10, giving the string 00111110
- Crossover: combining 00111110 and 10001101 (single-point crossover after the third bit) produces the offspring 10011110 and 00101101
Instance-Based Methods
- Instance-based learning: store the training examples and delay the processing (“lazy evaluation”) until a new instance must be classified.
- Typical approaches:
  – k-nearest neighbor: instances represented as points in a Euclidean space
  – locally weighted regression: constructs a local approximation
  – case-based reasoning: uses symbolic representations and knowledge-based inference
The k-Nearest Neighbor Algorithm
- All instances correspond to points in the n-D space.
- The nearest neighbors are defined in terms of Euclidean distance.
- The target function may be discrete- or real-valued.
- For discrete-valued targets, k-NN returns the most common value among the k training examples nearest to xq (see the sketch below).
- Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples.
[Figure: positive (+) and negative (-) training points around a query point xq, with the 1-NN decision regions.]
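A minimal k-NN sketch matching the description above (Euclidean distance, majority vote; the function name and the (point, label) representation are assumptions):

```python
from collections import Counter
from math import dist

def knn_classify(xq, training, k=3):
    """training: list of (point, label) pairs; return the most common label
    among the k training examples nearest (in Euclidean distance) to xq."""
    nearest = sorted(training, key=lambda ex: dist(xq, ex[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]
```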
Discussion on the k-NN Algorithm
- The k-NN algorithm for continuous-valued target functions: return the mean value of the k nearest neighbors.
- Distance-weighted nearest neighbor algorithm: weight the contribution of each of the k neighbors according to their distance to the query point xq, giving greater weight to closer neighbors, e.g.

  $$w = \frac{1}{d(x_q, x_i)^2}$$

- Similarly, we can distance-weight the instances for real-valued target functions.
- Robust to noisy data by averaging the k nearest neighbors.
- Curse of dimensionality: the distance between neighbors can be dominated by irrelevant attributes. To overcome it, stretch the axes or eliminate the least relevant attributes.
Locally Weighted Regression
- Construct an explicit approximation to f over a local region surrounding the query instance xq.
- Locally weighted linear regression: the target function f is approximated near xq using the linear function

  $$\hat{f}(x) = w_0 + w_1 a_1(x) + \cdots + w_n a_n(x)$$

- Minimize the squared error with a distance-decreasing weight K:

  $$E(x_q) = \frac{1}{2}\sum_{x \in k\text{ nearest nbrs of } x_q} \big(f(x) - \hat{f}(x)\big)^2\, K\!\big(d(x_q, x)\big)$$

- The gradient descent training rule:

  $$\Delta w_j = \eta \sum_{x \in k\text{ nearest nbrs of } x_q} K\!\big(d(x_q, x)\big)\,\big(f(x) - \hat{f}(x)\big)\, a_j(x)$$

- In most cases, the target function is approximated by a constant, linear, or quadratic function.
Case-Based Reasoning
- Similarity (to k-nearest neighbors and locally weighted regression): lazy evaluation + analyzing similar instances.
- Difference: instances are not “points in a Euclidean space”.
- Example: the water faucet problem in CADET (Sycara et al.’92).
- Methodology:
  – instances represented by rich symbolic descriptions (e.g., function graphs)
  – multiple retrieved cases may be combined
  – tight coupling between case retrieval, knowledge-based reasoning, and problem solving
- Research issues:
  – indexing based on a syntactic similarity measure and, on failure, backtracking and adapting to additional cases
Remarks on Lazy vs. Eager Learning
- Instance-based learning: lazy evaluation
- Decision-tree and Bayesian classification: eager evaluation
- Key differences:
  – a lazy method may consider the query instance xq when deciding how to generalize beyond the training data D
  – an eager method cannot, since it has already chosen its global approximation before seeing the query
- Efficiency: lazy methods spend less time training but more time predicting
- Accuracy:
  – lazy methods effectively use a richer hypothesis space, since they use many local linear functions to form an implicit global approximation to the target function
  – eager methods must commit to a single hypothesis that covers the entire instance space
Session 5. Classification and Prediction
- Introduction
- Decision Tree Induction
- Bayesian Classification
- Neural Networks
- Other Classification Methods
- Prediction Methods
Predictive Modeling in Databases
- Predictive modeling: predict data values or construct generalized linear models based on the database data.
- One can only predict value ranges or category distributions.
- Method outline:
  – minimal generalization
  – attribute relevance analysis
  – generalized linear model construction
  – prediction
- Determine the major factors that influence the prediction:
  – data relevance analysis: uncertainty measurement, entropy analysis, expert judgement, etc.
- Multi-level prediction: drill-down and roll-up analysis.
Regression Analysis and Log-Linear Models in Prediction
- Linear regression: Y = α + β X
  – the two parameters α and β specify the line and are estimated from the data at hand
  – use the least squares criterion on the known values Y1, Y2, …, X1, X2, … (see the sketch below)
- Multiple regression: Y = b0 + b1 X1 + b2 X2
  – many nonlinear functions can be transformed into the above
- Log-linear models:
  – the multi-way table of joint probabilities is approximated by a product of lower-order tables
  – probability: p(a, b, c, d) = αab βac χad δbcd
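A sketch of the least-squares estimates for Y = α + βX, using the standard closed-form formulas (the function name is assumed, not from the slides):

```python
def fit_line(xs, ys):
    """Least-squares estimates of alpha and beta for y = alpha + beta * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    beta = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
            / sum((x - mean_x) ** 2 for x in xs))
    alpha = mean_y - beta * mean_x
    return alpha, beta

print(fit_line([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8]))  # roughly (0.1, 1.93)
```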
Prediction: Numerical Data
Prediction: Categorical Data
Conclusions
- Classification is an extensively studied problem (mainly in statistics, machine learning and neural networks).
- Classification is probably one of the most widely used data mining techniques, with a lot of applications.
- Scalability is still an important issue for database applications.
- Combining classification with database techniques should be a promising research topic.
- Research direction: classification of non-relational data, e.g., text, spatial, multimedia, etc.
Copyright by Jiawei Han, modified