kNN: A Non-parametric Classification and Prediction Technique

Goals of this set of transparencies:
1. Introduce kNN, a popular non-parametric technique
2. Illustrate the differences between parametric and non-parametric techniques
Later:
1. Non-parametric density estimation techniques
2. Editing and condensing techniques to enhance kNN

Classification and Decision Boundaries
Classification can be viewed as "learning good decision boundaries" that separate the examples belonging to different classes in a data set.
(Figure: examples of two classes separated by a decision boundary)

Problems with Parametric Techniques
Parametric approaches assume that the type of model is known beforehand, which is not realistic for many applications. The types of models used by parametric approaches are "kind of simplistic": if the characteristics of the data do not match the assumptions of the underlying model, unreliable predictions are obtained.
Non-parametric approaches rely on a few key ideas:
– "Let the data speak for themselves"
– "Predict new cases based on similar cases"
– "Use multiple local models instead of a single global model"

Instance-Based Classifiers
• Store the training records
• Use the training records directly to predict the class label of unseen cases
(Figure: a table of stored cases with attributes Atr1 ... AtrN and a class label, plus an unseen case whose class is to be predicted)

Instance-based classifiers do not create a model but use the training examples directly to classify unseen examples ("lazy" classifiers). Examples:
– Rote-learner: memorizes the entire training data and performs classification only if the attributes of a record match one of the training examples exactly
– Nearest neighbor: uses the k "closest" points (nearest neighbors) to perform classification

kNN: k-Nearest-Neighbor Classifiers
Requires three things:
– The set of stored records
– A distance metric to compute the distance between records
– The value of k, the number of nearest neighbors to retrieve
To classify an unknown record:
– Compute its distance to the training records
– Identify the k nearest neighbors
– Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote)
A minimal code sketch of this procedure follows the Voronoi discussion below.

Definition of Nearest Neighbor
(Figure: the (a) 1-nearest, (b) 2-nearest, and (c) 3-nearest neighbors of a query point x)
The k nearest neighbors of a record x are the data points that have the k smallest distances to x.

Voronoi Diagrams for NN-Classifiers
Each cell contains one sample, and every location within the cell is closer to that sample than to any other sample; a Voronoi diagram divides the space into such cells. Every query point is assigned the classification of the sample in its cell. The decision boundary separates the class regions based on the 1-NN decision rule, and knowledge of this boundary is sufficient to classify new points.
Remarks: Voronoi diagrams can be computed in low-dimensional spaces, but doing so is infeasible in higher-dimensional spaces. They also represent models for the clusters that have been generated by representative-based clustering algorithms.
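The classification procedure outlined above can be made concrete with a short example. The following is a minimal sketch, assuming plain Euclidean distance over numeric attributes and a simple majority vote; the function names (euclidean, knn_classify) and the toy data are illustrative, not taken from the transparencies.

```python
# Minimal kNN classification sketch (illustrative only).
import math
from collections import Counter

def euclidean(p, q):
    # Distance between two numeric records of equal length
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

def knn_classify(training, labels, x, k=3):
    # 1. Compute the distance from x to every stored training record
    dists = [(euclidean(rec, x), lab) for rec, lab in zip(training, labels)]
    # 2. Identify the k nearest neighbors
    neighbors = sorted(dists, key=lambda t: t[0])[:k]
    # 3. Take a majority vote over the neighbors' class labels
    votes = Counter(lab for _, lab in neighbors)
    return votes.most_common(1)[0][0]

# Example: two classes ("A", "B") in a 2-D attribute space
X = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.9)]
y = ["A", "A", "B", "B"]
print(knn_classify(X, y, (1.1, 0.9), k=3))   # -> "A"
```

Note that every query scans and sorts all stored records, which is why classifying unknown records with a lazy learner is relatively expensive.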
Nearest Neighbor Classification
Compute the distance between two points, e.g., the Euclidean distance
$d(p, q) = \sqrt{\sum_i (p_i - q_i)^2}$.
Determine the class from the nearest-neighbor list:
– take the majority vote of the class labels among the k nearest neighbors, or
– weigh each vote according to distance, e.g., with weight factor $w = 1/d^2$

Choosing the value of k:
– If k is too small, the classifier is sensitive to noise points
– If k is too large, the neighborhood may include points from other classes

k-NN: More Complex Decision Boundaries
(Figure: decision boundaries produced by a k-NN classifier)

Scaling Issues
– Attributes may have to be scaled to prevent the distance measure from being dominated by one of the attributes
– Example:
  the height of a person may vary from 1.5 m to 1.8 m
  the weight of a person may vary from 90 lb to 300 lb
  the income of a person may vary from $10K to $1M

Summary: Nearest Neighbor Classifiers
k-NN classifiers are lazy learners:
– Unlike eager learners such as decision tree induction and rule-based systems, they do not build models explicitly
– Classifying unknown records is relatively expensive
They rely on local knowledge ("let the data speak for themselves") rather than on global models to make decisions.
k-NN classifiers rely on a distance function; the quality of the distance function is critical for the performance of a k-NN classifier.
They are capable of creating quite complex decision boundaries, which consist of edges of the Voronoi diagram; however, in contrast to decision trees and SVMs, k-NN classifiers never actually compute these boundaries.
k-NN classifiers obtain high accuracies and are quite popular in some fields, such as text data mining and information retrieval in general.
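To complement the remarks above on distance-weighted voting and attribute scaling, here is a small sketch, assuming min-max scaling of each attribute to [0, 1] and the weight factor w = 1/d^2 from the slides; the helper names (min_max_scale, weighted_knn) and the toy records are illustrative.

```python
# Sketch of min-max scaling plus distance-weighted kNN voting (illustrative only).
import math
from collections import defaultdict

def min_max_scale(records):
    # Rescale every attribute to [0, 1] so that, e.g., income in dollars does not
    # dominate height in meters when computing Euclidean distances.
    lo = [min(col) for col in zip(*records)]
    hi = [max(col) for col in zip(*records)]
    scaled = [tuple((v - l) / (h - l) if h > l else 0.0
                    for v, l, h in zip(rec, lo, hi)) for rec in records]
    return scaled, lo, hi

def weighted_knn(training, labels, x, k=3, eps=1e-9):
    # Each of the k nearest neighbors votes with weight 1/d^2 instead of one vote each.
    dists = sorted((math.dist(rec, x), lab) for rec, lab in zip(training, labels))
    scores = defaultdict(float)
    for d, lab in dists[:k]:
        scores[lab] += 1.0 / (d * d + eps)   # eps guards against d == 0
    return max(scores, key=scores.get)

# Usage: scale the training records, then apply the same scaling to a query record.
records = [(1.6, 120.0, 30000.0), (1.8, 250.0, 950000.0), (1.5, 95.0, 12000.0)]
labels = ["A", "B", "A"]
scaled, lo, hi = min_max_scale(records)
query = tuple((v - l) / (h - l) if h > l else 0.0
              for v, l, h in zip((1.55, 110.0, 20000.0), lo, hi))
print(weighted_knn(scaled, labels, query, k=2))   # -> "A"
```

Applying the same lo/hi ranges to the query record keeps training and query data in a common scale, which is the point of the scaling remark in the slides.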