NEAREST NEIGHBORS
Class 16, CSC 600: Data Mining

Today:
• Measures of similarity
• Distance measures
• Nearest neighbors

Similarity and Dissimilarity Measures
Used by a number of data mining techniques:
• Nearest neighbors
• Clustering
• ...

How do we measure "proximity"?
• Proximity: the similarity or dissimilarity between two objects.
• Similarity: a numerical measure of the degree to which two objects are alike, usually in the range [0, 1], where 0 = no similarity and 1 = complete similarity.
• Dissimilarity: a numerical measure of the degree to which two objects differ.

Feature Space
• An abstract n-dimensional space in which each instance is plotted, with one axis for each descriptive feature.
• Difficult to visualize when the number of features is greater than 3.
• As the differences between the values of the descriptive features grow, so too does the distance between the points in the feature space that represent these instances.

Distance Metric
A distance metric is a function metric(a, b) that returns the distance between two instances a and b. Mathematically, it must satisfy four criteria:
1. Non-negativity: metric(a, b) ≥ 0
2. Identity: metric(a, b) = 0 ⟺ a = b
3. Symmetry: metric(a, b) = metric(b, a)
4. Triangle inequality: metric(a, c) ≤ metric(a, b) + metric(b, c)

Dissimilarities between Data Objects
A common measure of the proximity between two data objects x and y with n dimensions is the Euclidean distance:

    d(x, y) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2}

In high school we typically used this to compute the distance between two objects when there were only two dimensions, but it is defined for one, two, three, ..., or any n-dimensional space.

Example:

    d(id_{12}, id_5) = \sqrt{(5.00 - 2.75)^2 + (2.50 - 7.50)^2} = \sqrt{30.0625} = 5.4829

Euclidean distance is typically the first choice when applying nearest neighbors and clustering. Other distance metrics exist; they are generalized by the Minkowski distance metric, where r is a parameter:

    d(x, y) = \left( \sum_{k=1}^{n} |x_k - y_k|^r \right)^{1/r}

Minkowski Distance Metric
• r = 1: the L1 norm, the "Manhattan" or "taxicab" distance.
• r = 2: the L2 norm, the Euclidean distance.
• The larger the value of r, the more emphasis is placed on features with large differences in values, because these differences are raised to the power of r.

Distance Matrix
Once a distance metric is chosen, the proximity between all of the objects in the dataset can be computed and represented in a distance matrix of pairwise distances between points (a code sketch reproducing the matrices below appears at the end of this section).

Figure: four points, p1 through p4, plotted in a two-dimensional feature space.

L1 norm ("Manhattan") distances:

    L1    p1    p2    p3    p4
    p1    0     4     4     6
    p2    4     0     2     4
    p3    4     2     0     2
    p4    6     4     2     0

L2 norm (Euclidean) distances:

    L2    p1       p2       p3       p4
    p1    0        2.828    3.162    5.099
    p2    2.828    0        1.414    3.162
    p3    3.162    1.414    0        2
    p4    5.099    3.162    2        0

Using Weights
• So far, all attributes have been treated equally when computing proximity.
• In some situations, some features are more important than others; that decision is up to the analyst.
• The Minkowski distance definition can be modified to include weights:

    d(x, y) = \left( \sum_{k=1}^{n} w_k \, |x_k - y_k|^r \right)^{1/r}

Eager Learner Models
So far in this course, we have performed prediction by:
1. Downloading or constructing a dataset
2. Learning a model
3. Using the model to classify/predict test instances
These are sometimes called eager learners: they are designed to learn a model that maps the input attributes to the class label as soon as training data becomes available.
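The following is a minimal Python sketch of the Minkowski distance and the distance-matrix computation above. The point coordinates are not printed on the slide; p1 = (0, 2), p2 = (2, 0), p3 = (3, 1), p4 = (5, 1) are inferred from the matrices, and the function names are illustrative.

```python
import numpy as np

def minkowski(x, y, r=2):
    """Minkowski distance; r=1 gives the L1 (Manhattan) distance,
    r=2 the L2 (Euclidean) distance."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.sum(np.abs(x - y) ** r) ** (1.0 / r))

def weighted_minkowski(x, y, w, r=2):
    """Weighted variant: each per-feature difference is scaled by w_k."""
    x, y, w = (np.asarray(a, dtype=float) for a in (x, y, w))
    return float(np.sum(w * np.abs(x - y) ** r) ** (1.0 / r))

# Coordinates consistent with the slide's distance matrices (inferred).
points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}

def distance_matrix(points, r):
    """Pairwise distance matrix for a dict of named points."""
    names = list(points)
    return [[round(minkowski(points[a], points[b], r), 3) for b in names]
            for a in names]

print(distance_matrix(points, r=1))  # reproduces the L1 matrix above
print(distance_matrix(points, r=2))  # reproduces the L2 matrix above
```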
Lazy Learner Models
• The opposite strategy: delay the process of modeling the training data until it is necessary to classify/predict a test instance.
• Example: nearest neighbors.

Nearest Neighbors
• k-nearest neighbors: k is a parameter chosen by the analyst.
• For a given test instance, use the k "closest" points (the nearest neighbors) to perform classification.
• The "closest" points are defined by some proximity metric, such as Euclidean distance.

k-nearest neighbors requires three things:
1. The set of stored records
2. A distance metric to compute the distance between records
3. The value of k, the number of nearest neighbors to retrieve

To classify an unknown record (a minimal code sketch appears at the end of this section):
1. Compute its distance to the other training records
2. Identify the k nearest neighbors
3. Use the class labels of the nearest neighbors to determine the class label of the unknown record (e.g., by taking a majority vote)

Definition of Nearest Neighbor
• The k-nearest neighbors of a given example x are the k points that are closest to x.
• The classification changes depending on the chosen k; the neighbors' labels are combined by majority voting.
• Tie scenario: randomly choose a classification? For binary problems, an odd k is usually used to avoid ties.

Figure: the 1-, 2-, and 3-nearest neighbors of an unknown record x.

Voronoi Tessellations and Decision Boundaries
• When k-NN searches for the nearest neighbor, it partitions the abstract feature space into a Voronoi tessellation.
• Each region belongs to one instance and contains all the points in the space whose distance to that instance is less than their distance to any other instance.
• Decision boundary: the boundary between regions of the feature space in which different target levels will be predicted. It is generated by aggregating the neighboring Voronoi regions that make the same prediction.
• One of the great things about nearest neighbor algorithms is that we can add new data and update the model very easily.

Figure: (a) the Voronoi tessellation of the feature space for the dataset in Table 2 [25], with the position of the query represented by the ? marker; (b) the decision boundary (k = 1) created by aggregating the neighboring Voronoi regions that belong to the same target level.

A Worked Example
• What's up with the top-right instance? Is it noise?
• The decision boundary is likely not ideal because of id13.
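Here is a minimal sketch of the three-step classification procedure described above, assuming Euclidean distance and plain majority voting; the function name knn_classify and the toy data are illustrative, not from the lecture.

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, query, k=3):
    """Classify `query` by majority vote among its k nearest training
    instances (Euclidean distance). No model is built up-front: all of
    the work happens when the test instance arrives (lazy learning)."""
    X_train = np.asarray(X_train, dtype=float)
    dists = np.linalg.norm(X_train - np.asarray(query, dtype=float), axis=1)
    nearest = np.argsort(dists)[:k]                # indices of the k closest records
    votes = Counter(np.asarray(y_train)[nearest])  # class labels of those neighbors
    return votes.most_common(1)[0][0]              # majority class label

# Purely illustrative toy data:
X = [[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 9.0]]
y = ["A", "A", "B", "B"]
print(knn_classify(X, y, query=[1.2, 1.9], k=3))   # -> "A"
```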
Noisy Data
• k-NN is a set of local models, each defined by a single instance, so it is sensitive to noise.
• How can noise be mitigated? Choose a higher value of k.
• Which value of k is ideal?

Different Values of k
Figure: decision boundaries using majority classification of the nearest k neighbors for k = 1, 3, 5, and 15, compared with the ideal decision boundary.
• Setting k to a high value is riskier with an imbalanced dataset: the majority target level begins to dominate the feature space.
• Choose k by running evaluation experiments on a training or validation set.

Choosing the Right k
• If k is too small, the classifier is sensitive to noise points in the training data and susceptible to overfitting.
• If k is too large, the neighborhood may include points from other classes, and misclassification becomes more likely.

Computational Issues
• Computation can be costly if the number of training examples is large.
• Efficient indexing techniques are available to reduce the amount of computation needed to find the nearest neighbors of a test example.
• Sorting training instances?

Majority Voting
Every neighbor has the same impact on the classification:

    y' = \arg\max_v \sum_{(x_i, y_i) \in D_z} I(v = y_i)

Distance-Weighted Voting
Far-away neighbors have a weaker impact on the classification (a code sketch of this, together with normalization, appears at the end of this section):

    y' = \arg\max_v \sum_{(x_i, y_i) \in D_z} w_i \cdot I(v = y_i)

Scaling Issues
Attributes may have to be scaled to prevent distance measures from being dominated by one of the attributes. For example, with three dimensions:
• the height of a person may vary from 1.5 m to 1.8 m
• the weight of a person may vary from 90 lb to 300 lb
• the income of a person may vary from $10K to $1M
Income will dominate if these variables aren't standardized.

Standardization
Treat all features "equally" so that one feature doesn't dominate the others. Common treatments applied to every variable:
• Standardize each variable: mean = 0, standard deviation = 1
• Normalize each variable: min = 0, max = 1
Query / test instance to classify: Salary = 56000, Age = 35.

Final Thoughts on Nearest Neighbors
• Nearest-neighbors classification is part of a more general technique called instance-based learning, which uses specific instances for prediction rather than a model.
• Nearest neighbors is a lazy learner: no model is learned up-front, so performing the classification can be relatively computationally expensive.

Classifier Comparison
Eager learners (decision trees, SVMs):
• Model building: potentially slow
• Classifying a test instance: fast
• Find a global model that fits the entire input space
Lazy learners (nearest neighbors):
• Model building: fast (because there is none!)
• Classifying a test instance: slow
• Classification decisions are made locally (small k values) and are more susceptible to noise
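Below is a short sketch tying together the normalization and distance-weighted voting ideas above. The helper names and the (salary, age) training data are hypothetical, and weighting votes by 1/d² is just one common choice for w_i.

```python
import numpy as np

def range_normalize(X):
    """Rescale each feature column to [0, 1] so a large-range feature
    (e.g., salary) cannot dominate a small-range one (e.g., age)."""
    X = np.asarray(X, dtype=float)
    mins, maxs = X.min(axis=0), X.max(axis=0)
    return (X - mins) / (maxs - mins), mins, maxs

def weighted_vote_classify(X_train, y_train, query, k=3):
    """Distance-weighted voting: each of the k nearest neighbors votes
    with weight 1/d^2, so closer neighbors count for more."""
    X_train = np.asarray(X_train, dtype=float)
    dists = np.linalg.norm(X_train - np.asarray(query, dtype=float), axis=1)
    scores = {}
    for i in np.argsort(dists)[:k]:
        w = 1.0 / (dists[i] ** 2 + 1e-12)   # epsilon guards against d = 0
        scores[y_train[i]] = scores.get(y_train[i], 0.0) + w
    return max(scores, key=scores.get)

# Hypothetical (salary, age) training data and the slide's query instance.
X = [[30000, 22], [48000, 30], [60000, 41], [95000, 52]]
y = ["no", "no", "yes", "yes"]
Xn, mins, maxs = range_normalize(X)
query = (np.array([56000.0, 35.0]) - mins) / (maxs - mins)  # normalize with the training min/max
print(weighted_vote_classify(Xn, y, query, k=3))
```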
References
• Fundamentals of Machine Learning for Predictive Data Analytics, 1st edition, Kelleher et al.
• Data Science from Scratch, 1st edition, Grus
• Data Mining and Business Analytics in R, 1st edition, Ledolter
• An Introduction to Statistical Learning, 1st edition, James et al.
• Discovering Knowledge in Data, 2nd edition, Larose et al.
• Introduction to Data Mining, 1st edition, Tan et al.