Transcript
Artificial Intelligence
Distance Measure
Dae-Won Kim
School of Computer Science & Engineering
Chung-Ang University
Data mining is the process of discovering unknown relationships between patterns. Thus, many data mining techniques are based on the similarity between patterns.
We need to define what we mean
by similar, so that we can calculate
formal similarity measures.
Similarity is obtained from vectors
of measurements (features)
describing each pattern.
Instead of talking about how similar
two objects are, we could talk about
how dissimilar they are.
Given one, we can easily define the other by applying a monotonically decreasing transformation.
e.g., given a similarity S(x,y):
Dissimilarity D(x,y) = 1 - S(x,y)
Dissimilarity D(x,y) = 1 / S(x,y)
The term distance is often used
informally to refer to a dissimilarity
measure.
The Euclidean distance:
$D(x,y) = \sqrt{\sum_{i=1}^{f} \left( x(i) - y(i) \right)^2}$
It is simple and effective, and it is a metric.
Q: What happens if a measure is not a metric?
A metric is a dissimilarity measure that satisfies three conditions:
1. $D(x,y) \geq 0$ and $D(x,x) = 0$
2. $D(x,y) = D(y,x)$ (symmetry)
3. $D(x,y) \leq D(x,k) + D(k,y)$ (triangle inequality)
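To make this concrete, here is a minimal Python/NumPy sketch (the function name and sample points are ours) that computes the Euclidean distance and spot-checks the three metric conditions:

import numpy as np

def euclidean(x, y):
    # Euclidean distance between two feature vectors.
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sqrt(np.sum((x - y) ** 2))

# Spot-check the three metric conditions on sample points.
x, y, k = np.array([0.0, 0.0]), np.array([3.0, 4.0]), np.array([1.0, 1.0])
assert euclidean(x, x) == 0.0                                # D(x,x) = 0
assert euclidean(x, y) == euclidean(y, x)                    # symmetry
assert euclidean(x, y) <= euclidean(x, k) + euclidean(k, y)  # triangle inequality
print(euclidean(x, y))  # 5.0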
However, the Euclidean distance has
some limitations.
Limit 1. What if the features were
measured using different units?
(e.g., length, weight)
We can make all features equally important by using normalization or standardization.
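For instance, z-score standardization rescales every feature to zero mean and unit variance; a minimal NumPy sketch (function name ours):

import numpy as np

def standardize(X):
    # Z-score standardization: each column (feature) gets mean 0 and
    # unit variance, so different measurement units become comparable.
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Length in cm and weight in kg live on very different scales.
X = np.array([[150.0, 50.0], [160.0, 60.0], [170.0, 70.0]])
print(standardize(X))  # each column now has mean 0 and unit variance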
Or, if we have some idea of the relative importance of each feature, we can weight them accordingly.
The weighted Euclidean distance:
$D(x,y) = \sqrt{\sum_{i=1}^{f} w(i) \left( x(i) - y(i) \right)^2}$
Limit 2. What if the shape of each class is not a hypersphere?
Two pairs of patterns in the same class can yield different distance values.
The Mahalanobis distance:
$D(x,y) = \sqrt{(x - y)^T \, \Sigma^{-1} \, (x - y)}$
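A minimal NumPy sketch, assuming the covariance matrix Σ is estimated from the data (with the identity matrix it reduces to the plain Euclidean distance):

import numpy as np

def mahalanobis(x, y, cov):
    # Euclidean distance after correcting for the covariance
    # (shape and orientation) of the data cloud.
    diff = np.asarray(x, float) - np.asarray(y, float)
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

X = np.array([[2.0, 2.1], [1.0, 1.2], [3.0, 2.9], [2.5, 2.4]])
cov = np.cov(X, rowvar=False)       # covariance estimated from the data
print(mahalanobis(X[0], X[1], cov))
print(mahalanobis(X[0], X[1], np.eye(2)))  # identity: plain Euclidean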
The Euclidean distance is generalized to the Minkowski distance ($L_\lambda$):
$D(x,y) = \left( \sum_{i=1}^{f} \left| x(i) - y(i) \right|^{\lambda} \right)^{1/\lambda}$
The Euclidean distance is the special case of $\lambda = 2$:
$D(x,y) = \left( \sum_{i=1}^{f} \left| x(i) - y(i) \right|^{2} \right)^{1/2}$
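A minimal sketch; λ = 1 gives the Manhattan distance, λ = 2 the Euclidean:

import numpy as np

def minkowski(x, y, lam):
    # Minkowski (L_lambda) distance between two feature vectors.
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sum(np.abs(x - y) ** lam) ** (1.0 / lam)

x, y = [0.0, 0.0], [3.0, 4.0]
print(minkowski(x, y, 1))  # 7.0 (Manhattan)
print(minkowski(x, y, 2))  # 5.0 (Euclidean)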
Sample (Pearson) correlation coefficient:
$D(x,y) = 1 - \frac{\sum_{i=1}^{f} \left( x(i) - \bar{x} \right)\left( y(i) - \bar{y} \right)}{\left( \sum_{i=1}^{f} \left( x(i) - \bar{x} \right)^2 \right)^{1/2} \left( \sum_{i=1}^{f} \left( y(i) - \bar{y} \right)^2 \right)^{1/2}}$
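A minimal sketch (function name ours); note that patterns with the same shape but different scale come out at distance 0:

import numpy as np

def correlation_distance(x, y):
    # 1 minus the sample (Pearson) correlation coefficient.
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    r = np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2))
    return 1.0 - r

print(correlation_distance([1, 2, 3], [2, 4, 6]))  # 0.0: same shape, scaled
print(correlation_distance([1, 2, 3], [3, 2, 1]))  # 2.0: anti-correlated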
The Cosine distance:
$D(x,y) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert} = \frac{\sum_{i=1}^{f} x(i)\, y(i)}{\left( \sum_{i=1}^{f} x(i)^2 \right)^{1/2} \left( \sum_{i=1}^{f} y(i)^2 \right)^{1/2}}$
Q1: Euclidean vs. Correlation
vs. Cosine distance
Q2: Which one of the three is used
in Google?
We are now discussing how to
compute the similarity/distance
between two categorical patterns.
For binary categorical data, we can count the number of features on which the two patterns take the same or different values.
Rather than measuring the
dissimilarities between patterns, we
often measure the similarities.
The simple matching coefficient:
$S(x,y) = \frac{n_{x:1,y:1} + n_{x:0,y:0}}{n_{x:1,y:1} + n_{x:1,y:0} + n_{x:0,y:1} + n_{x:0,y:0}}$
It may be inappropriate to include the (x:0, y:0) or (x:1, y:1) counts, depending on the meaning of 0 and 1.
The Jaccard coefficient:
$S(x,y) = \frac{n_{x:1,y:1}}{n_{x:1,y:1} + n_{x:1,y:0} + n_{x:0,y:1}}$
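A minimal sketch of both coefficients over binary feature vectors (names ours):

import numpy as np

def match_counts(x, y):
    # n11, n10, n01, n00: counts of feature positions by value pair.
    x, y = np.asarray(x), np.asarray(y)
    n11 = int(np.sum((x == 1) & (y == 1)))
    n10 = int(np.sum((x == 1) & (y == 0)))
    n01 = int(np.sum((x == 0) & (y == 1)))
    n00 = int(np.sum((x == 0) & (y == 0)))
    return n11, n10, n01, n00

def simple_matching(x, y):
    n11, n10, n01, n00 = match_counts(x, y)
    return (n11 + n00) / (n11 + n10 + n01 + n00)

def jaccard(x, y):
    n11, n10, n01, _ = match_counts(x, y)
    return n11 / (n11 + n10 + n01)  # 0-0 matches are ignored

x, y = [1, 0, 1, 1, 0], [1, 0, 0, 1, 0]
print(simple_matching(x, y))  # 0.8: 4 of 5 features agree
print(jaccard(x, y))          # 0.666...: 2 shared 1s out of 3 non-0-0 features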
How about categorical data in which
the features have more than two
categories?
The Hamming distance:
$D(x,y) = \sum_{i=1}^{f} h(x(i), y(i)), \qquad h(x(i), y(i)) = \begin{cases} 0 & \text{if } x(i) = y(i) \\ 1 & \text{otherwise} \end{cases}$
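A minimal sketch; it works for any categorical values, not just binary ones:

def hamming(x, y):
    # Number of features on which x and y disagree.
    return sum(1 for a, b in zip(x, y) if a != b)

print(hamming(['red', 'small', 'round'], ['red', 'large', 'round']))  # 1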
For example, what is your approach to mixed-type data (numeric and categorical features)? You need to develop a good distance measure for mixed data.
Approach 1. Preprocessing
Preprocess the mixed data into single-type data:
1. discretize numeric features, or
2. encode categorical features as numeric integer values.
However, this approach often leads to a loss of important information.
Approach 2. Mixed-data distance
Let us mix two distance measures:
D(x,y) = Euclidean(x,y) + Hamming(x,y)
What would the issues be? The two terms live on different scales, so one can dominate the sum. Hence:
D(x,y) = Normalized Euclidean(x,y) + Normalized Hamming(x,y)
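A minimal sketch of the normalized combination; the range-based scaling and the equal weighting of the two terms are our assumptions, not a fixed recipe:

import numpy as np

def mixed_distance(x_num, y_num, x_cat, y_cat, num_ranges):
    # Numeric part: Euclidean on range-normalized features, so each
    # term falls roughly into [0, 1] and neither part dominates.
    diff = (np.asarray(x_num, float) - np.asarray(y_num, float)) \
           / np.asarray(num_ranges, float)
    d_num = np.sqrt(np.mean(diff ** 2))
    # Categorical part: Hamming distance divided by feature count.
    d_cat = sum(a != b for a, b in zip(x_cat, y_cat)) / len(x_cat)
    return d_num + d_cat

# Hypothetical patterns: (height_cm, weight_kg) + (color, shape)
print(mixed_distance([170.0, 65.0], [160.0, 70.0],
                     ['red', 'round'], ['green', 'round'],
                     num_ranges=[50.0, 40.0]))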
Tip: HVDM (Heterogeneous Value Difference Metric) is a good starting point.
Two values are considered closer if they have more similar classifications (labels).
Suppose a feature color takes three values, red, green, and blue, when identifying whether an object is an apple. Red and green would be considered closer than red and blue, because the former two have similar correlations with the output class apple.
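A simplified sketch of the categorical part of that idea (a value-difference measure; the helper names and toy data are ours):

def value_difference(values, labels, a, b):
    # Two values are close if they co-occur with similar class labels:
    # sum over classes of |P(class|value=a) - P(class|value=b)|.
    classes = set(labels)
    def p(c, v):
        matched = [l for v2, l in zip(values, labels) if v2 == v]
        return matched.count(c) / len(matched) if matched else 0.0
    return sum(abs(p(c, a) - p(c, b)) for c in classes)

# Toy data: is the object an apple, given its color?
color = ['red', 'red', 'green', 'green', 'blue', 'blue']
apple = ['yes', 'yes', 'yes', 'no', 'no', 'no']
print(value_difference(color, apple, 'red', 'green'))  # 1.0: similar profiles
print(value_difference(color, apple, 'red', 'blue'))   # 2.0: opposite profiles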
Q: Distance measure in Project II-2
The first trial: 1-NN using the Euclidean distance for fixed-length data.
Two instances of the same character can show different signal shapes in time and magnitude.
Is the Euclidean distance working?
Two signals should be aligned and
stretched in each axis.
Of the algorithms you are familiar
with, which one is best to calculate
the distance between two patterns?
Dynamic Programming
Dynamic Time Warping
DTW is the most widely used algorithm for measuring the similarity between two signals that vary in time or speed, warping the signals non-linearly in the time dimension.
As expected, it is easy to implement
(dynamic programming).
DTW for 1-D time-series is simple.
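A minimal dynamic-programming sketch for 1-D signals (names ours). For the 3-D case, one common extension is to replace the point-wise |s_i − t_j| cost with the Euclidean distance between 3-D frames:

import numpy as np

def dtw(s, t):
    # D[i, j] = cost of the best warping path aligning s[:i] with t[:j].
    n, m = len(s), len(t)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(s[i - 1] - t[j - 1])
            # extend the cheapest of the three allowed warping steps
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

a = [0, 1, 2, 3, 2, 1, 0]
b = [0, 0, 1, 2, 3, 2, 1, 0]  # same shape, stretched in time
print(dtw(a, b))  # 0.0: DTW warps the stretch away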
Tip.
Project-II contains 3-D time-series. We are expecting good work from you.