Last lecture summary

Test data and Cross Validation
[Figure: training error vs. testing error as a function of model complexity]

Test set method
• Split the data set into a training and a test data set.
• Common ratio – 70:30.
• Train the algorithm on the training set, assess its performance on the test set.
• Disadvantages:
– It is simple, however it wastes data.
– The test-set estimate of performance has high variance.
(adopted from the Cross Validation tutorial by Andrew Moore, http://www.autonlab.org/tutorials/overfit.html)
• Stratified division – the same class proportions in the training and test sets.

Validation set
• Training error cannot be used as an indicator of the model's performance, due to overfitting.
• Training data set – train a range of models, or a given model with a range of values for its parameters.
• Compare them on independent data – the validation set.
– If the model design is iterated many times, some overfitting to the validation data can occur, so it may be necessary to keep aside a third set.
• Test set – the set on which the performance of the selected model is finally evaluated.

LOOCV (leave-one-out cross validation)
1. Choose one data point.
2. Remove it from the set.
3. Fit the remaining data points.
4. Note your error, using the removed data point as the test.
Repeat these steps for all points. When you are done, report the mean squared error (in the case of regression).

k-fold cross-validation
1. Randomly break the data into k partitions.
2. Remove one partition from the set.
3. Fit the remaining data points.
4. Note your error, using the removed partition as the test data set.
Repeat these steps for all partitions. When you are done, report the mean squared error (in the case of regression).

Selection and testing
• Complete procedure for algorithm selection and estimation of its quality:
1. Divide the data into Train and Test sets.
2. By cross validation on the Train set (Train/Validation splits), choose the algorithm.
3. Use this algorithm to construct a classifier on the whole Train set.
4. Estimate its quality on the Test set.

Model selection via CV
[Table: polynomial regression, degrees 1–6, listing MSE_train and MSE_10-fold for each degree; the degree with the lowest MSE_10-fold is chosen. Adopted from the Cross Validation tutorial by Andrew Moore, http://www.autonlab.org/tutorials/overfit.html]

Nearest Neighbors Classification

Similarity and distance between instances
• Similarity s_ij is a quantity that reflects the strength of the relationship between two objects or two features.
• Distance d_ij measures dissimilarity.
– Dissimilarity measures the discrepancy between two objects based on several features.
– A distance satisfies the following conditions:
• distance is always positive or zero (d_ij ≥ 0),
• distance is zero if and only if it is measured from an object to itself (d_ii = 0),
• distance is symmetric (d_ij = d_ji).
– In addition, if a distance satisfies the triangle inequality (d_ik ≤ d_ij + d_jk), it is called a metric.

Distances for quantitative variables
• Minkowski distance ($L_p$ norm):
$d(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$
• Distance matrix – the matrix of all pairwise distances, e.g.:

      p1     p2     p3     p4
p1    0      2.828  3.162  5.099
p2    2.828  0      1.414  3.162
p3    3.162  1.414  0      2
p4    5.099  3.162  2      0

• Manhattan distance ($L_1$):
$d(x, y) = \sum_{i=1}^{n} |x_i - y_i|$
• Euclidean distance ($L_2$):
$d(x, y) = \sqrt{ \sum_{i=1}^{n} (x_i - y_i)^2 }$

k-NN
• Supervised learning.
• The target function f may be
– discrete-valued (classification),
– real-valued (regression).
• A query point is assigned to the class of the training instance(s) most similar to it.
• k-NN is a lazy learner.
• Lazy learning – generalization beyond the training data is delayed until a query is made to the system,
– as opposed to eager learning, where the system tries to generalize from the training data before receiving queries.
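As an illustration (not part of the original slides), the sketch below recomputes the example distance matrix with SciPy and implements the lazy k-NN vote in a few lines of NumPy. The point coordinates are one set consistent with the matrix above; the labels and the query point are made up.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Coordinates consistent with the example distance matrix above (an assumption)
pts = np.array([[0., 2.], [2., 0.], [3., 1.], [5., 1.]])   # p1..p4
print(np.round(squareform(pdist(pts, metric='euclidean')), 3))
# reproduces the table: d(p1,p2)=2.828, d(p2,p3)=1.414, d(p3,p4)=2.0, ...

def knn_predict(X_train, y_train, query, k=3):
    """Lazy learner: no training step, just store the data and vote at query time."""
    dists = np.linalg.norm(X_train - query, axis=1)   # Euclidean (L2) distances
    nearest = np.argsort(dists)[:k]                   # indices of the k closest points
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                  # majority vote

y = np.array([0, 0, 1, 1])                            # made-up labels for p1..p4
print(knn_predict(pts, y, np.array([2.5, 0.5]), k=3)) # -> 1
```

Replacing `metric='euclidean'` with `'cityblock'`, or `'minkowski'` with a chosen `p`, gives the other $L_p$ distances defined above.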
Which k is best?
• k = 1 – fits noise and outliers: overfitting.
• k = 15 – a value that is not too small; too large a k smooths out distinctive behavior.
[Figure: 1-NN vs. 15-NN decision boundaries; Hastie et al., Elements of Statistical Learning]

Real-valued target function
• The algorithm calculates the mean value of the k nearest training examples.
• Example with k = 3: neighbors with values 12, 14 and 10 give the prediction (12 + 14 + 10)/3 = 12.

Distance-weighted NN
• Give greater weight to closer neighbors.
• Example with k = 4: two neighbors of one class at distances 1 and 2, two neighbors of the other class at distances 4 and 5.
– Unweighted: 2 votes vs. 2 votes – a tie.
– Weighted by 1/d²: 1/1² + 1/2² = 1.25 votes vs. 1/4² + 1/5² ≈ 0.102 votes – the closer class wins.

k-NN issues
• The curse of dimensionality is a problem.
• Significant computation may be required to process each new query.
• To find the nearest neighbors, one has to evaluate the full distance matrix.
• Efficient indexing of the stored training examples helps – e.g. a kd-tree.

Cluster Analysis
• We have data, we don't know the classes.
• Assign data objects into groups (called clusters) so that data objects from the same cluster are more similar to each other than objects from different clusters.

Stages of the clustering process
[Figure: stages of the clustering process; On clustering validation techniques, M. Halkidi, Y. Batistakis, M. Vazirgiannis]

How would you solve the problem?
• How to find clusters?
• Group together the most similar patterns.

Single linkage (nearest-neighbor method)
(based on A Tutorial on Clustering Algorithms, http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/hierarchical.html)

• Distance matrix between six Italian cities:

      BA    FL    MI    NA    RM    TO
BA    0     662   877   255   412   996
FL    662   0     295   468   268   400
MI    877   295   0     754   564   138
NA    255   468   754   0     219   869
RM    412   268   564   219   0     669
TO    996   400   138   869   669   0

• The closest pair is MI and TO (138), so they are merged first. Under single linkage, the distance from the new cluster MI/TO to any other city is the minimum of the two original distances:

        BA    FL    MI/TO  NA    RM
BA      0     662   877    255   412
FL      662   0     295    468   268
MI/TO   877   295   0      754   564
NA      255   468   754    0     219
RM      412   268   564    219   0

• Next, NA and RM merge (219):

        BA    FL    MI/TO  NA/RM
BA      0     662   877    255
FL      662   0     295    268
MI/TO   877   295   0      564
NA/RM   255   268   564    0

• Then BA joins NA/RM (255):

            BA/NA/RM  FL    MI/TO
BA/NA/RM    0         268   564
FL          268       0     295
MI/TO       564       295   0

• Then FL joins (268):

               BA/FL/NA/RM  MI/TO
BA/FL/NA/RM    0            295
MI/TO          295          0

• Finally, the two remaining clusters merge (295).

Dendrogram
• Torino + Milano (138); Rome + Naples (219), then + Bari (255), then + Florence (268); Torino–Milano joins Rome–Naples–Bari–Florence at 295.
[Figure: single-linkage dendrogram, leaves TO, MI, FL, RM, NA, BA; dissimilarity axis with merge heights 138, 219, 255, 268, 295]
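The walk-through above can be reproduced with SciPy's hierarchical-clustering routines. This is a minimal sketch (assuming scipy and numpy are available), with the city matrix copied from the example:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

cities = ['BA', 'FL', 'MI', 'NA', 'RM', 'TO']
D = np.array([                       # the distance matrix from the example above
    [  0, 662, 877, 255, 412, 996],
    [662,   0, 295, 468, 268, 400],
    [877, 295,   0, 754, 564, 138],
    [255, 468, 754,   0, 219, 869],
    [412, 268, 564, 219,   0, 669],
    [996, 400, 138, 869, 669,   0]], dtype=float)

# linkage() expects the condensed (upper-triangular) form of the matrix
Z = linkage(squareform(D), method='single')
print(Z[:, 2])   # merge heights: 138, 219, 255, 268, 295 (as in the walk-through)

# dendrogram(Z, labels=cities) would draw the tree shown above (needs matplotlib)
```

Switching `method='single'` to `'complete'` or `'average'` reproduces the linkages discussed in the next sections.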
Complete linkage (furthest-neighbor method)
• The first merge is the same: MI and TO (138). Under complete linkage, however, the distance from MI/TO to any other city is the maximum of the two original distances:

        BA    FL    MI/TO  NA    RM
BA      0     662   996    255   412
FL      662   0     400    468   268
MI/TO   996   400   0      869   669
NA      255   468   869    0     219
RM      412   268   669    219   0

• Next, NA and RM merge (219):

        BA    FL    MI/TO  NA/RM
BA      0     662   996    412
FL      662   0     400    468
MI/TO   996   400   0      869
NA/RM   412   468   869    0

• Then FL joins MI/TO (400):

            BA    MI/TO/FL  NA/RM
BA          0     996       412
MI/TO/FL    996   0         869
NA/RM       412   869       0

• Then BA joins NA/RM (412), and the final merge happens at 996.
[Figure: complete-linkage dendrogram (leaves BA, NA, RM, FL, MI, TO) compared with the single-linkage dendrogram (leaves TO, MI, FL, RM, NA, BA)]

Average linkage (average-link method)
• The distance between two clusters is the average of the pairwise distances between their members, e.g. d(MI/TO, BA) = (996 + 877)/2 = 936.5.

Centroid linkage
• A cluster is represented by its centroid, and distances are measured between centroids; e.g. the distance from the MI/TO centroid to BA comes out as 895.

Summary – how to measure cluster similarity?
• single linkage (MIN)
• complete linkage (MAX)
• average linkage
• centroids

Ward's linkage (method)
• In Ward's method, metrics are not used and do not have to be chosen. Instead, sums of squares (i.e. squared Euclidean distances) between centroids of clusters are computed.
• Ward's method says that the distance between two clusters, A and B, is how much the sum of squares will increase when we merge them.
• At the beginning of clustering, the sum of squares starts out at zero (because every point is in its own cluster) and then grows as we merge clusters.
• Ward's method keeps this growth as small as possible.

Types of clustering
• hierarchical – groups data with a sequence of nested partitions
– agglomerative – bottom-up: start with each data point as one cluster and join clusters until all points form one cluster.
– divisive – top-down: initially all objects are in one cluster, then the cluster is subdivided into smaller and smaller pieces.
• partitional – divides the data points into some prespecified number of clusters without the hierarchical structure, i.e. divides the space.

Hierarchical clustering
• Agglomerative methods are used more widely.
• Divisive methods need to consider 2^(N−1) − 1 possible divisions into subsets, which is very computationally intensive.
– computational difficulty of finding the optimum partitions
• Divisive methods are better at finding large clusters than agglomerative methods.

Hierarchical clustering – disadvantages
• High computational complexity – at least O(N²), since all mutual distances need to be calculated.
• Inability to adjust once a split or merge is performed – there is no undo. (See the sketch below.)
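To make the two disadvantages concrete, here is a from-scratch sketch of agglomerative single-linkage clustering (illustrative code, not from the lecture; the function name and the 3×3 example matrix are made up). It operates on the full O(N²) distance matrix, and each merge irreversibly collapses a row and column:

```python
import numpy as np

def naive_single_linkage(D, labels):
    """Naive agglomerative clustering driven by a full distance matrix.
    Needs all O(N^2) mutual distances, and every merge is final (no undo)."""
    D = D.astype(float).copy()
    np.fill_diagonal(D, np.inf)                 # ignore self-distances
    clusters = [[name] for name in labels]
    merges = []
    while len(clusters) > 1:
        i, j = np.unravel_index(np.argmin(D), D.shape)   # closest pair of clusters
        i, j = min(i, j), max(i, j)
        merges.append((clusters[i], clusters[j], D[i, j]))
        row = np.minimum(D[i], D[j])            # single linkage: keep the minima
        D[i], D[:, i] = row, row
        D[i, i] = np.inf
        D = np.delete(np.delete(D, j, axis=0), j, axis=1)  # drop merged row/column
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges

# Tiny made-up example (symmetric matrix, zero diagonal)
D = np.array([[0, 2, 6],
              [2, 0, 5],
              [6, 5, 0]], dtype=float)
for a, b, dist in naive_single_linkage(D, ['A', 'B', 'C']):
    print(a, '+', b, 'at', dist)   # A+B at 2.0, then A/B+C at 5.0
```

Replacing `np.minimum` with `np.maximum` (or a weighted mean) turns this into complete (or average) linkage.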
k-means
• How to avoid computing all mutual distances?
• Calculate distances only from representatives (centroids) of the clusters.
• Advantage: the number of centroids is much lower than the number of data points.
• Disadvantage: the number of centroids k must be given in advance.

k-means – the kids' algorithm
• Once there was a land with N houses.
• One day, K kings arrived in this land.
• Each house was taken by the nearest king.
• But the community wanted its king to be at the center of the village, so the throne was moved there.
• Then the kings realized that some houses were closer to them now, so they took those houses, but they lost some others. This went on and on…
• Until one day they couldn't move anymore, so they settled down and lived happily ever after in their villages.

k-means – the adults' algorithm
• Decide on the number of clusters k.
• Randomly initialize k centroids.
• Repeat until convergence (the centroids do not move):
– assign each point to the cluster represented by the nearest centroid,
– move each centroid to the mean of all points in its cluster.

k-means applet: http://www.kovan.ceng.metu.edu.tr/~maya/kmeans/index.html

k-means – disadvantages
• k must be determined in advance.
• Sensitive to initial conditions. The algorithm minimizes the following "energy" function, but may be trapped in local minima:
$E = \sum_{l=1}^{K} \sum_{x_i \in X_l} \| x_i - \mu_l \|^2$
• Applicable only when the mean is defined – what about categorical data? E.g. replace the mean with the mode (k-modes).
• The arithmetic mean is not robust to outliers (use the median instead – k-medoids).
• Clusters are spherical, because the algorithm is based on distance.
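The adults' algorithm maps directly onto a few lines of NumPy. This is an illustrative sketch (the function name `kmeans` and the toy blob data are made up), not the lecture's reference implementation:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate the assignment and update steps."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # random initialization
    for _ in range(n_iter):
        # assignment step: each point goes to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # update step: move each centroid to the mean of its points
        # (keep the old centroid if a cluster ends up empty)
        new = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):     # converged: centroids stopped moving
            break
        centroids = new
    return labels, centroids

# Made-up toy data: two well-separated blobs in 2D
X = np.array([[0, 0], [0, 1], [1, 0], [8, 8], [8, 9], [9, 8]], dtype=float)
labels, centers = kmeans(X, k=2)
print(labels, centers, sep='\n')
```

Because the energy E has local minima, in practice one runs the algorithm from several random initializations (different `seed` values) and keeps the solution with the lowest E.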