Brief discussion on DSMGA-II, Optimization, and Clustering
Peng Chun-Jen
Taiwan Evolutionary Intelligence Laboratory
2016/09/21 Group Meeting Presentation

Outline
• Brief on Optimization
• Mask order selection of DSMGA-II
  – MIMIC: a distribution approach of MBGA
  – Kernel K-means: a clustering approach
• Feature Selection and Feature Transformation
  – Selection methods: filtering and wrapping
  – Transformation methods: PCA and ICA

Disclaimer
• I'm sharing things I wish I knew when I was a noob
• Mostly qualitative description, since I'm still a noob
• This is a "subjective" review of the Udacity ML lessons

Semi-supervised vs. Unsupervised
• Definitions by Wiki
  – Semi-supervised learning is a class of supervised learning tasks and techniques that also make use of unlabeled data for training – typically a small amount of labeled data with a large amount of unlabeled data.
  – Unsupervised learning is the ML task of inferring a function to describe hidden structure from unlabeled data.
Reference: https://upload.wikimedia.org/wikipedia/commons/d/d0/Example_of_unlabeled_data_in_semisupervised_learning.png
Reference: http://insidebigdata.com/wp-content/uploads/2014/11/customer_segmentation_vert.jpg

(Semi-)Unsupervised Learning
• Definition by Wiki
  – Unsupervised learning is the ML task of inferring a function to describe hidden structure from unlabeled data.
• Optimization: GA!!
• Clustering: MBGA??
• Blind signal separation: MBGA??
• Anomaly detection, NN, EM, ...

Optimization
• Small input space: generate and test
• Derivative available: calculus, Newton's method
• Real-world problems:
  – Large input space
  – No derivative
  – Complex function
  – Possibly multiple local optima
Reference: https://www.rmiguides.com/international/_images/slideshows/everest/5.jpg

Optimization
• Definition: find $x_{\max} \in X$ such that $f(x_{\max}) = \max_{x \in X} f(x)$
• Hill Climbing
  – Gets stuck at a local maximum
• Random Restart
  – Might not place any starting point on the hill containing the global maximum
Reference: https://classconnection.s3.amazonaws.com/608/flashcards/1910608/jpg/fig_4_1_hillclimbing1349062649219.jpg

Optimization
• Hill Climbing
• Random Restart
• Simulated Annealing
  – Parameter tuning
  – Still needs random restarts
Reference: http://i.makeagif.com/media/3-09-2014/XT0sjr.gif
Reference: http://www.itm.uni-stuttgart.de/research/alpso/bilder/pso.gif

Optimization
• Hill Climbing
• Random Restart
• Simulated Annealing
• Particle Swarm Optimization
  – Each particle only compares itself with its nearest neighbor (and the global best)
  – The population converges easily and gets stuck at misleading local optima
Reference: http://www.frontiersin.org/files/Articles/126819/fendo-05-00239HTML/image_m/fendo-05-00239-g001.jpg
Reference: http://giphy.com/gifs/9Mld29SbBA9xe

Optimization
• Hill Climbing
• Random Restart
• Simulated Annealing
• PSO
• Genetic Algorithm
  – vs. PSO: crossover spreads information evenly across the population
  – vs. SA: uses XO and mutation to generate better neighbors
  – Drawbacks: parameter tuning (population size); NFE wasted on ineffective points
Reference: https://classconnection.s3.amazonaws.com/608/flashcards/1910608/jpg/fig_4_1_hillclimbing1349062649219.jpg

Optimization – Lessons
• Methods covered: Hill Climbing, Random Restart, Simulated Annealing, PSO, Genetic Algorithm
• Lessons:
  – Multiple points (population)
  – Delayed reward
  – Neighbors' information
  – Information spread in the population
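The hill-climbing and random-restart ideas above are easy to pin down in code. Below is a minimal Python sketch; it is not taken from the slides, and the objective function, neighborhood, and restart count are illustrative assumptions.

```python
import math
import random

def hill_climb(f, x0, neighbors, max_iters=1000):
    """Greedy hill climbing: move to the best neighbor until no improvement."""
    x = x0
    for _ in range(max_iters):
        best = max(neighbors(x), key=f)
        if f(best) <= f(x):          # no improving neighbor: local maximum reached
            return x
        x = best
    return x

def random_restart_hill_climb(f, sample, neighbors, restarts=20):
    """Run hill climbing from several random starting points and keep the best."""
    return max((hill_climb(f, sample(), neighbors) for _ in range(restarts)), key=f)

# Toy example: a multimodal function over the integers 0..99.
f = lambda x: math.sin(x / 5.0) + 0.01 * x            # several local maxima
sample = lambda: random.randrange(100)                # random starting point
neighbors = lambda x: [max(0, x - 1), min(99, x + 1)]
print(random_restart_hill_climb(f, sample, neighbors))
```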
Optimization
• Genetic Algorithm
  – How to spread information among the population?
    • Need to learn structure before XO
    • Lots of work done in the MBGA field
  – How to propose new neighbors in a promising neighborhood?
    • Need to capture history and distribution
    • (A more ES-like approach in a population-less GA? Use a tabu distribution to generate new chromosomes)

Outline
• Brief on Optimization
• Mask order selection of DSMGA-II
  – MIMIC: a distribution approach of MBGA
  – Kernel K-means: a clustering approach
• Feature Selection and Feature Transformation
  – Selection methods: filtering and wrapping
  – Transformation methods: PCA and ICA

Improve DSMGA-II
• Mask order selection of DSMGA-II
  – Use mask(s) to generate N clusters, and estimate which clustering is better in order to choose masks
• How to generate clusters with only a distance matrix?
  – PAM, Kernel K-means, or MDS: a clustering approach
• How to evaluate clusters?
  – Kullback-Leibler divergence: a distribution approach

EDA (Estimation of Distribution Algorithms)
• Learn structure with a distribution
  – Generates new individuals from a probability distribution instead of crossover or mutation
Reference: Prof. Yu's GA Lecture Slides, Lecture 07 Road to Competence, slide 27
Reference: https://www.wikiwand.com/en/Estimation_of_distribution_algorithm

MIMIC
• Mutual Information Maximizing Input Clustering
• MIMIC is an EDA proposed in [de Bonet et al., 1997]

MIMIC
• Assumption: variables are dependent
  – Generates new individuals from a probability distribution
  – Finds the distribution with a dependency tree
    $X = x_1 x_2 x_3 \ldots x_n$
    $P(X) = P(x_1 \mid x_2 x_3 \ldots x_n)\, P(x_2 \mid x_3 \ldots x_n) \cdots P(x_{n-1} \mid x_n)\, P(x_n)$  ⇒ too difficult
    $P_\pi(X) = P(x_{i_1} \mid x_{i_2})\, P(x_{i_2} \mid x_{i_3}) \cdots P(x_{i_{n-1}} \mid x_{i_n})\, P(x_{i_n})$  ⇒ easier to generate samples
  – Need to estimate how well $P_\pi(X)$ describes $P(X)$

Kullback-Leibler Divergence
• Also known as discrimination information or relative entropy
  – The amount of information lost when Q is used to approximate P
  – Discrete: $D_{KL}(P \| Q) = \sum_i P(i) \log \frac{P(i)}{Q(i)}$
  – Continuous: $D_{KL}(P \| Q) = \int_{-\infty}^{\infty} p(x) \log \frac{p(x)}{q(x)}\, dx$
  – Properties: $D_{KL}(P \| Q) \neq D_{KL}(Q \| P)$
Reference: https://www.wikiwand.com/en/Kullback%E2%80%93Leibler_divergence

Kullback-Leibler Divergence
• For $P_\pi(X)$ to describe $P(X)$ best, we have to minimize $D_{KL}(P \| P_\pi)$
• This amounts to minimizing $J_\pi(X) = \sum_i h(x_i \mid \pi(x_i)) = \sum_i h(x_i) - \sum_i I(x_i ; \pi(x_i))$, where $h$ is the (conditional) entropy and $\pi(x_i)$ is the variable $x_i$ is conditioned on
• Since $\sum_i h(x_i)$ does not depend on $\pi$, minimizing $-\sum_i I(x_i ; \pi(x_i))$ is the same as maximizing the mutual information $\sum_i I(x_i ; \pi(x_i))$

Kullback-Leibler Divergence
• We need maximum mutual information $I(x_i ; \pi(x_i))$ between all dependent variables for $P_\pi(X)$ to describe $P(X)$ best, i.e. to minimize $D_{KL}(P \| P_\pi)$
• So given an MI matrix, we need to find a maximum spanning tree!
• For a dense matrix, use Prim's algorithm
Reference: Prof. Yu's GA Lecture Slides, Lecture 07a LTGA/DSMGA2

Prim's Algorithm
• Given an MI matrix, we need to find a maximum spanning tree
• Start with a random node, then repeatedly attach the heaviest edge leaving the current tree
• The resulting maximum spanning tree is similar to LTGA's linkage structure

MIMIC Lessons
• Assumption: variables are dependent
  – $P_\pi(X)$ uses "pair-wise" mutual information
• Find the distribution with a dependency tree
  – The structure MIMIC generates is similar to LTGA
  – What about ILS?
• Use KLD to optimize the distribution
  – Might be a good tool to estimate models in DSMGA-II
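The slides leave the "given an MI matrix, find a maximum spanning tree with Prim's algorithm" step at the idea level. Below is a minimal Python sketch of that step; it is not from the deck, and the binary population, the naive Prim loop, and the array shapes are illustrative assumptions.

```python
import numpy as np

def pairwise_mutual_information(pop):
    """MI matrix for a binary population of shape (n_individuals, n_genes)."""
    n, ell = pop.shape
    mi = np.zeros((ell, ell))
    for i in range(ell):
        for j in range(i + 1, ell):
            total = 0.0
            for a in (0, 1):
                for b in (0, 1):
                    p_ab = np.mean((pop[:, i] == a) & (pop[:, j] == b))
                    p_a = np.mean(pop[:, i] == a)
                    p_b = np.mean(pop[:, j] == b)
                    if p_ab > 0:
                        total += p_ab * np.log(p_ab / (p_a * p_b))
            mi[i, j] = mi[j, i] = total
    return mi

def prim_maximum_spanning_tree(weights):
    """Prim's algorithm on a dense weight matrix; returns the tree edges."""
    ell = weights.shape[0]
    in_tree = [0]                      # start from an arbitrary node
    edges = []
    while len(in_tree) < ell:
        best = None
        for u in in_tree:              # heaviest edge from the tree to a new node
            for v in range(ell):
                if v not in in_tree and (best is None or weights[u, v] > weights[best[0], best[1]]):
                    best = (u, v)
        edges.append(best)
        in_tree.append(best[1])
    return edges

# Example: random binary population of 50 individuals with 8 genes.
pop = np.random.randint(0, 2, size=(50, 8))
print(prim_maximum_spanning_tree(pairwise_mutual_information(pop)))
```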
Outline
• Brief on Optimization
• Mask order selection of DSMGA-II
  – MIMIC: a distribution approach of MBGA
  – Kernel K-means: a clustering approach
• Feature Selection and Feature Transformation
  – Selection methods: filtering and wrapping
  – Transformation methods: PCA and ICA

Clustering
• Connectivity-based clustering (hierarchical clustering)
  – Single-linkage clustering
• Centroid-based clustering
  – K-means
• Soft clustering
  – EM

Clustering
• Connectivity-based clustering (hierarchical clustering)
  – Single-linkage clustering (LTGA, ILS)
Reference: http://www.multid.se/genex/onlinehelp/clustering_distances.png
Reference: http://image.slidesharecdn.com/log6kntt4i4dgwfwbpxw-signature75c4ed0a4b22d2fef90396cdcdae85b38911f9dce0924a5c52076dfcf2a19db1-poli151222094044/95/hierarchical-clustering-15-638.jpg?cb=1450777272

Clustering
• Centroid-based clustering
  – K-means
  – Assumes the structure is centroid-based?
  – Needs Euclidean distances, so it cannot be applied directly
Reference: http://simplystatistics.org/2014/02/18/k-means-clustering-in-a-gif/

Clustering
• Soft clustering
  – Expectation Maximization
  – Solves overlapping?
  – Solves cross-competition?
Reference: https://upload.wikimedia.org/wikipedia/commons/6/69/EM_Clustering_of_Old_Faithful_data.gif

DSMGA-II Clustering
• Given a distance matrix, how do we find clusters?
  – K-medoids, a.k.a. PAM (Partitioning Around Medoids)
  – Kernel K-means
  – MDS (multidimensional scaling)

K-medoids
• Fewer requirements on the data than k-means
• K-means needs Euclidean distances to compute means
• K-medoids only needs a dissimilarity between each pair of data points
• Use 1/(mutual information) as the dissimilarity
Reference: http://cfile5.uf.tistory.com/image/033D5E35511F467D29C80E

The Kernel Trick Revisited
• Map points to a feature space using a basis function $\varphi(x)$
• Replace the dot product $\varphi(x) \cdot \varphi(y)$ with the kernel entry $K(x, y)$
• Mercer's condition: to expand a kernel function $K(x, y)$ into a dot product, i.e. $K(x, y) = \varphi(x) \cdot \varphi(y)$, $K(x, y)$ has to be a positive semi-definite function, i.e. for any function $f(x)$ whose $\int f(x)^2\, dx$ is finite, $\iint f(x)\, K(x, y)\, f(y)\, dx\, dy \geq 0$
Credit: Radha Chitta's slides about Kernel K-means, April 16, 2013

Kernel k-means
• Minimize the sum of squared error: $\min \sum_{i=1}^{n} \sum_{j=1}^{m} u_{ij} \left\| x_i - c_j \right\|^2$
• Replace $x_i$ with $\varphi(x_i)$: $\min \sum_{i=1}^{n} \sum_{j=1}^{m} u_{ij} \left\| \varphi(x_i) - \tilde{c}_j \right\|^2$, with $u_{ij} \in \{0, 1\}$ and $\sum_{j=1}^{m} u_{ij} = 1$
• Cluster centers: $\tilde{c}_j = \frac{1}{n_j} \sum_{i=1}^{n} u_{ij}\, \varphi(x_i)$
• Substituting the centers: $\sum_{i=1}^{n} \sum_{j=1}^{m} u_{ij} \left\| \varphi(x_i) - \frac{1}{n_j} \sum_{l=1}^{n} u_{lj}\, \varphi(x_l) \right\|^2$
• Use the kernel trick: $\sum_{i=1}^{n} \sum_{j=1}^{m} u_{ij} \left\| \varphi(x_i) - \tilde{c}_j \right\|^2 = \operatorname{trace}(K) - \operatorname{trace}(U^{\top} K U)$
• Optimization problem: $\min\; \operatorname{trace}(K) - \operatorname{trace}(U^{\top} K U) \;\Leftrightarrow\; \max\; \operatorname{trace}(U^{\top} K U)$
• $K$ is the $n \times n$ kernel matrix and $U$ is the optimal normalized cluster membership matrix
Credit: Radha Chitta's slides about Kernel K-means, April 16, 2013

MDS
• Multidimensional scaling
  – Place each object in N-dimensional space such that the between-object distances are preserved as well as possible
  – Then run K-means on the projected 2-D Euclidean coordinates
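Since DSMGA-II only provides a distance (or MI-derived dissimilarity) matrix, the MDS route above can be sketched with scikit-learn: embed the precomputed dissimilarities, then run ordinary K-means on the embedding. This is a minimal sketch under assumed inputs; the random matrix stands in for an actual 1/MI dissimilarity from DSMGA-II.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import MDS

# Hypothetical precomputed dissimilarity matrix, e.g. 1 / (MI + eps) between genes.
# Random numbers stand in for the real mutual-information values.
rng = np.random.default_rng(0)
mi = rng.random((8, 8))
mi = (mi + mi.T) / 2                     # mutual information is symmetric
dissimilarity = 1.0 / (mi + 1e-9)
np.fill_diagonal(dissimilarity, 0.0)     # zero self-dissimilarity

# Embed the objects in 2-D so that pairwise distances are preserved as well as possible.
embedding = MDS(n_components=2, dissimilarity="precomputed",
                random_state=0).fit_transform(dissimilarity)

# Ordinary K-means on the embedded Euclidean coordinates.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(embedding)
print(labels)
```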
Outline
• Brief on Optimization
• Mask order selection of DSMGA-II
  – MIMIC: a distribution approach of MBGA
  – Kernel K-means: a clustering approach
• Feature Selection and Feature Transformation
  – Selection methods: filtering and wrapping
  – Transformation methods: PCA and ICA

[Diagram: Data, Feature extractor/generator, Feature selection, Feature transformation, Features, Agent, Classifications, Goal]

PCA
• Principal Component Analysis
  – The first principal component has the largest possible variance
Reference: http://weigend.com/files/teaching/stanford/2008/stanford2008.wikispaces.com/file/view/pca_example.gif

ICA
• Independent Component Analysis
  – Separates independent sources that are linearly mixed in the sensors
Reference: https://onionesquereality.files.wordpress.com/2010/01/cocktail-party-problem.jpg

References
1. De Bonet, J. S., Isbell, C. L., & Viola, P. (1997). MIMIC: Finding optima by estimating probability densities. Advances in Neural Information Processing Systems, 424-430.
2. Dhillon, I. S., Guan, Y., & Kulis, B. (2004, August). Kernel k-means: spectral clustering and normalized cuts. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 551-556). ACM.
3. Van der Laan, M., Pollard, K., & Bryan, J. (2003). A new partitioning around medoids algorithm. Journal of Statistical Computation and Simulation, 73(8), 575-584.
4. Kleinberg, J. (2003). An impossibility theorem for clustering. Advances in Neural Information Processing Systems, 463-470.
5. http://alexhwilliams.info/itsneuronalblog/2015/10/01/clustering2/
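As a concrete footnote to the PCA slide, here is a minimal scikit-learn sketch (illustrative random data, not part of the original deck):

```python
import numpy as np
from sklearn.decomposition import PCA

# Illustrative 2-D data with one dominant direction of variance.
rng = np.random.default_rng(0)
x = rng.normal(size=500)
data = np.column_stack([x, 0.3 * x + 0.1 * rng.normal(size=500)])

pca = PCA(n_components=2).fit(data)
# The first principal component captures the largest possible variance.
print(pca.explained_variance_ratio_)  # first entry close to 1
print(pca.components_[0])             # direction of the first principal component
```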