Machine Learning for Data Mining
Andres Mendez-Vazquez, PhD
Email: [email protected]
Room 365
December 7, 2015

1 Introduction

The first tool for attacking data mining problems, machine learning, is a computer science discipline concerned with the design of algorithms that allow computers to evolve behaviors based on empirical data. These algorithms can be organized into the following hierarchy: supervised learning, unsupervised learning, and semi-supervised learning. Thus, data mining is, more than anything, an interdisciplinary subfield of computer science dealing with the discovery of patterns in large data sets, involving methods from artificial intelligence, machine learning, statistics, and database systems. Nowadays, this intersection of subfields has flourished into what many people call Data Science, making machine learning and data mining two of its cornerstones. However, as a cautionary tale, a person who wants to be a data scientist must be proficient in:

1. Algorithms.
2. Data Structures.
3. Probability and Statistics.
4. Linear Algebra.
5. Linear and Non-Linear Optimization.
6. Software Engineering.
7. Machine Learning.
8. Data Mining.
9. Programming languages for prototyping, such as Python or R.
10. Parallel programming in C++ and Java.

And I make this remark because there are many people with the title Data Scientist who are disrespecting what may be one of the most important fields in Computer Science for the XXI century.

2 Course Objectives

This is a theoretical and practical 65-hour course that introduces students to the concepts of machine learning and data mining for processing and analyzing data from different sources. The emphasis is on various machine learning/data mining problems and their use for data analysis. Students will develop an understanding of the machine learning/data mining process and its issues, learn techniques for machine learning/data mining, and apply them in solving machine learning/data mining problems.
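Since the course recommends Python for prototyping, the supervised-learning setting mentioned above can be illustrated with a minimal, self-contained sketch: a 1-nearest-neighbor classifier on toy data. The data points, labels, and function names here are purely illustrative and are not part of the course material.

```python
import math

# Illustrative toy training set: (feature vector, label) pairs.
train = [
    ((1.0, 1.0), "A"),
    ((1.2, 0.8), "A"),
    ((4.0, 4.2), "B"),
    ((3.8, 4.0), "B"),
]

def euclidean(p, q):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def predict_1nn(x, data):
    """Supervised learning in its simplest form: label a new point
    with the label of its nearest training example (1-NN)."""
    _, label = min(data, key=lambda pair: euclidean(pair[0], x))
    return label

print(predict_1nn((1.1, 0.9), train))  # prints "A"
print(predict_1nn((4.1, 3.9), train))  # prints "B"
```

The same "learn behavior from empirical data" idea underlies every supervised method in the course topics below; only the model family and the way it generalizes from the training set change.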
3 Prerequisites

Linear algebra, probability, artificial intelligence, and analysis of algorithms.

Note: A piece of advice: this is barely the beginning of the amount of mathematics that you should be able to handle in order to be successful in this area. That is the reason a course on mathematics for intelligent systems will be taught during the summer.

4 Class Grading

The grades in the class will be assigned in the following way:

Task         Percentage
Midterm I    15%
Midterm II   15%
Final        15%
Homeworks    25%
Project      30%

5 Projects

The project will be on a specific problem that each student wants to work on. Possible topics are:

• Oil exploration.
• Association Rule Pre-Processing Project.
• Page Ranking.
• Web Word Relevance Measures.
• Recommendation Systems.
• Choquet Integral/Aggregation Rules for Information Fusion.
• Something you are really interested in. Please come and talk to me.

There are more possibilities at:

• https://www.kaggle.com/competitions
• http://aws.amazon.com/datasets/

Dates for reviewing the projects are:

Review No.   Date
1            May 29, 2014
2            June 12, 2014
3            June 26, 2014
4            July 10, 2014
5            July 24, 2014
6            August 7, 2014

6 About the Reading Material

Because of the scope of machine learning, different subjects will be drawn from different textbooks and papers. Therefore:

1. The recommended books are in the bibliography at the end.
2. In addition, several articles will be used as we progress through the class.

7 Course Topics

I.1 Introduction

I.2 Probability and Linear Algebra Review [8, 2]

Machine Learning for Data Mining

I.3 Supervised Learning

I.3.1 Probability Classifiers [1, 4, 14]
1. Discriminant Functions
2. Naive Bayes
3. Maximum Likelihood
4. Going back to Linear Classifiers
5. Expectation Maximization and Mixture of Gaussians
6. Maximum a Posteriori Probability Estimation [5]

I.3.2 Kernel-Based Classifiers [7]
1. Introduction
2. Support Vector Machines

I.3.3 Graphical Model Based Classifiers [1, 4, 14]
1. Decision Trees
2. Hidden Markov Models
   • The Viterbi Algorithm
   • The Baum-Welch Algorithm
3. Neural Networks [7]
   • Perceptron
   • Multilayer Perceptron
   • Universal Approximation
   • Radial Basis Networks
   • Convolutional Networks
   • Deep Neural Networks

I.3.4 Important Issues [1, 4, 14]
1. Bias-Variance Dilemma
2. The Confusion Matrix
3. K-Fold Cross-Validation
4. Tuning of Parameters [13]
   (a) Bayesian Optimization

I.3.5 Data Preparation
1. Feature Selection [1, 4, 14]
   (a) Preprocessing
   (b) Statistical Methods
   (c) Class Separability
   (d) Feature Subset Selection
2. Feature Generation [1, 4, 14]
   (a) Introduction
   (b) Fisher Linear Discriminant
   (c) Dimensionality Reduction
      i. Principal Component Analysis
      ii. The Singular Value Decomposition

I.3.6 Combining Classifiers [1, 11]
1. Average Rules
2. Majority Voting Rule
3. A Bayesian Viewpoint
4. Boosting

I.4 Unsupervised Learning [1, 4, 14, 7]
1. Introduction
2. Proximity Measures
3. Basic Clustering Algorithms: K-Means and Mixture of Gaussians
4. Clustering Based on Cost Functions: Fuzzy c-Means, Possibilistic Clustering
5. Hierarchical Clustering
6. Self-Organizing Maps
7. Cluster Validity

I.5 Semi-supervised Learning [3]
1. Introduction
2. Text Classification Using EM
3. Transductive Support Vector Machines

Data Mining Techniques and Applications

I.6 Mining the Web for Structured Data [6, 9, 10, 15]
1. Why do we want to deal with data from the web?

I.7 Frequent Itemsets and Association Rules [6, 15]
1. The Market-Basket Problem
2. The Basic Algorithm
3. Improvements

I.8 Near Neighbor Search in High-Dimensional Data [10, 12]
1. Locality-Sensitive Hashing
2. Near Neighbor Search

I.9 Structure of the Web Graph [9]
1. The PageRank Algorithm
2. Topic-Specific PageRank
3. TrustRank

References

[1] Christopher M.
Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.

[2] George Casella and Roger Berger. Statistical Inference. Duxbury Resource Center, June 2001.

[3] Olivier Chapelle, Bernhard Schölkopf, and Alexander Zien. Semi-Supervised Learning. The MIT Press, 1st edition, 2010.

[4] Richard O. Duda, Peter E. Hart, and David G. Stork. Pattern Classification (2nd Edition). Wiley-Interscience, 2000.

[5] Mário A. T. Figueiredo. Adaptive sparseness for supervised learning. IEEE Trans. Pattern Anal. Mach. Intell., 25(9):1150–1159, September 2003.

[6] Jiawei Han. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2011.

[7] Simon Haykin. Neural Networks and Learning Machines. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 2008.

[8] Kenneth M. Hoffman and Ray Kunze. Linear Algebra. Prentice-Hall, 1971.

[9] Amy N. Langville and Carl D. Meyer. Google's PageRank and Beyond: The Science of Search Engine Rankings. Princeton University Press, Princeton, NJ, USA, 2006.

[10] Anand Rajaraman and Jeffrey David Ullman. Mining of Massive Datasets. Cambridge University Press, New York, NY, USA, 2011.

[11] Raúl Rojas. AdaBoost and the Super Bowl of Classifiers: A Tutorial Introduction to Adaptive Boosting. 2009.

[12] Hanan Samet. Foundations of Multidimensional and Metric Data Structures (The Morgan Kaufmann Series in Computer Graphics and Geometric Modeling). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2005.

[13] Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical Bayesian optimization of machine learning algorithms. In Peter L. Bartlett, Fernando C. N. Pereira, Christopher J. C. Burges, Léon Bottou, and Kilian Q. Weinberger, editors, NIPS, pages 2960–2968, 2012.

[14] Sergios Theodoridis and Konstantinos Koutroumbas. Pattern Recognition, Fourth Edition. Academic Press, 4th edition, 2008.

[15] Ian H. Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques, Second Edition (Morgan Kaufmann Series in Data Management Systems). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2005.

Cinvestav GDL