Machine Learning for Data Mining
December 7, 2015
Andres Mendez-Vazquez, PhD
Email: [email protected]
Room 365
1
Introduction
Machine learning, the first tool for attacking data mining problems, is a computer science discipline concerned with the
design of algorithms that allow computers to evolve behaviors based on empirical data. These algorithms can be
organized into the following categories: Supervised Learning, Unsupervised Learning and Semi-supervised Learning.
Data mining, in turn, is above all an interdisciplinary subfield of computer science that deals with the discovery
of patterns in large data sets, using methods from artificial intelligence, machine learning, statistics, and database
systems. Nowadays, this intersection of subfields has flourished into what many people call Data Science, with
Machine Learning and Data Mining among its cornerstones. However, as a cautionary note, a person who wants
to be a Data Scientist must be proficient in:
1. Algorithms.
2. Data Structures.
3. Probability and Statistics.
4. Linear Algebra.
5. Linear and Non-Linear Optimization.
6. Software Engineering.
7. Machine Learning.
8. Data Mining.
9. Programming Languages for prototyping, such as Python or R.
10. Parallel Programming in C++ and Java.
I make this remark because there are so many people with the title Data Scientist who are disrespecting
what can be one of the most important fields in Computer Science for the XXI century.
Syllabus Data Structures
2
Course Objectives
This is a theoretical and practical 65-hour course that introduces students to the concepts of machine learning
and data mining for processing and analyzing data from different sources. The emphasis is on various machine
learning/data mining problems and their use for data analysis. Students will develop an understanding of the
machine learning/data mining process and issues, learn techniques for machine learning/data mining, and apply
them in solving machine learning/data mining problems.
3
Prerequisites
Linear algebra, probability, artificial intelligence and analysis of algorithms.
Note:
A piece of advice: this is barely the beginning of the amount of math that you should be able to handle in
order to be successful in this area. That is why a math for intelligent systems course is going to be taught
during the summer.
4
Class Grading
The class will be graded in the following way:
Task          Percentage
Midterm I     15%
Midterm II    15%
Final         15%
8 Homeworks   25%
Project       30%
5
Projects
The project will be on a specific problem that each student wants to work on. Possible topics are:
• Oil exploration.
• Association Rule Pre-Processing Project.
• Page Ranking.
• Web Word Relevance Measures.
• Recommendation Systems.
• Choquet Integral/Aggregation Rules for Information Fusion.
• Something you are really interested in. Please come and talk to me.
There are more possibilities at:
• https://www.kaggle.com/competitions
• http://aws.amazon.com/datasets/
Cinvestav GDL
Dates for reviewing the projects are:
Review No.   Date
1            May 29, 2014
2            June 12, 2014
3            June 26, 2014
4            July 10, 2014
5            July 24, 2014
6            August 7, 2014
6
About the Reading Material
Because of the scope of machine learning, different subjects will be drawn from different textbooks and papers.
Therefore:
1. The recommended books are listed in the bibliography at the end.
2. In addition, several articles will be used as we progress through the class.
7
Course Topics
I.1 Introduction
I.2 Probability and Linear Algebra Review [8, 2].
Machine Learning for Data Mining
I.3 Supervised Learning
I.3.1 Probability Classifiers [1, 4, 14]
1. Discriminant Functions
2. Naive Bayes
3. Maximum Likelihood
4. Going back to Linear Classifiers
5. Expectation Maximization and Mixture of Gaussians
6. Maximum a Posteriori Probability Estimation [5]
I.3.2 Kernel Based Classifiers [7]
1. Introduction
2. Support Vector Machines
I.3.3 Graphical Model Based Classifiers [1, 4, 14]
1. Decision Trees.
2. Hidden Markov Models
• The Viterbi Algorithm
• Baum-Welch Algorithm
3. Neural Networks [7]
• Perceptron
• Multilayer Perceptron
• Universal Approximation
• Radial Basis Networks
• Convolutional Networks
• Deep Neural Networks
I.3.4 Important Issues [1, 4, 14]
1. Bias-Variance Dilemma
2. The Confusion Matrix
3. K-Fold Cross Validation
4. Tuning of Parameters [13]
(a) Bayesian Optimization
I.3.5 Data Preparation
1. Feature selection [1, 4, 14]
(a) Preprocessing
(b) Statistical Methods
(c) Class Separability
(d) Feature Subset Selection.
2. Feature Generation [1, 4, 14]
(a) Introduction
(b) Fisher Linear Discriminant
(c) Dimensionality Reduction
i. Principal Component Analysis
ii. The Singular Value Decomposition
I.3.6 Combining Classifiers [1, 11]
1. Average Rules
2. Majority Voting Rule
3. A Bayesian viewpoint
4. Boosting
I.4 Unsupervised Learning [1, 4, 14, 7]
1. Introduction
2. Proximity Measures.
3. Basic Clustering Algorithms: K-Means and Mixture of Gaussians
4. Clustering Based on Cost Functions: Fuzzy c-means, Possibilistic Clustering.
5. Hierarchical Clustering
6. Self-Organizing Maps
7. Cluster Validity
I.5 Semi-supervised Learning [3]
1. Introduction
2. Text classification using EM
3. Transductive Support Vector Machines
Data Mining Techniques and Applications
I.6 Mining the Web for Structured Data [6, 9, 10, 15]
1. Why do we want to deal with data from the web?
I.7 Frequent Itemsets and Association Rules [6, 15]
1. The Market Problem
2. The basic algorithm
3. Improvements
I.8 Near Neighbor Search in High Dimensional Data [10, 12]
1. Locality Sensitive Hashing
2. Near Neighbor Search
I.9 Structure of the Webgraph [9]
1. Page Rank Algorithm
2. Topic Specific Page Rank
3. Trust Rank
References
[1] Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2006.
[2] George Casella and Roger Berger. Statistical Inference. Duxbury Resource Center, June 2001.
[3] Olivier Chapelle, Bernhard Schölkopf, and Alexander Zien. Semi-Supervised Learning. The MIT Press, 1st edition, 2010.
[4] Richard O. Duda, Peter E. Hart, and David G. Stork. Pattern Classification (2nd Edition). Wiley-Interscience, 2000.
[5] Mário A. T. Figueiredo. Adaptive sparseness for supervised learning. IEEE Trans. Pattern Anal. Mach. Intell., 25(9):1150–1159, September 2003.
[6] Jiawei Han. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2011.
[7] Simon Haykin. Neural Networks and Learning Machines. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 2008.
[8] Kenneth M. Hoffman and Ray Kunze. Linear algebra. Prentice-Hall, 1971.
[9] Amy N. Langville and Carl D. Meyer. Google's PageRank and Beyond: The Science of Search Engine Rankings. Princeton University Press, Princeton, NJ, USA, 2006.
[10] Anand Rajaraman and Jeffrey David Ullman. Mining of Massive Datasets. Cambridge University Press, New York, NY, USA, 2011.
[11] Raúl Rojas. Adaboost and the super bowl of classifiers: a tutorial introduction to adaptive boosting. 2009.
[12] Hanan Samet. Foundations of Multidimensional and Metric Data Structures (The Morgan Kaufmann Series in Computer Graphics and Geometric Modeling). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2005.
[13] Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical bayesian optimization of machine learning algorithms. In Peter L. Bartlett, Fernando C. N. Pereira, Christopher J. C. Burges, Léon Bottou, and Kilian Q. Weinberger, editors, NIPS, pages 2960–2968, 2012.
[14] Sergios Theodoridis and Konstantinos Koutroumbas. Pattern Recognition, Fourth Edition. Academic Press, 4th edition, 2008.
[15] Ian H. Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques, Second Edition (Morgan Kaufmann Series in Data Management Systems). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2005.