CSE 5331/7331 Fall 2007
Machine Learning
Margaret H. Dunham
Department of Computer Science and Engineering
Southern Methodist University
Some slides extracted from Data Mining, Introductory and Advanced Topics, Prentice Hall, 2002. Other slides from CS 545 at Colorado State University, Chuck Anderson.

Table of Contents
Introduction (Chuck Anderson)
Statistical Machine Learning Examples
– Estimation
– EM
– Bayes Theorem
Decision Tree Learning
Neural Network Learning

The slides in this introductory section are from CS545: Machine Learning, by Chuck Anderson, Department of Computer Science, Colorado State University, Fall 2006.

What is Machine Learning?
Statistics ≈ the science of inference from data
Machine learning ≈ multivariate statistics + computational statistics
Multivariate statistics ≈ prediction of values of a function assumed to underlie a multivariate dataset
Computational statistics ≈ computational methods for statistical problems (aka statistical computation) + statistical methods that happen to be computationally intensive
Data mining ≈ exploratory data analysis, particularly with massive/complex datasets

Kinds of Learning
Learning algorithms are often categorized according to the amount of information provided:
Least information:
– Unsupervised learning is the most exploratory. It requires only samples of inputs and must find regularities on its own.
More information:
– Reinforcement learning is the most recent. It requires samples of inputs, actions, and rewards or punishments.
Most information:
– Supervised learning is the most common. It requires samples of inputs and desired outputs.
Examples of Algorithms
Supervised learning
– Regression
» multivariate regression
» neural networks and kernel methods
– Classification
» linear and quadratic discriminant analysis
» k-nearest neighbors
» neural networks and kernel methods
Reinforcement learning
– multivariate regression
– neural networks
Unsupervised learning
– principal components analysis
– k-means clustering
– self-organizing networks

Statistical Machine Learning Examples

Point Estimation
Point estimate: an estimate of a population parameter. It may be made by calculating the parameter for a sample, and may be used to predict a value for missing data.
Ex:
– R contains 100 employees
– 99 have salary information
– The mean salary of these is $50,000
– Use $50,000 as the value of the remaining employee's salary
Is this a good idea?

Estimation Error
Bias: the difference between the expected value of the estimator and the actual value:
Bias = E[θ̂] − θ
Mean squared error (MSE): the expected value of the squared difference between the estimate and the actual value:
MSE = E[(θ̂ − θ)²]
Why square? Squaring keeps positive and negative errors from canceling and penalizes large errors more heavily.
Root mean squared error (RMSE): the square root of the MSE, which expresses the error in the original units.

Jackknife Estimate
Jackknife estimate: an estimate of a parameter obtained by omitting one value from the set of observed values.
Ex: estimate of the mean for X = {x1, …, xn} with xi omitted:
θ̂(i) = (Σj≠i xj) / (n − 1)

Maximum Likelihood Estimate (MLE)
Obtain the parameter estimates that maximize the probability that the sample data occurs for the specific model.
The joint probability of observing the sample data is found by multiplying the individual probabilities. Likelihood function:
L(Θ | x1, …, xn) = ∏i p(xi | Θ)
Maximize L.
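The jackknife and RMSE definitions above can be sketched in a few lines of Python (an illustrative sketch, not from the slides; the sample data is made up):

```python
import math

def jackknife_means(xs):
    """Leave-one-out (jackknife) estimates of the mean:
    for each i, the mean of the sample with x_i omitted."""
    total = sum(xs)
    n = len(xs)
    return [(total - x) / (n - 1) for x in xs]

def rmse(estimates, actual):
    """Root mean squared error of a list of estimates
    against the actual parameter value."""
    mse = sum((e - actual) ** 2 for e in estimates) / len(estimates)
    return math.sqrt(mse)

X = [10, 12, 14, 16, 18]
print(jackknife_means(X))          # [15.0, 14.5, 14.0, 13.5, 13.0]
print(rmse(jackknife_means(X), 14.0))
```

Each leave-one-out mean moves away from the omitted value, and the spread of these estimates around the true mean (here 14) is what the RMSE summarizes.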
MLE Example
Coin toss five times: {H, H, H, H, T}.
Assuming a perfect coin with H and T equally likely, the likelihood of this sequence is:
L = (1/2)⁵ = 0.03125
However, if the probability of an H is 0.8, then:
L = (0.8)⁴(0.2) = 0.08192

MLE Example (cont'd)
General likelihood formula for h heads in n tosses:
L(p) = pʰ(1 − p)ⁿ⁻ʰ
Maximizing L gives p̂ = h/n; the estimate for p is then 4/5 = 0.8.

Expectation-Maximization (EM)
Solves estimation problems with incomplete data.
Obtain initial estimates for the parameters, then iteratively use the estimates to fill in the missing data and re-estimate the parameters, continuing until convergence.

Bayes Theorem
Posterior probability: P(h1 | xi)
Prior probability: P(h1)
Bayes theorem:
P(hj | xi) = P(xi | hj) P(hj) / P(xi)
It assigns probabilities to hypotheses given a data value.

Bayes Theorem Example
Credit authorizations (hypotheses): h1 = authorize purchase, h2 = authorize after further identification, h3 = do not authorize, h4 = do not authorize but contact police.
Assign twelve data values for all combinations of credit and income:

Credit     | Income 1 | Income 2 | Income 3 | Income 4
Excellent  | x1       | x2       | x3       | x4
Good       | x5       | x6       | x7       | x8
Bad        | x9       | x10      | x11      | x12

From training data: P(h1) = 60%; P(h2) = 20%; P(h3) = 10%; P(h4) = 10%.

Bayes Example (cont'd)
Training data:

ID | Income | Credit    | Class | xi
1  | 4      | Excellent | h1    | x4
2  | 3      | Good      | h1    | x7
3  | 2      | Excellent | h1    | x2
4  | 3      | Good      | h1    | x7
5  | 4      | Good      | h1    | x8
6  | 2      | Excellent | h1    | x2
7  | 3      | Bad       | h2    | x11
8  | 2      | Bad       | h2    | x10
9  | 3      | Bad       | h3    | x11
10 | 1      | Bad       | h4    | x9

Bayes Example (cont'd)
Calculate P(xi | hj) and P(xi).
Ex: P(x7|h1) = 2/6; P(x4|h1) = 1/6; P(x2|h1) = 2/6; P(x8|h1) = 1/6; P(xi|h1) = 0 for all other xi.
Predict the class for x4:
– Calculate P(hj | x4) for all hj.
– Place x4 in the class with the largest value.
– Ex:
» P(h1|x4) = (P(x4|h1) P(h1)) / P(x4) = (1/6)(0.6)/0.1 = 1.
» So x4 goes in class h1.
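The Bayes calculation above can be reproduced with a short Python sketch (illustrative only; the `posterior` helper and the data encoding are mine, with the training pairs and priors taken from the slides):

```python
# Training data from the slides, encoded as (class, x-value) pairs.
data = [("h1", "x4"), ("h1", "x7"), ("h1", "x2"), ("h1", "x7"), ("h1", "x8"),
        ("h1", "x2"), ("h2", "x11"), ("h2", "x10"), ("h3", "x11"), ("h4", "x9")]

# Priors from the slides.
priors = {"h1": 0.6, "h2": 0.2, "h3": 0.1, "h4": 0.1}

def posterior(h, x):
    """P(h|x) = P(x|h) P(h) / P(x), with P(x|h) and P(x)
    estimated as relative frequencies in the training data."""
    n_h = sum(1 for c, _ in data if c == h)
    likelihood = sum(1 for c, v in data if c == h and v == x) / n_h
    evidence = sum(1 for _, v in data if v == x) / len(data)
    return likelihood * priors[h] / evidence

print(round(posterior("h1", "x4"), 6))  # (1/6)(0.6)/0.1 ≈ 1.0
```

To classify x4 one would evaluate `posterior(h, "x4")` for all four hypotheses and pick the largest; since P(x4|hj) = 0 for h2, h3, and h4 in this training set, h1 wins.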
Decision Tree Learning

Twenty Questions Game
(Motivating figure: the game of twenty questions.)

Decision Trees
Decision tree (DT):
– A tree where the root and each internal node are labeled with a question.
– The arcs represent each possible answer to the associated question.
– Each leaf node represents a prediction of a solution to the problem.
A popular technique for classification; the leaf node indicates the class to which the corresponding tuple belongs.

Decision Tree Example
(Figure: an example decision tree.)

Decision Trees
How do you build a good DT? What is a good DT?
Ans: supervised learning.

Comparing DTs
(Figures: a balanced tree vs. a deep tree.)

Decision tree induction is often based on information theory, so:

Information
(Figure.)

DT Induction
When all the marbles in the bowl are mixed up, little information is given.
When the marbles in the bowl are all from one class and those in the other two classes are on either side, more information is given.
Use this approach with DT induction!

Information/Entropy
Given probabilities p1, p2, …, ps whose sum is 1, entropy is defined as:
H(p1, …, ps) = Σi pi log(1/pi)
Entropy measures the amount of randomness, surprise, or uncertainty.
The goal in classification is no surprise, i.e., entropy = 0.

Neural Network Learning

Neural Networks
Based on the observed functioning of the human brain (artificial neural networks, ANN).
Our view of neural networks is very simplistic.
We view a neural network (NN) from a graphical viewpoint. Alternatively, a NN may be viewed from the perspective of matrices.
Used in pattern recognition, speech recognition, computer vision, and classification.

Neural Networks
A neural network (NN) is a directed graph F = <V, A> with vertices V = {1, 2, …, n} and arcs A = {<i, j> | 1 <= i, j <= n}, with the following restrictions:
– V is partitioned into a set of input nodes VI, hidden nodes VH, and output nodes VO.
– The vertices are also partitioned into layers.
– Any arc <i, j> must have node i in layer h − 1 and node j in layer h.
– Arc <i, j> is labeled with a numeric value wij.
– Node i is labeled with a function fi.

Neural Network Example
(Figure.)

NN Node
(Figure.)

NN Activation Functions
Activation functions are the functions associated with the nodes in the graph. A node's output may be in the range [−1, 1] or [0, 1].

NN Learning
Propagate input values through the graph.
Compare the output to the desired output.
Adjust the weights in the graph accordingly.
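The propagate/compare/adjust loop above can be sketched for a single sigmoid node (a minimal illustration, not the slides' algorithm; the AND-function training set, learning rate, and epoch count are my own choices):

```python
import math
import random

def sigmoid(z):
    """Logistic activation: maps any input into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def train(samples, epochs=5000, lr=0.5):
    """Train one sigmoid node with the loop from the slides:
    propagate inputs, compare to the desired output, adjust weights."""
    random.seed(0)
    w = [random.uniform(-1, 1) for _ in range(3)]  # two input weights + bias
    for _ in range(epochs):
        for (x1, x2), target in samples:
            out = sigmoid(w[0]*x1 + w[1]*x2 + w[2])  # propagate
            err = target - out                       # compare
            grad = err * out * (1 - out)             # scale by sigmoid slope
            w[0] += lr * grad * x1                   # adjust weights
            w[1] += lr * grad * x2
            w[2] += lr * grad
    return w

# Learn the logical AND function.
samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = train(samples)
for (x1, x2), t in samples:
    # Rounded outputs should match the targets 0, 0, 0, 1.
    print((x1, x2), round(sigmoid(w[0]*x1 + w[1]*x2 + w[2])))
```

AND is linearly separable, so a single node suffices; a multi-layer network trained by backpropagation applies the same compare-and-adjust idea layer by layer.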