Machine Learning for Language Technology
Lecture 1: Introduction
Uppsala University, Department of Linguistics and Philology, September 2013
Last updated: 2 Sept 2013
Most slides adapted from previous courses (i.e. from E. Alpaydin and J. Nivre).

Practical Information
- Reading list, assignments, exams, etc.
- Course web pages:
  http://stp.lingfil.uu.se/~santinim/ml/ml_fall2013.pdf
  http://stp.lingfil.uu.se/~santinim/ml/MachineLearning_fall2013.htm
- Contact details: [email protected] ([email protected])

About the Course
- Introduction to machine learning, with a focus on methods used in Language Technology and NLP:
  - Decision trees and nearest neighbor methods (Lecture 2)
  - Linear models; the Weka ML package (Lectures 3 and 4)
  - Ensemble methods; structured prediction (Lectures 5 and 6)
  - Text mining and big data; R and RapidMiner (Lecture 7)
  - Unsupervised learning (clustering) (Lecture 8, Magnus Rosell)
- Builds on Statistical Methods in NLP: this course covers mostly discriminative methods; generative probability models were covered in the first course.

Digression: Generative vs. Discriminative Methods
- A generative method only applies to probabilistic models. A model is generative if it models the joint distribution of x and y together, P(Y, X). It is called generative because you can use it to generate data points with the correct probability distribution.
- Conditional methods model the conditional distribution of the output given the input: P(Y | X).
- Discriminative methods do not model probabilities at all; they map the input to the output directly.
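To make the distinction concrete, here is a minimal sketch (toy, hypothetical word/label data, not part of the course material). It estimates a joint distribution P(Y, X) by counting and shows that only the generative view supports sampling data points:

```python
# A minimal sketch (hypothetical toy data) contrasting the two views.
from collections import Counter
import random

# Toy corpus: (word, label) pairs, e.g. for spam filtering.
data = [("free", "spam"), ("free", "spam"), ("free", "ham"),
        ("meeting", "ham"), ("meeting", "ham"), ("free", "spam")]

# Generative: model the joint P(Y, X) by counting (x, y) pairs.
joint = Counter(data)
total = sum(joint.values())
p_joint = {xy: c / total for xy, c in joint.items()}

# Because we have the joint, we can *generate* (x, y) pairs
# with the correct probability distribution:
pairs, weights = zip(*p_joint.items())
print(random.choices(pairs, weights=weights, k=3))

# Conditional: model only P(Y | X); enough to classify, but it cannot
# generate inputs x, since P(X) is never modeled.
def p_y_given_x(x, y):
    n_x = sum(c for (xi, _), c in joint.items() if xi == x)
    return joint[(x, y)] / n_x

print(p_y_given_x("free", "spam"))   # P(spam | "free") = 0.75
```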
Compulsory Reading List
Main textbooks:
1. Ethem Alpaydin. 2010. Introduction to Machine Learning. Second Edition. MIT Press (free online version).
2. Hal Daumé III. 2012. A Course in Machine Learning (free online version).
3. Ian H. Witten and Eibe Frank. 2005. Data Mining: Practical Machine Learning Tools and Techniques. Second Edition (free online version).
Additional material:
1. Dietterich, T. G. 2000. Ensemble Methods in Machine Learning. In J. Kittler and F. Roli (eds.), First International Workshop on Multiple Classifier Systems.
2. Michael Collins. 2002. Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing.
3. Hanna M. Wallach. 2004. Conditional Random Fields: An Introduction. Technical Report MS-CIS-04-21. Department of Computer and Information Science, University of Pennsylvania.

Optional Reading
- Hal Daumé III, John Langford, and Daniel Marcu. 2005. Search-Based Structured Prediction as Classification. NIPS Workshop on Advances in Structured Learning for Text and Speech Processing (ASLTSP).

Assignments and Examination
Three assignments:
1. Decision trees and nearest neighbors
2. Perceptron learning
3. Clustering
General info:
- No lab sessions; supervision by email.
- Reports for Assignments 1 and 2 must be submitted to [email protected].
- The report for Assignment 3 must be submitted to [email protected].
Examination:
- A written report is submitted for each assignment.
- All three assignments are necessary to pass the course.
- The grade is determined by the majority grade on the assignments.

Practical Organization
- Lecturers: Marina Santini (Lectures 1-7); Magnus Rosell (Lecture 8)
- 45 min lecture + 15 min break
- Lectures are posted on the course webpage and on SlideShare.
- Email all your questions to me: [email protected]
- Video recordings of the previous ML course: http://stp.lingfil.uu.se/~nivre/master/ml.html
- Send me an email to [email protected], so that I can make sure I have all the correct email addresses.

Schedule
http://stp.lingfil.uu.se/~santinim/ml/ml_fall2013.pdf

What is Machine Learning?
Introduction to:
- Classification
- Regression
- Supervised learning
- Unsupervised learning
- Reinforcement learning

What is Machine Learning
- Machine learning is programming computers to optimize a performance criterion for some task, using example data or past experience.
- Why learning?
  - No known exact method: vision, speech recognition, robotics, spam filters, etc.
  - Exact method too expensive: statistical physics
  - Task evolves over time: network routing
- Compare: there is no need to use machine learning for computing payroll... we just need an algorithm.

Machine Learning – Data Mining – Artificial Intelligence – Statistics
- Machine learning: creation of a model that uses training data or past experience.
- Data mining: application of learning methods to large datasets (e.g. physics, astronomy, biology).
- Text mining: machine learning applied to unstructured textual data (e.g. sentiment analysis, social media monitoring; see Text Mining, Wikipedia).
- Artificial intelligence: a model that can adapt to a changing environment.
- Statistics: machine learning uses the theory of statistics in building mathematical models, because the core task is making inference from a sample.

The Bio-Cognitive Analogy
- Imagine a learning algorithm as a single neuron. This neuron receives input from other neurons, one for each input feature; the strengths of these inputs are the feature values.
- Each input has a weight, and the neuron simply sums up all the weighted inputs. Based on this sum, the neuron decides whether to "fire" or not. Firing is interpreted as a positive example, not firing as a negative example.
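A minimal sketch of this picture, with hypothetical weights and feature values chosen only for illustration:

```python
# A single "neuron": sum the weighted inputs and fire above a threshold.
def neuron_fires(features, weights, threshold=0.0):
    """True = positive example (fires), False = negative example."""
    total = sum(w * x for w, x in zip(weights, features))
    return total > threshold

# Two input features, one weight per input.
weights = [0.8, -0.4]
print(neuron_fires([1.0, 0.5], weights))   # 0.8 - 0.2 = 0.6 > 0  -> fires
print(neuron_fires([0.1, 1.0], weights))   # 0.08 - 0.4 = -0.32   -> does not fire
```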
Elements of Machine Learning
1. Generalization:
   - Generalize from specific examples
   - Based on statistical inference
2. Data:
   - Training data: specific examples to learn from
   - Test data: (new) specific examples to assess performance
3. Models:
   - Theoretical assumptions about the task/domain
   - Parameters that can be inferred from data
4. Algorithms:
   - Learning algorithm: infer model (parameters) from data
   - Inference algorithm: infer predictions from model

Types of Machine Learning
- Association
- Supervised learning: classification, regression
- Unsupervised learning
- Reinforcement learning

Learning Associations
- Basket analysis: P(Y | X), the probability that somebody who buys X also buys Y, where X and Y are products/services.
- Example: P(chips | beer) = 0.7

Classification
- Example: credit scoring, differentiating between low-risk and high-risk customers from their income and savings.
- Discriminant: IF income > θ1 AND savings > θ2 THEN low-risk ELSE high-risk

Classification in NLP
- Binary classification:
  - Spam filtering (spam vs. non-spam)
  - Spelling error detection (error vs. non-error)
- Multiclass classification:
  - Text categorization (news, economy, culture, sport, ...)
  - Named entity classification (person, location, organization, ...)
- Structured prediction:
  - Part-of-speech tagging (classes = tag sequences)
  - Syntactic parsing (classes = parse trees)

Regression
- Example: price of a used car
- x: car attributes; y: price
- $y = g(x \mid \theta)$, where $g(\cdot)$ is the model and $\theta$ its parameters
- Linear model: $y = wx + w_0$

Uses of Supervised Learning
- Prediction of future cases: use the rule to predict the output for future inputs.
- Knowledge extraction: the rule is easy to understand.
- Compression: the rule is simpler than the data it explains.
- Outlier detection: exceptions that are not covered by the rule, e.g., fraud.

Unsupervised Learning
- Finding regularities in data; no mapping to outputs
- Clustering: grouping similar instances
- Example applications:
  - Customer segmentation in CRM
  - Image compression: color quantization
  - NLP: unsupervised text categorization

Reinforcement Learning
- Learning a policy = sequence of outputs/actions
- No supervised output, but delayed reward
- Example applications:
  - Game playing
  - Robot in a maze
  - NLP: dialogue systems, for example:
    NJFun: A Reinforcement Learning Spoken Dialogue System (http://acl.ldc.upenn.edu/W/W00/W00-0304.pdf)
    Reinforcement Learning for Spoken Dialogue Systems: Comparing Strengths and Weaknesses for Practical Deployment (http://research.microsoft.com/apps/pubs/default.aspx?id=70295)

Supervised Learning
Introduction to:
- Margin
- Noise
- Bias

Supervised Classification
- Learning the class C of a "family car" from examples
- Prediction: is car x a family car?
- Knowledge extraction: what do people expect from a family car?
- Output (labels): positive (+) and negative (–) examples
- Input representation (features): x1: price, x2: engine power

Training Set
$$\mathcal{X} = \{\mathbf{x}^t, r^t\}_{t=1}^{N}, \qquad r^t = \begin{cases} 1 & \text{if } \mathbf{x}^t \text{ is positive} \\ 0 & \text{if } \mathbf{x}^t \text{ is negative} \end{cases}, \qquad \mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$$

Hypothesis Class H
Axis-aligned rectangles in the price/engine-power plane:
$$h:\ (p_1 \le \text{price} \le p_2) \ \text{AND} \ (e_1 \le \text{engine power} \le e_2)$$

Empirical (Training) Error
$$h(\mathbf{x}) = \begin{cases} 1 & \text{if } h \text{ says } \mathbf{x} \text{ is positive} \\ 0 & \text{if } h \text{ says } \mathbf{x} \text{ is negative} \end{cases}$$
Empirical error of h on X:
$$E(h \mid \mathcal{X}) = \sum_{t=1}^{N} \mathbf{1}\big(h(\mathbf{x}^t) \neq r^t\big)$$
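The following sketch puts the last three slides together for the family-car example: an axis-aligned rectangle hypothesis h and its empirical error on a training set. All numbers are hypothetical.

```python
# Rectangle hypothesis for the family-car example (hypothetical numbers).
def h(x, p1, p2, e1, e2):
    """Positive (1) iff price and engine power both fall in range."""
    price, engine_power = x
    return 1 if (p1 <= price <= p2) and (e1 <= engine_power <= e2) else 0

def empirical_error(X, R, p1, p2, e1, e2):
    """Number of training examples the hypothesis misclassifies."""
    return sum(1 for x, r in zip(X, R) if h(x, p1, p2, e1, e2) != r)

# Training set: (price, engine power) pairs with labels r (1 = family car).
X = [(15000, 150), (22000, 180), (9000, 90), (45000, 300)]
R = [1, 1, 0, 0]
print(empirical_error(X, R, p1=12000, p2=30000, e1=120, e2=220))  # -> 0
```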
S, G, and the Version Space
- S: the most specific hypothesis
- G: the most general hypothesis
- Any $h \in \mathcal{H}$ between S and G is consistent, i.e. $E(h \mid \mathcal{X}) = 0$; together these hypotheses make up the version space.

Margin
- Choose the h with the largest margin, i.e. the largest distance between the decision boundary and the instances closest to it.

Noise
- Unwanted anomalies in the data:
  - Imprecision in input attributes
  - Errors in labeling data points
  - Hidden attributes (relative to H)
- Consequence: no h in H may be consistent!

Noise and Model Complexity
Arguments for the simpler model (Occam's razor principle):
1. Easier to make predictions
2. Easier to train (fewer parameters)
3. Easier to understand
4. Generalizes better (if the data is noisy)

Inductive Bias
- Learning is an ill-posed problem: training data is never sufficient to find a unique solution; there are always infinitely many consistent hypotheses.
- We need an inductive bias: assumptions that entail a unique h for a training set X.
  1. Hypothesis class H, e.g. axis-aligned rectangles
  2. Learning algorithm, e.g. find a consistent hypothesis with max margin
  3. Hyperparameters, e.g. the trade-off between training error and margin

Model Selection and Generalization
- Generalization: how well a model performs on new data
- Overfitting: H more complex than C
- Underfitting: H less complex than C

Triple Trade-Off
Trade-off between three factors:
1. Complexity of H, c(H)
2. Training set size, N
3. Generalization error, E, on new data
Dependencies:
- As N increases, E decreases.
- As c(H) increases, E first decreases and then increases.

Model Selection
- To estimate generalization error, we need data unseen during training:
$$\hat{E} = E(h \mid \mathcal{V}) = \sum_{t=1}^{M} \mathbf{1}\big(h(\mathbf{x}^t) \neq r^t\big), \qquad \mathcal{V} = \{\mathbf{x}^t, r^t\}_{t=1}^{M}$$
- Given models (hypotheses) $h_1, \ldots, h_k$ induced from the training set $\mathcal{X}$, we can use $E(h_i \mid \mathcal{V})$ to select the model $h_i$ with the smallest generalization error.

Model Assessment
- To estimate the generalization error of the best model $h_i$, we need data unseen during training and model selection.
- Standard setup:
  1. Training set X (50–80%)
  2. Validation (development) set V (10–25%)
  3. Test (publication) set T (10–25%)
- Note:
  - Validation data can be added to the training set before testing.
  - Resampling methods can be used if data is limited.

Cross-Validation
K-fold cross-validation: divide $\mathcal{X}$ into K parts $\mathcal{X}_1, \ldots, \mathcal{X}_K$ and use each part in turn as the validation set (a small code sketch of this procedure follows the Anatomy slide below):
$$\mathcal{V}_1 = \mathcal{X}_1, \qquad \mathcal{T}_1 = \mathcal{X}_2 \cup \mathcal{X}_3 \cup \cdots \cup \mathcal{X}_K$$
$$\mathcal{V}_2 = \mathcal{X}_2, \qquad \mathcal{T}_2 = \mathcal{X}_1 \cup \mathcal{X}_3 \cup \cdots \cup \mathcal{X}_K$$
$$\vdots$$
$$\mathcal{V}_K = \mathcal{X}_K, \qquad \mathcal{T}_K = \mathcal{X}_1 \cup \mathcal{X}_2 \cup \cdots \cup \mathcal{X}_{K-1}$$
Note:
- The generalization error is estimated by the mean across the K folds.
- Training sets for different folds share K–2 parts.
- A separate test set must be maintained for model assessment.

Bootstrapping
- Generate new training sets of size N from X by random sampling with replacement.
- Use the original training set as validation set (V = X).
- The probability that a given instance is not picked after N draws is
$$\left(1 - \frac{1}{N}\right)^{N} \approx e^{-1} \approx 0.368$$
that is, only 36.8% of the instances are new!

Measuring Error
- Error rate = # of errors / # of instances = (FP + FN) / N
- Accuracy = # of correct / # of instances = (TP + TN) / N
- Recall = # of found positives / # of positives = TP / (TP + FN)
- Precision = # of found positives / # of found = TP / (TP + FP)

Statistical Inference
- Interval estimation, to quantify the precision of our measurements:
$$m \pm 1.96 \frac{\sigma}{\sqrt{N}}$$
- Hypothesis testing, to assess whether differences between models are statistically significant (McNemar's test, where $e_{01}$ and $e_{10}$ count the examples misclassified by one model but not the other):
$$\frac{(|e_{01} - e_{10}| - 1)^2}{e_{01} + e_{10}} \sim \chi^2_1$$

Supervised Learning – Summary
- Training data + learner → hypothesis
- Test data + hypothesis → estimated generalization
- The learner incorporates an inductive bias.
- Test data must be unseen.
- Next lectures: different learners used in LT, with different inductive biases.

Anatomy of a Supervised Learner
(Dimensions of a supervised machine learning algorithm)
- Model: $g(\mathbf{x} \mid \theta)$
- Loss function: $E(\theta \mid \mathcal{X}) = \sum_t L\big(r^t, g(\mathbf{x}^t \mid \theta)\big)$
- Optimization procedure: $\theta^* = \arg\min_{\theta} E(\theta \mid \mathcal{X})$
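As an illustration of these three dimensions, here is a sketch that fits the linear model $y = wx + w_0$ from the Regression slide by minimizing the squared loss. The data are hypothetical, and the closed-form least-squares solution stands in for a generic optimization procedure.

```python
# Model, loss, and optimization for g(x | theta) = w*x + w0 (toy data).
X = [1.0, 2.0, 3.0, 4.0]          # car attribute (e.g. age)
R = [9.0, 7.1, 5.0, 2.9]          # observed price

def g(x, w, w0):                   # model g(x | theta), theta = (w, w0)
    return w * x + w0

def E(w, w0):                      # loss: sum of squared errors over X
    return sum((r - g(x, w, w0)) ** 2 for x, r in zip(X, R))

# Optimization: theta* = argmin E(theta | X); for squared loss the
# least-squares solution is available in closed form.
n = len(X)
mx, mr = sum(X) / n, sum(R) / n
w = sum((x - mx) * (r - mr) for x, r in zip(X, R)) / sum((x - mx) ** 2 for x in X)
w0 = mr - w * mx
print(w, w0, E(w, w0))             # fitted parameters and training loss
```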
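And here is the small K-fold cross-validation sketch promised above: it splits the data into K folds and averages the validation errors across folds. The `train` and `evaluate` functions are hypothetical stand-ins for a real learner and error measure.

```python
# K-fold cross-validation: V_i = X_i, T_i = union of the remaining folds.
def k_fold_cv(data, K, train, evaluate):
    """Estimate generalization error as the mean error across K folds."""
    folds = [data[i::K] for i in range(K)]
    errors = []
    for i in range(K):
        validation = folds[i]
        training = [x for j, f in enumerate(folds) if j != i for x in f]
        model = train(training)
        errors.append(evaluate(model, validation))
    return sum(errors) / K

# Dummy example: "model" = mean of training values, error = squared deviation.
data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
train = lambda tr: sum(tr) / len(tr)
evaluate = lambda m, val: sum((v - m) ** 2 for v in val) / len(val)
print(k_fold_cv(data, K=3, train=train, evaluate=evaluate))
```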
Reading
- Alpaydin (2010): Ch. 1-2; Ch. 19 (mathematical underpinnings)
- Witten and Frank (2005): Ch. 1 (examples and domains of application)

End of Lecture 1
Thanks for your attention