Machine Learning for Language Technology
Lecture 1: Introduction
Uppsala University, Department of Linguistics and Philology
September 2013 (last updated: 2 Sept 2013)
Most slides are adapted from previous courses (i.e. from E. Alpaydin and J. Nivre).

Practical Information
Reading list, assignments, exams, etc.

Course web pages:
• http://stp.lingfil.uu.se/~santinim/ml/ml_fall2013.pdf
• http://stp.lingfil.uu.se/~santinim/ml/MachineLearning_fall2013.htm

Contact details:
• [email protected] ([email protected])

About the Course

Introduction to machine learning, with a focus on methods used in Language Technology and NLP:
• Decision trees and nearest neighbor methods (Lecture 2)
• Linear models – the Weka ML package (Lectures 3 and 4)
• Ensemble methods – structured prediction (Lectures 5 and 6)
• Text mining and big data – R and RapidMiner (Lecture 7)
• Unsupervised learning (clustering) (Lecture 8, Magnus Rosell)

Builds on Statistical Methods in NLP:
• Mostly discriminative methods
• Generative probability models covered in the first course

Digression: Generative vs. Discriminative Methods

• A generative method applies only to probabilistic models. A model is generative if it models the joint distribution of X and Y together, P(Y, X). It is called generative because it can be used to generate data points with the correct probability distribution.
• Conditional methods model the conditional distribution of the output given the input: P(Y | X).
• Discriminative methods do not model probabilities at all, but map the input to the output directly.
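
As a brief aside (not on the original slide), the two probabilistic views are linked by the product rule: a model of the joint distribution also determines the conditional distribution used for prediction.

$P(Y, X) = P(Y \mid X)\,P(X) \quad\Longrightarrow\quad P(Y \mid X) = \frac{P(Y, X)}{P(X)} = \frac{P(X \mid Y)\,P(Y)}{\sum_{y} P(X \mid y)\,P(y)}$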

Compulsory Reading List

Main textbooks:
1. Ethem Alpaydin. 2010. Introduction to Machine Learning. Second Edition. MIT Press (free online version)
2. Hal Daumé III. 2012. A Course in Machine Learning (free online version)
3. Ian H. Witten and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques. Second Edition (free online version)

Additional material:
1. Dietterich, T. G. (2000). Ensemble Methods in Machine Learning. In J. Kittler and F. Roli (eds.), First International Workshop on Multiple Classifier Systems.
2. Michael Collins. 2002. Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing.
3. Hanna M. Wallach. 2004. Conditional Random Fields: An Introduction. Technical Report MS-CIS-04-21. Department of Computer and Information Science, University of Pennsylvania.

Optional Reading

Hal Daumé III, John Langford, and Daniel Marcu (2005). Search-Based Structured Prediction as Classification. NIPS Workshop on Advances in Structured Learning for Text and Speech Processing (ASLTSP).

Assignments and Examination

Three assignments:
• Decision trees and nearest neighbors
• Perceptron learning
• Clustering

General info:
• No lab sessions; supervision by email
• Reports for Assignments 1 & 2 must be submitted to [email protected]
• The report for Assignment 3 must be submitted to [email protected]

Examination:
• A written report is submitted for each assignment
• All three assignments are necessary to pass the course
• The grade is determined by the majority grade on the assignments

Practical Organization

• Lecturers: Marina Santini (Lectures 1-7); Magnus Rosell (Lecture 8)
• Format: 45 min lecture + 15 min break
• Lectures are posted on the course webpage and SlideShare
• Email all your questions to me: [email protected]
• Video recordings of the previous ML course: http://stp.lingfil.uu.se/~nivre/master/ml.html
• Send me an email at [email protected], so that I can make sure I have all the correct email addresses

Schedule: http://stp.lingfil.uu.se/~santinim/ml/ml_fall2013.pdf

What is Machine Learning?

Introduction to:
• Classification
• Regression
• Supervised Learning
• Unsupervised Learning
• Reinforcement Learning

What is Machine Learning

Machine learning is programming computers to optimize a performance criterion for some task, using example data or past experience.

Why learning?
• No known exact method – vision, speech recognition, robotics, spam filters, etc.
• Exact method too expensive – statistical physics
• Task evolves over time – network routing

Compare: there is no need to use machine learning for computing payroll... we just need an algorithm.

Machine Learning – Data Mining – Artificial Intelligence – Statistics

• Machine learning: creation of a model that uses training data or past experience.
• Data mining: application of learning methods to large datasets (e.g. physics, astronomy, biology, etc.)
  • Text mining = machine learning applied to unstructured textual data (e.g. sentiment analysis, social media monitoring, etc.; see Text Mining, Wikipedia)
• Artificial intelligence: a model that can adapt to a changing environment.
• Statistics: machine learning uses the theory of statistics in building mathematical models, because the core task is making inference from a sample.

The bio-cognitive analogy

• Imagine a learning algorithm as a single neuron.
• This neuron receives input from other neurons, one for each input feature.
• The strengths of these inputs are the feature values.
• Each input has a weight, and the neuron simply sums up all the weighted inputs.
• Based on this sum, the neuron decides whether to "fire" or not. Firing is interpreted as a positive example and not firing as a negative example.
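
To make the analogy concrete, here is a minimal sketch of such a thresholded neuron (not from the slides); the feature values, weights, and threshold are invented for illustration:

# Minimal sketch of the single-neuron analogy described above.
# Weights, feature values, and the threshold are illustrative assumptions.

def neuron_fires(features, weights, threshold=0.0):
    """Return True (positive example) if the weighted sum of the inputs
    exceeds the threshold, and False (negative example) otherwise."""
    weighted_sum = sum(w * x for w, x in zip(weights, features))
    return weighted_sum > threshold

# Example: two input features with hand-picked weights.
print(neuron_fires(features=[0.5, 1.2], weights=[0.8, -0.3]))  # True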

Elements of Machine Learning

1. Generalization:
   • Generalize from specific examples
   • Based on statistical inference
2. Data:
   • Training data: specific examples to learn from
   • Test data: (new) specific examples to assess performance
3. Models:
   • Theoretical assumptions about the task/domain
   • Parameters that can be inferred from data
4. Algorithms:
   • Learning algorithm: infer model (parameters) from data
   • Inference algorithm: infer predictions from model

Types of Machine Learning

• Association
• Supervised Learning
  • Classification
  • Regression
• Unsupervised Learning
• Reinforcement Learning

Learning Associations

Basket analysis:
P(Y | X): the probability that somebody who buys X also buys Y, where X and Y are products/services.
Example: P(chips | beer) = 0.7
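
As an illustration, the conditional probability can be estimated from co-occurrence counts over market baskets; the transactions below are invented, not from the slides:

# Sketch: estimating P(Y | X) from a list of market baskets.
# The baskets below are invented purely for illustration.

baskets = [
    {"beer", "chips"},
    {"beer", "chips", "salsa"},
    {"beer", "bread"},
    {"milk", "bread"},
]

def conditional_probability(baskets, x, y):
    """Estimate P(y | x) as count(baskets with x and y) / count(baskets with x)."""
    with_x = [b for b in baskets if x in b]
    if not with_x:
        return 0.0
    return sum(1 for b in with_x if y in b) / len(with_x)

print(conditional_probability(baskets, "beer", "chips"))  # 2/3 ≈ 0.67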

Classification

Example: credit scoring
Differentiating between low-risk and high-risk customers from their income and savings.

Discriminant: IF income > θ1 AND savings > θ2 THEN low-risk ELSE high-risk
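
The discriminant rule translates directly into code; the thresholds below are placeholder values, since the slide leaves θ1 and θ2 unspecified:

# Sketch of the credit-scoring discriminant from the slide.
# THETA1 and THETA2 are placeholder thresholds (not given in the slides).
THETA1 = 30_000   # income threshold
THETA2 = 10_000   # savings threshold

def credit_risk(income, savings):
    """Return 'low-risk' if both thresholds are exceeded, otherwise 'high-risk'."""
    if income > THETA1 and savings > THETA2:
        return "low-risk"
    return "high-risk"

print(credit_risk(income=45_000, savings=12_000))  # low-risk
print(credit_risk(income=25_000, savings=15_000))  # high-risk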

Classification in NLP

• Binary classification:
  • Spam filtering (spam vs. non-spam)
  • Spelling error detection (error vs. non-error)
• Multiclass classification:
  • Text categorization (news, economy, culture, sport, ...)
  • Named entity classification (person, location, organization, ...)
• Structured prediction:
  • Part-of-speech tagging (classes = tag sequences)
  • Syntactic parsing (classes = parse trees)

Regression

Example: price of a used car
• x: car attributes
• y: price
• y = g(x | θ), where g() is the model and θ its parameters
• Linear model: y = wx + w0
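
A minimal sketch of fitting the linear model y = wx + w0 by ordinary least squares; the (age, price) data points are invented for illustration:

# Sketch: fitting y = w*x + w0 by ordinary least squares.
# The (x, y) pairs below are invented car (age, price) data.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 5.0, 8.0])        # car age in years (invented)
y = np.array([18.0, 16.5, 14.0, 11.0, 6.5])    # price in thousands (invented)

# Design matrix with a bias column, so the parameters are [w, w0].
X = np.column_stack([x, np.ones_like(x)])
w, w0 = np.linalg.lstsq(X, y, rcond=None)[0]

print(f"y ≈ {w:.2f} * x + {w0:.2f}")
print("predicted price of a 4-year-old car:", w * 4 + w0)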

Uses of Supervised Learning

• Prediction of future cases: use the rule to predict the output for future inputs
• Knowledge extraction: the rule is easy to understand
• Compression: the rule is simpler than the data it explains
• Outlier detection: exceptions that are not covered by the rule, e.g., fraud

Unsupervised Learning

• Finding regularities in data
• No mapping to outputs
• Clustering: grouping similar instances
• Example applications:
  • Customer segmentation in CRM
  • Image compression: color quantization
  • NLP: unsupervised text categorization
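
As a concrete (if simplified) illustration of clustering, here is a sketch of a k-means-style loop over one-dimensional points; k-means is only one possible method and is not prescribed by the slide, and the data points are invented:

# Sketch: simple k-means clustering of 1-D points (invented data).
# k-means is just one clustering method; the slide does not name it.
import random

points = [1.0, 1.2, 0.8, 8.0, 8.3, 7.9, 15.0, 15.5]
k = 3
centroids = random.sample(points, k)

for _ in range(10):  # a few refinement iterations
    # Assignment step: attach each point to its nearest centroid.
    clusters = {i: [] for i in range(k)}
    for p in points:
        nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
        clusters[nearest].append(p)
    # Update step: move each centroid to the mean of its cluster.
    for i, members in clusters.items():
        if members:
            centroids[i] = sum(members) / len(members)

print(sorted(round(c, 2) for c in centroids))  # e.g. ≈ [1.0, 8.07, 15.25]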

Reinforcement Learning

• Learning a policy = a sequence of outputs/actions
• No supervised output, but delayed reward
• Example applications:
  • Game playing
  • Robot in a maze
  • NLP: dialogue systems, for example:
    • NJFun: A Reinforcement Learning Spoken Dialogue System (http://acl.ldc.upenn.edu/W/W00/W00-0304.pdf)
    • Reinforcement Learning for Spoken Dialogue Systems: Comparing Strengths and Weaknesses for Practical Deployment (http://research.microsoft.com/apps/pubs/default.aspx?id=70295)

Supervised Learning

Introduction to:
• Margin
• Noise
• Bias

Supervised Classification

Learning the class C of a "family car" from examples:
• Prediction: Is car x a family car?
• Knowledge extraction: What do people expect from a family car?

Output (labels): positive (+) and negative (–) examples
Input representation (features): x1: price, x2: engine power

Training set X

$\mathcal{X} = \{ x^t, r^t \}_{t=1}^{N}$

$r = \begin{cases} 1 & \text{if } x \text{ is positive} \\ 0 & \text{if } x \text{ is negative} \end{cases}$

$x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$

Hypothesis class H

$h:\ (p_1 \le \text{price} \le p_2) \ \text{AND}\ (e_1 \le \text{engine power} \le e_2)$

Empirical (training) error

$h(x) = \begin{cases} 1 & \text{if } h \text{ says } x \text{ is positive} \\ 0 & \text{if } h \text{ says } x \text{ is negative} \end{cases}$

Empirical error of h on X:

$E(h \mid \mathcal{X}) = \sum_{t=1}^{N} \mathbf{1}\big( h(x^t) \ne r^t \big)$
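
To make the rectangle hypothesis and its empirical error concrete, here is a minimal sketch; the rectangle bounds and the labelled examples are invented, since the slides leave them unspecified:

# Sketch: empirical (training) error E(h | X) as a count of misclassifications.
# The hypothesis bounds and the labelled data below are invented for illustration.

def h(price, engine_power):
    """Placeholder axis-aligned rectangle hypothesis (bounds are made up)."""
    return int(10_000 <= price <= 20_000 and 60 <= engine_power <= 120)

training_set = [
    ((15_000, 90), 1),    # (x^t, r^t): features (price, engine power) and label
    ((12_000, 70), 1),
    ((30_000, 200), 0),
    ((18_000, 150), 1),   # misclassified by this particular h
]

def empirical_error(h, data):
    """E(h | X) = number of training examples with h(x^t) != r^t."""
    return sum(1 for x, r in data if h(*x) != r)

print(empirical_error(h, training_set))  # 1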

S, G, and the Version Space

• Most specific hypothesis, S
• Most general hypothesis, G
• Any h ∈ H between S and G is consistent [E(h | X) = 0]; together these hypotheses make up the version space.

Margin

• Choose the h with the largest margin

Noise

Unwanted anomaly in the data:
• Imprecision in input attributes
• Errors in labeling data points
• Hidden attributes (relative to H)

Consequence: no h in H may be consistent!

Noise and Model Complexity

Arguments for the simpler model (Occam's razor principle):
1. Easier to make predictions
2. Easier to train (fewer parameters)
3. Easier to understand
4. Generalizes better (if the data is noisy)

Inductive Bias

• Learning is an ill-posed problem:
  • Training data is never sufficient to find a unique solution
  • There are always infinitely many consistent hypotheses
• We need an inductive bias: assumptions that entail a unique h for a training set X
  1. Hypothesis class H – axis-aligned rectangles
  2. Learning algorithm – find a consistent hypothesis with max margin
  3. Hyperparameters – trade-off between training error and margin

Model Selection and Generalization

• Generalization – how well a model performs on new data
  • Overfitting: H more complex than C
  • Underfitting: H less complex than C

Triple Trade-Off

Trade-off between three factors:
1. Complexity of H, c(H)
2. Training set size, N
3. Generalization error, E, on new data

Dependencies:
• As N increases, E decreases
• As c(H) increases, E first decreases and then increases

Model Selection and Generalization Error

To estimate the generalization error, we need data unseen during training:

$\hat{E} = E(h \mid \mathcal{V}) = \sum_{t=1}^{M} \mathbf{1}\big( h(x^t) \ne r^t \big)$

$\mathcal{V} = \{ x^t, r^t \}_{t=1}^{M} \subset \mathcal{X}$

Given models (hypotheses) h1, ..., hk induced from the training set X, we can use E(hi | V) to select the model hi with the smallest generalization error.

Model Assessment

• To estimate the generalization error of the best model hi, we need data unseen during training and model selection
• Standard setup:
  1. Training set X (50–80%)
  2. Validation (development) set V (10–25%)
  3. Test (publication) set T (10–25%)
• Note:
  • Validation data can be added to the training set before testing
  • Resampling methods can be used if data is limited

Cross-Validation

K-fold cross-validation: divide X into X1, ..., XK

$\mathcal{V}_1 = \mathcal{X}_1, \quad \mathcal{T}_1 = \mathcal{X}_2 \cup \mathcal{X}_3 \cup \dots \cup \mathcal{X}_K$
$\mathcal{V}_2 = \mathcal{X}_2, \quad \mathcal{T}_2 = \mathcal{X}_1 \cup \mathcal{X}_3 \cup \dots \cup \mathcal{X}_K$
$\dots$
$\mathcal{V}_K = \mathcal{X}_K, \quad \mathcal{T}_K = \mathcal{X}_1 \cup \mathcal{X}_2 \cup \dots \cup \mathcal{X}_{K-1}$

Note:
• The generalization error is estimated as the mean over the K folds
• Training sets for different folds share K–2 parts
• A separate test set must be maintained for model assessment
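
A minimal sketch of constructing the K validation/training splits by hand; the dataset here is just a list of invented example indices:

# Sketch: building the K-fold splits described on the slide.
# X is a list of example indices; real data would be (x, r) pairs.
K = 5
X = list(range(20))                    # invented dataset of 20 examples
folds = [X[i::K] for i in range(K)]    # K roughly equal parts X_1, ..., X_K

for i in range(K):
    V_i = folds[i]                                                       # validation fold
    T_i = [x for j, fold in enumerate(folds) if j != i for x in fold]    # training parts
    print(f"fold {i + 1}: |V| = {len(V_i)}, |T| = {len(T_i)}")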

Bootstrapping

• Generate new training sets of size N from X by random sampling with replacement
• Use the original training set as the validation set (V = X)
• Probability that we do not pick a given instance after N draws:

$\left( 1 - \frac{1}{N} \right)^{N} \approx e^{-1} \approx 0.368$

that is, only 36.8% of the instances are new!
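
A quick numerical check of the 36.8% figure; the dataset size N and the number of trials are arbitrary:

# Sketch: empirically checking that about 36.8% of instances are never drawn
# when sampling N times with replacement. N and the trial count are arbitrary.
import random

N, trials = 1000, 100
missed_fractions = []
for _ in range(trials):
    drawn = {random.randrange(N) for _ in range(N)}   # indices drawn with replacement
    missed_fractions.append(1 - len(drawn) / N)       # fraction never picked

print(sum(missed_fractions) / trials)  # ≈ 0.368, matching (1 - 1/N)^N ≈ e^-1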

Measuring Error

• Error rate = # of errors / # of instances = (FP + FN) / N
• Accuracy = # of correct / # of instances = (TP + TN) / N
• Recall = # of found positives / # of positives = TP / (TP + FN)
• Precision = # of found positives / # of found = TP / (TP + FP)
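
These four measures follow directly from the confusion-matrix counts; a small sketch with invented counts:

# Sketch: error rate, accuracy, recall, and precision from confusion-matrix counts.
# The TP/FP/TN/FN counts below are invented for illustration.
TP, FP, TN, FN = 40, 10, 45, 5
N = TP + FP + TN + FN

error_rate = (FP + FN) / N
accuracy   = (TP + TN) / N
recall     = TP / (TP + FN)
precision  = TP / (TP + FP)

print(error_rate, accuracy, recall, precision)  # 0.15 0.85 0.888... 0.8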

Statistical Inference

• Interval estimation to quantify the precision of our measurements:

$m \pm 1.96 \frac{\sigma}{\sqrt{N}}$

• Hypothesis testing to assess whether differences between models are statistically significant (McNemar's test):

$\frac{\left( |e_{01} - e_{10}| - 1 \right)^2}{e_{01} + e_{10}} \sim \chi^2_1$
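
A small sketch computing both quantities; the per-fold accuracies and the disagreement counts are invented:

# Sketch: 95% confidence interval for an accuracy estimate, and the
# McNemar statistic from the slide. All numbers below are invented.
import math

# Interval estimation: m +/- 1.96 * sigma / sqrt(N)
accuracies = [0.81, 0.84, 0.79, 0.83, 0.82]        # e.g. per-fold accuracies
N = len(accuracies)
m = sum(accuracies) / N
sigma = math.sqrt(sum((a - m) ** 2 for a in accuracies) / (N - 1))
half_width = 1.96 * sigma / math.sqrt(N)
print(f"accuracy = {m:.3f} +/- {half_width:.3f}")

# Hypothesis testing: (|e01 - e10| - 1)^2 / (e01 + e10), compared to chi^2 with 1 dof
e01, e10 = 25, 12    # examples misclassified by one model but not the other
statistic = (abs(e01 - e10) - 1) ** 2 / (e01 + e10)
print(f"McNemar statistic = {statistic:.2f}")  # > 3.84 means significant at the 0.05 level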

Supervised Learning – Summary

• Training data + learner → hypothesis
  • The learner incorporates an inductive bias
• Test data + hypothesis → estimated generalization
  • Test data must be unseen
• Next lectures: different learners in LT with different inductive biases

Anatomy of a Supervised Learner
(Dimensions of a supervised machine learning algorithm)

• Model: $g(x \mid \theta)$

• Loss function: $E(\theta \mid \mathcal{X}) = \sum_{t} L\big( r^t, g(x^t \mid \theta) \big)$

• Optimization procedure: $\theta^* = \arg\min_{\theta} E(\theta \mid \mathcal{X})$
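
To tie the three components together, here is a sketch that instantiates them for one simple case: a linear model g(x | θ) = w·x + w0, squared loss, and plain gradient descent as the optimization procedure; the data, learning rate, and iteration count are invented:

# Sketch: the three components of a supervised learner, instantiated for a
# linear model with squared loss, optimized by gradient descent.
# Data, learning rate, and iteration count are invented for illustration.

data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 8.1)]  # (x^t, r^t) pairs

def g(x, theta):
    """Model: a straight line with parameters theta = (w, w0)."""
    w, w0 = theta
    return w * x + w0

def loss(theta):
    """Loss function: E(theta | X) = sum_t (r^t - g(x^t | theta))^2."""
    return sum((r - g(x, theta)) ** 2 for x, r in data)

# Optimization procedure: gradient descent on E(theta | X).
w, w0 = 0.0, 0.0
lr = 0.01
for _ in range(2000):
    grad_w = sum(-2 * (r - g(x, (w, w0))) * x for x, r in data)
    grad_w0 = sum(-2 * (r - g(x, (w, w0))) for x, r in data)
    w, w0 = w - lr * grad_w, w0 - lr * grad_w0

print(f"theta* ≈ (w = {w:.2f}, w0 = {w0:.2f}), loss = {loss((w, w0)):.3f}")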

Reading

• Alpaydin (2010): Ch. 1-2; 19 (mathematical underpinnings)
• Witten and Frank (2005): Ch. 1 (examples and domains of application)

End of Lecture 1
Thanks for your attention!