Foundations of Data Mining
Session 1: Introduction
Joaquin Vanschoren
Mykola Pechenizkiy
Eindhoven University of Technology
Anne Driemel
Course Overview
Course Details I
Lecturers
• Joaquin Vanschoren ([email protected]) MF 7.104a
• Mykola Pechenizkiy ([email protected]) MF 7.099
• Anne Driemel ([email protected]) MF 7.073
Contact hours
• Mondays, 9:30 - 10:30: Questions and Answers (PAV J17)
• Mondays, 10:45 - 12:30: Plenary Lectures (PAV J17)
• Thursdays, 13:45 - 15:30: Plenary Lectures (AUD 16)
More info:
• http://www.win.tue.nl/~jvanscho/#!datamining
Course Details II
Materials: all lecture slides, assignments posted to Canvas
• http://canvas.win.tue.nl/courses/5
• No set course book, but recommended reading for each lecture
Evaluation
• No exam, only assignments (3 problem sets)
• Problem set 1: available Feb 1
• 3 parts: deadlines Feb 18, Feb 25, Mar 3 (noon)
• Problem set 2: available Feb 29
• Problem set 3: available Mar 21
• Required knowledge is discussed in lectures
• Work in teams of 2 students, rotated between problem sets
• Passing grade: 6/10 over all assignments
Learning objectives
• Understand how data mining algorithms work
• Reason about when and how to use them, and apply them
successfully in practice
• Understand the mathematical/statistical foundations of data
mining techniques
• Run practical experiments to experience first-hand how data
mining algorithms behave on real data.
• Explore how algorithm parameters and data properties affect
the effectiveness of predictive models
• Recognize and formulate data analysis problems
Topics I
• Similarities and Distances
• Clustering
• Dimensionality Reduction
• Using Machine Learning software (R, Python)
• Rules and decision trees
• Evaluation and optimization
• Instance-based Learning
• Kernel methods
• Ensemble Learning
• Neural Networks
• Bayesian Learning
Topics II
Not covered:
• Reinforcement Learning
• Genetic Algorithms (only briefly)
• Handling complex input data
• Text, Images, Sensors,...
• We study the core algorithms, not how the data is represented
• Itemset Mining
• Association Rule Mining
• Outlier detection
Lecture Overview
• Prologue: What is machine learning?
• How can we learn? The problem of induction.
• Simple strategies
• No free lunch theorem
• Overfitting
• The 5 ‘tribes’ of machine learning
• Symbolic Learning: Express, manipulate symbolic knowledge
• Neural Networks (Connectionism): Mimic the human brain
• Evolution: Simulate the evolutionary process
• Probabilistic (Bayesian) Inference: Reduce uncertainties by
incorporating new evidence
• Learning by Analogy: Recognize similarities between old and
new situations
Prologue
Machine Learning is changing our world
• Search engines learn what you want
• Recommenders learn your taste in books, music, movies,...
• Algorithms do automatic stock trading
• Elections are won by understanding voters
• Google Translate learns how to translate text
• Siri learns to understand speech
• DeepMind beats humans at Go
• Cars drive themselves
• Medicines are developed faster
• Smartwatches monitor your health
• Data-driven discoveries are made in Physics, Biology,
Genetics, Astronomy, Chemistry, Neurology,...
What’s in a name?
• Many names: data mining, data science, machine learning,
statistical learning,...
• Subtle differences in scope, partly marketing
• We’ll mostly use the term ’machine learning’
• Has deep roots in statistics, neurology, biology, psychology,...
but has developed into a new field of study.
• How is it different from statistics?
• Breiman. Statistical Modeling: The Two Cultures. Statistical
Science, 2001
Two cultures: Classical statistics
[Diagram: x → System → y]
Assume that one can sufficiently describe the unknown system by a
stochastic model with a given parametric function class for f():
y = f(x, θ) + ε
Assuming that the model is correct, we can do hypothesis tests,
variance analysis, model comparison, confidence intervals,...
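As a minimal sketch of this parametric view (illustrative code, not from the slides): assume f is linear in a single feature and estimate θ by ordinary least squares in closed form.

```python
# Classical-statistics view: assume y = f(x, theta) + eps with f linear,
# then estimate theta = (intercept, slope) by ordinary least squares.
def fit_line(xs, ys):
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return intercept, slope

# Data generated from y = 1 + 2x (no noise, so the check is exact).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
b0, b1 = fit_line(xs, ys)
print(b0, b1)  # 1.0 2.0
```

Given the model assumption, classical theory then supplies confidence intervals and hypothesis tests for θ.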
Two cultures: machine learning I
[Diagram: x → System → y, where the system is approximated by a
neural network, tree, ...]
The system is treated as unknown; explicit modelling is not
even attempted. The essential measure of goodness is prediction quality.
Two cultures: machine learning II
• In the statistical literature, many articles start with “Assume that
the data are generated by the following model . . . ”.
• Advantages: the data model can be interpreted if it is simple
enough, and there is good theory for model diagnostics
• Disadvantages (Breiman 2001): “irrelevant” theory, many
interesting problems are not considered
• In machine learning, algorithms are motivated by examples from
nature, intuitive behaviour, or computational attractiveness.
• Advantages: many more types of models are available, quicker
implementation of new ideas.
• Disadvantages: models are mostly hard to interpret (“black box”)
• Bluntly: statistics tries to help humans understand the system,
machine learning tries to replace human faculties.
• Today, both fields are learning a lot from each other.
Machine learning tasks
• Supervised Learning
• Learn the relationship between “input” x and “output” y:
search for a function f such that y ≈ f(x)
• There is training data with labels available
Regression: y is metric variable (with values in R)
Classification: y is categorical variable (unordered, discrete).
• Semi-supervised learning: also uses available unlabeled data,
e.g. assumes that similar inputs have similar outputs.
• Unsupervised Learning
• There exist no outputs, search for patterns within the inputs x
Clustering: find groups of similar items
Dimensionality reduction: describe data in fewer features
Outlier detection: what is out of the ordinary?
Association rules: which things often happen together?
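To make the supervised/unsupervised contrast concrete, here is a minimal unsupervised sketch: a 1-D k-means clustering with k = 2 that finds two groups using the inputs alone, with no labels y involved (illustrative code, not part of the course material).

```python
# Unsupervised learning sketch: 1-D k-means with k=2. Only the inputs
# xs are used; there is no output y to predict.
def kmeans_1d(xs, iters=10):
    c = [min(xs), max(xs)]                       # initial cluster centres
    for _ in range(iters):
        groups = ([], [])
        for x in xs:
            # assign x to the closer of the two centres
            groups[abs(x - c[0]) > abs(x - c[1])].append(x)
        c = [sum(g) / len(g) if g else c[i] for i, g in enumerate(groups)]
    return c

xs = [1.0, 1.2, 0.8, 9.0, 9.3, 8.7]
print([round(v, 2) for v in kmeans_1d(xs)])  # [1.0, 9.0]
```

The two recovered centres correspond to the two obvious groups in the data; a supervised learner would instead need a labelled y for every x.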
How can we learn?
To date or not to date?

Nr   Day of Week   Type of Date   Weather   TV Tonight   Date?
1    Weekday       Dinner         Warm      Bad          No
2    Weekend       Club           Warm      Bad          Yes
3    Weekend       Club           Warm      Bad          Yes
4    Weekend       Club           Cold      Good         No
Now  Weekend       Club           Cold      Bad          ?
Some terminology
Rows: Instances, Examples (labelled/unlabelled)
Columns: Factors, Features, Attributes
Last column: Target feature
First column: Identifier (Never give this to the learner!)
Other columns: Predictive features
To date or not to date? (same table as before)
• Is there one factor that perfectly predicts the answer?
To date or not to date? (same table as before)
• Is there one factor that perfectly predicts the answer?
• What about a conjunction of factors?
To date or not to date? (same table as before)
• Is there one factor that perfectly predicts the answer?
• What about a conjunction of factors?
• Warm & Weekend → No date today :(
• Club & TV Bad → Date :)
To date or not to date? (same table as before)
• Is there one factor that perfectly predicts the answer?
• What about a conjunction of factors?
• Warm & Weekend → No date today :(
• Club & TV Bad → Date :)
• Club & Warm → No date...
• Weekend & TV Bad → Date...
• There’s no way to know!
Hume’s problem of induction (David Hume, 1748)
How can we ever be justified in generalizing from what we’ve seen
to what we haven’t?
• You have no basis to pick one generalization over the other
• Big data (Casanova-approach) won’t help: answer may
depend on a factor you didn’t consider
• What if the answer is just random?
Hume’s problem of induction
What if we just assume that the future will be like the past?
• Risky assumption (e.g. inductivist turkey)
• Still the best we can do (and generally seems to work)
• Even then, this only helps if we have seen the exact same
situation before
• The machine learning problem remains: How do we generalize
from cases that we haven’t seen before?
• What if somebody types a unique Google query?
• What if a patient comes in with slightly di↵erent symptoms?
• What if someone writes a new unique spam email?
• Even with all the data in the world, your chances of finding
the exact same case are almost zero. We need induction.
No Free Lunch Theorem (David Wolpert, 1996)
No Free Lunch Theorem
If all functions f (x) = y are equally likely, all algorithms that aim
to optimize that function have identical performance.
• Sets a limit on how good a learner can be: no learner can be
better than random guessing!
• But then why is the world full of highly successful learners?
• For every world where a learner does better than random
guessing, we can construct an anti-world by flipping the labels
of all unseen instances: it performs worse by the same amount
• We don’t care about all possible worlds, only the one we live in
• We assume we know something about this world that gives us
an advantage. That knowledge is fallible, but it’s a risk we’ll
have to take
The futility of bias-free learning
Practical consequence: There’s no such thing as learning without
knowledge (assumptions). Data alone is not enough.
• We need to provide prior knowledge to the algorithm, or make
assumptions when constructing hypotheses
• The structure of a neural net, a Bayesian prior, background
knowledge as rules, the way a tree represents knowledge,...
• These assumptions are called a learner’s bias, i.e. bias-free
learning is impossible.
• Every new piece of knowledge is the basis for more knowledge.
• What assumptions can we start from that are not too strong?
Newton’s Principle of induction: Whatever is true of everything
we’ve seen, is true for everything in the universe
• We induce the most widely applicable rules we can, and
reduce scope only when the data forces us to.
Learning = Representation + Evaluation + Optimization
All learners consist of three main components (all introducing
bias):
• Representation: A model must be represented in a formal
language that the computer can handle.
• Defines the concepts it can learn: The hypothesis space
• Evaluation: How to choose one hypothesis over the other?
• The evaluation function, objective function, scoring function
• Can differ from the external evaluation function (e.g. accuracy)
• Optimization: How do we search the hypothesis space?
• Key to the efficiency of the learner
• Defines how many optima it finds
• Often starts from most simple hypothesis, relaxing it if needed
to explain the data
A dating algorithm (same table as before)
• Representation: conjunctions of factors (conjunctive concepts)
• Optimization: start with best 1-factor concept, then add best
other factor on remaining data
• Don’t try all combinations (combinatorial explosion)!
• Evaluation: exclude most bad matches and fewest good ones
• Result: Weekend ∧ Warm (→ No date)
• Thoughts?
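The greedy procedure described in these bullets can be written out as a toy sketch (tie-breaking, data encoding, and names are illustrative, not the course's exact algorithm):

```python
# Toy greedy learner for a single conjunctive rule (the slide's
# "dating algorithm"): grow a conjunction that keeps all positives
# covered while excluding as many negatives as possible.
DATA = [  # (Day, Type, Weather, TV, Date?)
    ("Weekday", "Dinner", "Warm", "Bad",  "No"),
    ("Weekend", "Club",   "Warm", "Bad",  "Yes"),
    ("Weekend", "Club",   "Warm", "Bad",  "Yes"),
    ("Weekend", "Club",   "Cold", "Good", "No"),
]
FEATURES = ["Day", "Type", "Weather", "TV"]

def covers(rule, row):
    return all(row[i] == v for i, v in rule)

def learn_conjunction(data):
    pos = [r for r in data if r[-1] == "Yes"]
    neg = [r for r in data if r[-1] == "No"]
    rule = []
    # Keep adding conditions while some negative is still covered.
    while any(covers(rule, r) for r in neg):
        best = None
        for i in range(len(FEATURES)):
            for v in sorted({r[i] for r in data}):
                cand = rule + [(i, v)]
                if not all(covers(cand, r) for r in pos):
                    continue  # never give up a positive example
                excluded = sum(not covers(cand, r) for r in neg)
                if best is None or excluded > best[0]:
                    best = (excluded, cand)
        if best is None:
            break  # no condition helps; stop with an imperfect rule
        rule = best[1]
    return rule

rule = learn_conjunction(DATA)
print([(FEATURES[i], v) for i, v in rule])
# [('Day', 'Weekend'), ('Weather', 'Warm')]: the slide's result
```

Note that it never enumerates all combinations of conditions: it adds one condition at a time, which is what avoids the combinatorial explosion.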
Learning sets of rules (Michalski)
• Real concepts are disjunctive → sets of rules
• Credit card used in 3 different continents yesterday → stolen
• Credit card used twice after 23:00 on a weekday → stolen
• Credit card used to buy 1 dollar of gas → stolen
• Divide and conquer approach: build a rule covering as many
positive examples as possible, discard all positive examples
that it covers, repeat until all are covered
• Sets of rules can represent any concept. How?
• Just turn each positive instance into a rule using all factors
• Weekend ∧ Club ∧ Warm ∧ Bad → Yes
• 100% accurate rule? Or an illusion?
• Every new (unseen) example will be negative
• No free lunch: you can’t learn without assuming anything
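The divide-and-conquer (sequential covering) loop described above can be sketched as follows (an illustrative toy implementation on made-up data, not Michalski's original algorithm):

```python
# Sequential covering sketch: repeatedly learn one conjunctive rule,
# discard the positives it covers, and continue until all are covered.
def covers(rule, row):
    return all(row[i] == v for i, v in rule)

def best_rule(pos, neg, n_features):
    """Greedily grow one conjunction until it covers no negatives."""
    rule, covered = [], list(pos)
    while any(covers(rule, r) for r in neg) and len(rule) < n_features:
        best = None
        for i in range(n_features):
            for v in sorted({r[i] for r in covered}):
                cand = rule + [(i, v)]
                p = [r for r in covered if covers(cand, r)]
                if not p:
                    continue  # must keep covering some positives
                ex = sum(not covers(cand, r) for r in neg)
                if best is None or (ex, len(p)) > best[0]:
                    best = ((ex, len(p)), cand, p)
        rule, covered = best[1], best[2]
    return rule, covered

def sequential_covering(data, n_features):
    pos = [r for r in data if r[-1] == 1]
    neg = [r for r in data if r[-1] == 0]
    rules = []
    while pos:
        rule, _ = best_rule(pos, neg, n_features)
        rules.append(rule)
        pos = [r for r in pos if not covers(rule, r)]  # discard covered
    return rules

# Disjunctive concept: label 1 iff x0 == "a" or x1 == "q".
data = [
    ("a", "p", 1), ("a", "q", 1), ("b", "q", 1),
    ("b", "p", 0), ("c", "p", 0),
]
rules = sequential_covering(data, 2)
print(len(rules))  # 2: a single conjunction cannot express this concept
```

The disjunctive target needs two rules, which is exactly why sets of rules are more expressive than a single conjunction.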
Overfitting
• Learner finds a pattern in the data that is not actually true in
the real world: overfits the data
• Humans also overfit when they overgeneralize from an
incomplete picture of the world
• Every powerful learner can ‘hallucinate’ patterns
• Happens when you have too many hypotheses and not enough
data to tell them apart
• The more data, the more ’bad’ hypotheses are eliminated
• If the hypothesis space is not constrained, there may never be
enough data
• There is often a parameter that allows you to constrain
(regularize) the learner
Overfitting in rule learning
[Figure: decision boundaries of two rule learners on the same
two-class data, x.1 vs x.2]
• Overfitting rule learner (classif.JRip: P=TRUE; E=TRUE):
Train mmce=0.02; CV mmce.test.mean=0.085.
Better training set performance (seen examples).
• Non-overfitting rule learner (classif.JRip):
Train mmce=0.04; CV mmce.test.mean=0.08.
Better test set performance (unseen examples).
Another example
[Figure: decision boundaries of two SVMs on the same two-class
data, x.1 vs x.2]
• Overfitting learner (classif.ksvm: fit=FALSE; kernel=rbfdot; C=1; sigma=100):
Train mmce=0.01; CV mmce.test.mean=0.215.
Better training set performance (seen examples).
• Non-overfitting learner (classif.ksvm: fit=FALSE; kernel=rbfdot; C=1; sigma=1):
Train mmce=0.045; CV mmce.test.mean=0.055.
Better test set performance (unseen examples).
Overfitting and noise
• Overfitting is seriously exacerbated by noise (errors in the
training data)
• Unconstrained learner will model that noise
• A popular misconception is that overfitting is always caused
by noise
Overfitting and noise

Nr   Day of Week   Type of Date   Weather   TV Tonight   Date?
1    Weekday       Dinner         Warm      Bad          No
2    Weekend       Club           Warm      Bad          Yes
3    Weekend       Club           Warm      Good         Yes
4    Weekend       Club           Cold      Good         No
Now  Weekend       Club           Cold      Bad          ?
• We misremembered occasion 3
• If we model this instance, the resulting rule will be worse
• Better to ignore a few instances than to try to get them all correct
Overfitting and noise

Nr   Day of Week   Type of Date   Weather   TV Tonight   Date?
1    Weekday       Dinner         Warm      Bad          No
2    Weekend       Club           Warm      Bad          Yes
3    Weekend       Club           Warm      Bad          No
4    Weekend       Club           Cold      Good         No
Now  Weekend       Club           Cold      Bad          ?
• There was another reason your friend said no that has nothing
to do with you
• You are better off misclassifying it
• Worse: it is now impossible to find any consistent set of rules
Overfitting in regression models
In regression, likewise, the complexity of the considered models
needs to be limited. More data also helps here.
[Figure: three regression fits: an underfitting model, a good model,
and an overfitting model]
Avoiding Overfitting
• You should never believe your model until you’ve verified it on
data that the learner didn’t see
• Scientific method applied to machine learning: model must
make new predictions that can be experimentally verified
• Randomly divide the data into:
• Training set which you give to the learner
• Test set which you hide to verify predictive performance
• A learner can do this internally to stop short of perfectly
fitting the data
• Do a statistical significance test to see whether one hypothesis
is significantly better than another
• Throw out enough low-significance hypotheses
• Prefer simpler hypotheses altogether (e.g. divide-and-conquer,
regularization)
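The train/test split from the bullets above can be sketched as a toy holdout evaluation (the learner and data here are deliberately simple and illustrative):

```python
import random

# Holdout evaluation sketch: hide a test set from the learner and
# measure performance there, not on the training data it already saw.
def train_test_split(rows, test_frac=0.25, seed=0):
    rows = rows[:]                      # do not mutate the caller's data
    random.Random(seed).shuffle(rows)   # random division, reproducible
    cut = int(len(rows) * (1 - test_frac))
    return rows[:cut], rows[cut:]

def majority_class(train):
    """A deliberately weak learner: always predict the majority label."""
    labels = [y for _, y in train]
    return max(set(labels), key=labels.count)

data = [(x, x % 2) for x in range(20)]   # toy labelled data
train, test = train_test_split(data)
pred = majority_class(train)
test_acc = sum(pred == y for _, y in test) / len(test)
print(len(train), len(test))  # 15 5
```

Only the test-set accuracy is an honest estimate of predictive performance; the learner never sees those five instances.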
Avoiding overfitting
Model complexity can often be controlled by an algorithm
(regularization) parameter. Optimizing it, by tracking the actual
(test set) error, is always recommended.
[Figure: error vs. model complexity: the apparent (training set)
error keeps decreasing with complexity, while the actual (test set)
error first decreases (underfitting) and then rises again
(overfitting)]
A note on Occam’s razor
Occam’s razor
Prefer the simplest hypothesis that fits all the data
• This does NOT reduce overfitting (popular misconception)
• Simpler models are preferable for other reasons (e.g.
computational and cognitive cost)
Bias-Variance analysis
• Overfitting can be better understood by decomposing the test
set error into
• Bias: Learner’s tendency to consistently misclassify certain
instances (underfitting)
• Variance: Learner’s tendency to learn random things
irrespective of the real signal (overfitting)
• Noise: Intrinsic error, independent from the learner
• This can be done by comparing predictions after training the
model on many random samples of the training data.
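The procedure in the last bullet can be sketched as follows: train the same learner on many random samples and compare its predictions at a fixed test point (all names and the high-bias learner are illustrative):

```python
import random

# Bias-variance sketch: train a learner on many random samples and
# decompose its error at a fixed test point x0.
random.seed(1)

def f(x):
    return x * x            # the true signal

x0 = 2.0                    # test point; true value f(x0) = 4.0

def fit_constant(sample):
    """High-bias learner: always predicts the mean y of its sample."""
    return sum(y for _, y in sample) / len(sample)

preds = []
for _ in range(200):
    sample = [(x, f(x) + random.gauss(0, 0.1))
              for x in [random.uniform(0, 3) for _ in range(20)]]
    preds.append(fit_constant(sample))  # this learner ignores x0 entirely

mean_pred = sum(preds) / len(preds)
bias_sq = (mean_pred - f(x0)) ** 2
variance = sum((p - mean_pred) ** 2 for p in preds) / len(preds)
print(bias_sq > variance)  # True: this learner underfits, bias dominates
```

A very flexible learner on the same data would show the opposite pattern: low bias, but predictions that swing wildly from sample to sample (high variance).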
Bias-Variance analysis
[Figure: bias-variance diagnostic]
• High bias: reduce underfitting. Make the model more flexible
(or choose another). Try Boosting.
• High variance: reduce overfitting. Make the model less flexible
(regularization), or add more data. Try Bagging.
The 5 ‘tribes’ of Machine Learning
The 5 paradigms
Rival schools of thought in machine learning, each with its own
core beliefs and a distinct strategy for learning:
• Symbolic Learning: Express, manipulate symbolic knowledge
• Rules and trees
• Neural Networks (Connectionism): Mimic the human brain
• Neural Nets
• Evolution: Simulate the evolutionary process
• Genetic algorithms
• Probabilistic (Bayesian) Inference: Reduce uncertainties by
incorporating new evidence
• Graphical models, Gaussian processes
• Learning by Analogy: Recognize similarities between old and
new situations
• kNN, Support Vector Machines
Meant as overview: Concepts are explained later in the course
Symbolic Learning
• All intelligence can be reduced to manipulating symbols
• Can incorporate preexisting knowledge (e.g. as rules)
• Can combine knowledge, data, to fill in gaps (like scientists)
Representation
Rules, trees, first order logic rules
Evaluation
Accuracy, information gain
Optimization
Top-down induction, inverse deduction
Algorithms
Decision trees, Logic programs
Symbolic Learning: Trees
[Figure: scatter plot of the iris data, Petal.Length (0 to 7) on the
x-axis vs Petal.Width (0 to 2.5) on the y-axis]
Symbolic Learning: Trees
A decision tree fitted to the iris data (all splits with p < 0.001):
  Petal.Length ≤ 1.9: Node 2 (n = 50)
  Petal.Length > 1.9:
    Petal.Width ≤ 1.7:
      Petal.Length ≤ 4.8: Node 5 (n = 46)
      Petal.Length > 4.8: Node 6 (n = 8)
    Petal.Width > 1.7: Node 7 (n = 46)
[Each leaf shows a bar chart of class frequencies over the iris species.]
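The tree above can be transcribed as plain code. This is a sketch: the split thresholds come from the tree shown, while the leaf class labels are the majority classes of the standard iris dataset, inferred here since the leaf bar charts are not reproduced.

```python
# The fitted iris tree, written out as nested if-statements.
# Thresholds are from the tree above; leaf labels are the majority
# classes of the standard iris data (an assumption, not shown above).
def predict_iris(petal_length, petal_width):
    if petal_length <= 1.9:
        return "setosa"            # Node 2 (n = 50)
    if petal_width <= 1.7:
        if petal_length <= 4.8:
            return "versicolor"    # Node 5 (n = 46)
        return "virginica"         # Node 6 (n = 8, a mixed leaf)
    return "virginica"             # Node 7 (n = 46)

print(predict_iris(1.4, 0.2))  # setosa
print(predict_iris(4.5, 1.4))  # versicolor
print(predict_iris(6.0, 2.3))  # virginica
```

This illustrates the symbolic representation: the learned model is a human-readable set of axis-parallel tests, not a black box.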
Symbolic Learning
Deduction
Socrates is human + humans are mortal = ?
Inverse deduction
Socrates is human + ? = Socrates is mortal
• Background knowledge (e.g. gene interactions, metabolic
pathways) can be expressed in first-order logic
• Inverse deduction can infer new hypotheses
• Robot scientist: learns hypotheses, then designs and runs
experiments to test hypotheses
Neural Networks
• Learning is what the brain does: reverse-engineer it
• Adjust strengths of connection between neurons
• Can handle raw, high-dimensional data, constructs its own
features
Representation
Neural network
Evaluation
Squared error
Optimization
Gradient descent
Algorithms
Backpropagation
Neural Networks
• Hebbian learning: Neurons that fire together, wire together
• Backpropagation: assigns ’blame’ for errors to neurons earlier
in the network
• Many applications (deep learning).
Evolution
• Natural selection is the mother of all learning
• Simulate evolution on a computer
• Can learn structure, e.g. the shape of a brain
Representation
Genetic programs (often trees)
Evaluation
Fitness function
Optimization
Genetic search
Algorithms
Genetic programming (crossover, mutation)
Probabilistic (Bayesian) Learning
• Learning is a form of uncertain inference
• Uses Bayes’ theorem to incorporate new evidence into our
beliefs
• Can deal with noisy, incomplete, contradictory data
Representation
Graphical models, Markov networks
Evaluation
Posterior probability
Optimization
Probabilistic inference
Algorithms
Bayes’ theorem and its derivatives
Bayesian Learning
• Choose hypothesis space + prior for each hypothesis
• As evidence comes in, update probability of each hypothesis
• Posterior: how likely each hypothesis is after seeing the data
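A minimal sketch of this update loop, on a toy coin example (the two hypotheses and their likelihoods are illustrative):

```python
# Bayesian updating sketch: a prior over two hypotheses about a coin,
# updated with Bayes' theorem after each observed flip.
hypotheses = {"fair": 0.5, "biased": 0.9}   # P(heads | hypothesis)
posterior = {"fair": 0.5, "biased": 0.5}    # uniform prior

def update(posterior, flip):
    """One Bayes step: posterior is proportional to likelihood * prior."""
    unnorm = {h: (p if flip == "H" else 1 - p) * posterior[h]
              for h, p in hypotheses.items()}
    z = sum(unnorm.values())                # normalizing constant
    return {h: v / z for h, v in unnorm.items()}

for flip in "HHHH":                         # evidence: four heads in a row
    posterior = update(posterior, flip)

print(posterior["biased"] > posterior["fair"])  # True
```

Each new flip shifts probability mass toward the hypothesis that explains the evidence better; no hypothesis is ever accepted or rejected outright.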
Learning by Analogy
• Learning is recognizing similarities between situations and
inferring other similarities
• Generalizes from similarity
• Transfer solution from previous situations to new situations
Representation
Memory, support vectors
Evaluation
Margin
Optimization
Kernel machines
Algorithms
Nearest Neighbor, Support Vector Machines
Nearest neighbors
• Given cities belonging to 2 countries. Where is the border?
• Nearest neighbor: a point belongs to the country of its closest city
• k-Nearest neighbor: take a vote over the k nearest ones (smoother)
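The city example can be sketched as a minimal k-nearest-neighbour classifier (illustrative coordinates and country labels):

```python
from collections import Counter

# k-nearest-neighbour sketch: classify a point by a vote over the
# k closest labelled points (squared Euclidean distance, toy data).
def knn_predict(train, x, k=3):
    def dist(a):
        return sum((ai - xi) ** 2 for ai, xi in zip(a, x))
    nearest = sorted(train, key=lambda row: dist(row[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

cities = [((0, 0), "A"), ((1, 0), "A"), ((0, 1), "A"),
          ((5, 5), "B"), ((6, 5), "B"), ((5, 6), "B")]
print(knn_predict(cities, (1, 1)))  # A: its 3 nearest cities are in A
print(knn_predict(cities, (5, 4)))  # B
```

There is no training phase at all: the "model" is simply the stored examples, which is why this family is also called instance-based or lazy learning.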
Support Vector Machines (Kernel methods)
• Only remember points that define border (support vectors)
• Find the linear border with maximal margin to the nearest points
• If not linearly separable, transform the input space (kernel
trick)
Further Reading
• P. Domingos. A Few Useful Things to Know about Machine
Learning
• P. Domingos. The Master Algorithm
• G. James et al. An Introduction to Statistical Learning
• P. Flach. Machine Learning