18.4 Evaluating and Choosing the Best Hypothesis
• We assume that there is a probability distribution over examples
that remains stationary over time
• Each observed example is sampled from that distribution and is
independent of the previous examples, and
• each example has an identical prior probability distribution
• Examples that satisfy these assumptions are called independent
and identically distributed (i.i.d.)
• The error rate of a hypothesis h is the proportion of mistakes it
makes
• The proportion of times that h(x) ≠ y for an (x, y) example
• Just because a hypothesis h has low error rate on the training set
does not mean that it will generalize well
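• As a concrete illustration, the error rate can be estimated on a held-out set of (x, y) pairs; the sketch below is a minimal Python version with illustrative names that are not from the slides (examples as a list of pairs, h as any callable hypothesis)

```python
def error_rate(h, examples):
    """Fraction of examples (x, y) on which the hypothesis predicts h(x) != y."""
    mistakes = sum(1 for x, y in examples if h(x) != y)
    return mistakes / len(examples)

# Toy usage: a hypothesis that predicts True for odd inputs.
test_set = [(1, True), (2, False), (3, True), (4, True)]
h = lambda x: x % 2 == 1
print(error_rate(h, test_set))  # one mistake out of four -> 0.25
```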
Model selection: Complexity vs. goodness of fit
• We can think of finding the best hypothesis as two tasks:
• Model selection defines the hypothesis space and
• Optimization finds the best hypothesis within that space
• How to select among models that are parameterized by size
• With polynomials we have size = 1 for linear functions,
size = 2 for quadratics, and so on
• For decision trees, the size could be the number of nodes in
the tree
• We want to find the value of the size parameter that best
balances underfitting and overfitting to give the best test set
accuracy
• A wrapper takes a learning algorithm as an argument (DT
learning for example)
• The wrapper enumerates models according to the size parameter
• For each size, it uses cross validation (say) on the learner to
compute the average error rate on training and test sets
• We start with the smallest, simplest models (which probably
underfit the data), and iterate, considering more complex models
at each step, until the models start to overfit
• The cross validation picks the value of size with the lowest
validation set error
• We then generate a hypothesis of that size using all the data
(without holding out any of it; eventually we should evaluate the
returned hypothesis on a separate test set)
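• A minimal sketch of such a wrapper, reusing the error_rate helper from the earlier sketch and assuming a learner(size, examples) training routine (illustrative names, not from the slides); the simplified version below scans sizes up to a fixed limit rather than stopping when overfitting sets in

```python
def cross_validation(learner, size, examples, k=5):
    """Average validation-set error of `learner` at the given size over k folds."""
    fold = max(1, len(examples) // k)
    errors = []
    for i in range(k):
        validation = examples[i * fold:(i + 1) * fold]
        training = examples[:i * fold] + examples[(i + 1) * fold:]
        if not validation or not training:
            continue
        h = learner(size, training)
        errors.append(error_rate(h, validation))
    return sum(errors) / len(errors)

def cross_validation_wrapper(learner, examples, max_size=20):
    """Pick the size with the lowest validation error, then retrain on all the data."""
    best_size = min(range(1, max_size + 1),
                    key=lambda size: cross_validation(learner, size, examples))
    return learner(best_size, examples)
```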
From error rates to loss
• Consider the problem of classifying emails as spam or non-spam
• It is worse to classify non-spam as spam than to classify spam as
non-spam
• So a classifier with a 1% error rate, where almost all errors were
classifying spam as non-spam, would be better than a classifier
with only a 0.5% error rate, if most of those errors were
classifying non-spam as spam
• Utility is what learners – like decision makers – should maximize
• In machine learning it is traditional to express utilities by means
of loss functions
• The loss function L(x, y, ŷ) is defined as the amount of utility lost
by predicting h(x) = ŷ when the correct answer is f(x) = y:
L(x, y, ŷ) = U(result of using y given an input x)
– U(result of using ŷ given an input x)
• Often a simplified version L(y, ŷ) of the loss function, independent of
x, is used; for example, if it is 10 times worse to classify non-spam as
spam than vice versa:
L(spam, nonspam) = 1, L(nonspam, spam) = 10
• Note that L(y, y) is always zero
• In general for real-valued data small errors are better than large
ones
• Two functions that implement that idea are the absolute value of
the difference (called the L1 loss), and the square of the
difference (called the L2 loss)
• Minimizing the error rate corresponds to minimizing the L0/1 loss function
Absolute value loss: L1(y, ŷ) = |y − ŷ|
Squared error loss: L2(y, ŷ) = (y − ŷ)²
0/1 loss: L0/1(y, ŷ) = 0 if y = ŷ, else 1
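• The sketch below implements these loss functions, plus the asymmetric spam loss above, directly from their definitions (illustrative code, not from the slides)

```python
def l1_loss(y, y_hat):
    """Absolute value loss: |y - y_hat|."""
    return abs(y - y_hat)

def l2_loss(y, y_hat):
    """Squared error loss: (y - y_hat)**2."""
    return (y - y_hat) ** 2

def l01_loss(y, y_hat):
    """0/1 loss: 0 if the prediction is exact, else 1."""
    return 0 if y == y_hat else 1

def spam_loss(y, y_hat):
    """Asymmetric loss: classifying non-spam as spam is 10 times worse than the reverse."""
    if y == y_hat:
        return 0
    return 10 if (y == "nonspam" and y_hat == "spam") else 1
```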
• Let P(X, Y) be a prior probability distribution over examples
• Let E be the set of all possible input-output examples
• Then the expected generalization loss for a hypothesis h (w.r.t.
loss function L) is
GenLoss_L(h) = Σ_{(x,y)∈E} L(y, h(x)) P(x, y)
• The best hypothesis h* is the one with the minimum expected
generalization loss
h* = arg min_{h∈H} GenLoss_L(h)
• Because P(x, y) is not known, the learning agent can only
estimate generalization loss with empirical loss on the set of
examples E:
EmpLoss_{L,E}(h) = (1/N) Σ_{(x,y)∈E} L(y, h(x)), where N = |E|
• The estimated best hypothesis ĥ* is then the one with minimum
empirical loss:
ĥ* = arg min_{h∈H} EmpLoss_{L,E}(h)
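• A minimal sketch of estimating the best hypothesis by empirical loss, assuming a finite candidate set H and any loss function of the form loss(y, y_hat), such as those defined above

```python
def empirical_loss(loss, h, examples):
    """Average loss of hypothesis h over the example set E."""
    return sum(loss(y, h(x)) for x, y in examples) / len(examples)

def best_hypothesis(loss, hypotheses, examples):
    """The hypothesis in H with the minimum empirical loss on the examples."""
    return min(hypotheses, key=lambda h: empirical_loss(loss, h, examples))

# Toy usage with squared error loss and three candidate hypotheses.
E = [(0, 0.1), (1, 0.9), (2, 2.2)]
H = [lambda x: x, lambda x: 0.5 * x, lambda x: x + 1]
h_star = best_hypothesis(lambda y, y_hat: (y - y_hat) ** 2, H, E)
```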
Regularization
• Earlier on we did model selection with cross-validation on model
size
• An alternative approach is to search for a hypothesis that directly
minimizes the weighted sum of empirical loss and the complexity
of the hypothesis, which we call the total cost
Cost(h) = EmpLoss(h) + λ Complexity(h)
ĥ* = arg min_{h∈H} Cost(h)
• Here λ is a parameter, a positive number that serves as a
conversion rate between loss and hypothesis complexity
• We still need to do cross-validation search to find the hypothesis
that generalizes best, but this time with different values of λ
• This process of explicitly penalizing complex hypotheses is called
regularization
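• A sketch of regularized selection under these definitions; complexity(h) stands for whatever complexity measure is chosen (e.g. the number of parameters or tree nodes) and lam plays the role of λ (illustrative names). As noted above, λ itself is tuned by cross-validation

```python
def total_cost(loss, h, examples, complexity, lam):
    """Cost(h) = EmpLoss(h) + lambda * Complexity(h)."""
    emp = sum(loss(y, h(x)) for x, y in examples) / len(examples)
    return emp + lam * complexity(h)

def regularized_best(loss, hypotheses, examples, complexity, lam):
    """The hypothesis minimizing the total cost for a given lambda."""
    return min(hypotheses,
               key=lambda h: total_cost(loss, h, examples, complexity, lam))
```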
18.5 The Theory of Learning
• Computational Learning Theory (COLT) [Valiant 1984]
• Valiant received the ACM Turing Award in 2010
• Any hypothesis that is seriously wrong will almost certainly be
“found out” with high probability after a small number of
examples, because it will make an incorrect prediction
• Thus, any hypothesis that is consistent with a sufficiently large
set of training examples is unlikely to be seriously wrong: that is,
it must be probably approximately correct (PAC)
• PAC-learning is based on the stationarity assumption: Future
examples are going to be drawn from the same fixed distribution
P(E) = P(X, Y) as past examples
• We do not necessarily know the distribution
• Without this connection between the past and the future, learning
would be practically impossible
• Let X be the instance space from which the examples are drawn with a
stationary distribution
• H is the hypothesis class and N denotes the number of training
examples
• We assume that the true function f is a member of H
• The error rate of a hypothesis h is defined as the expected
generalization error for examples drawn from the stationary
distribution
error(h) = GenLoss_{L0/1}(h) = Σ_{(x,y)} L0/1(y, h(x)) P(x, y)
• A hypothesis h is called approximately correct if error(h) ≤ ε, where ε
is a small constant
• Let us examine a hypothesis hb, which is not an ε-approximation of the
true function
• Then error(hb) > ε and the probability that it agrees with a given
example is at most 1 − ε
• Since the examples are independent, the bound for N examples is
P(hb agrees with N examples) ≤ (1 − ε)^N
• The probability that the hypothesis class H contains at least one such
hypothesis that is consistent with all N examples is at most |H|(1 − ε)^N
• We would like to reduce the probability of this event below some
small number δ:
|H|(1 − ε)^N ≤ δ
• Then
N ≥ (1 / ln(1/(1 − ε))) (ln|H| + ln(1/δ))
• Because ln(1 + x) ≤ x, and hence ε ≤ ln(1/(1 − ε)), we can achieve this
if we allow the algorithm to see
N ≥ (1/ε)(ln|H| + ln(1/δ))
examples
• Thus, if a learning algorithm returns a hypothesis that is
consistent with this many examples, then with probability at least
1 − δ, it has error at most ε
• In other words, it is probably approximately correct
• The number of required examples, as a function of ε and δ, is
called the sample complexity of the hypothesis space
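• A small illustration of the bound N ≥ (1/ε)(ln|H| + ln(1/δ)) in code; the hypothesis class and the numbers are only an example

```python
import math

def sample_complexity(ln_h_size, eps, delta):
    """Examples sufficient for PAC-learning a finite hypothesis class, given ln|H|."""
    return math.ceil((1 / eps) * (ln_h_size + math.log(1 / delta)))

# The class of all Boolean functions of n = 10 attributes has |H| = 2**(2**10),
# so ln|H| = 2**n * ln 2 grows exponentially in n and dominates the bound.
n = 10
print(sample_complexity((2 ** n) * math.log(2), eps=0.1, delta=0.05))
```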
Learning decision lists
• A decision list consists of a series of tests, each of which is a
conjunction of literals
• The tests (rules) are attempted in order
• If a test succeeds, the decision list specifies the value to be
returned
• If a test fails, processing continues with the next test in the list
• Example (the restaurant domain): a decision list equivalent to
if Patrons = Some then Yes
else if Patrons = Full ∧ Fri/Sat then Yes
else No
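• The decision list above can be represented directly as an ordered list of (test, outcome) pairs; a minimal sketch, with examples encoded as attribute dictionaries (the encoding is illustrative)

```python
# Each test is a conjunction of attribute = value literals; the trailing default is "No".
decision_list = [
    ({"Patrons": "Some"}, True),
    ({"Patrons": "Full", "Fri/Sat": True}, True),
]

def dl_predict(dl, example, default=False):
    """Return the outcome of the first test whose literals all hold in the example."""
    for test, outcome in dl:
        if all(example.get(attr) == val for attr, val in test.items()):
            return outcome
    return default

print(dl_predict(decision_list, {"Patrons": "Full", "Fri/Sat": False}))  # False
```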
• Decision lists resemble decision trees, but their overall structure
is simpler: they branch only in one direction
• In contrast, the individual tests are more complex
• Decision lists generalize decision trees
• If we allow tests of arbitrary size, then decision lists can represent
any Boolean function
• On the other hand, if we restrict the size of each test to at most k
literals (k-DL), then it is possible for the learning algorithm to
generalize successfully from a small number of examples
• Language k-DL includes as a subset the language k-DT, the set
of all decision trees of depth at most k
• Let k-DL(n) denote a k-DL language using n Boolean attributes
• Let Conj(n, k) denote the conjunctions of at most k literals using
n attributes
• The number of conjunctions of at most k literals from n attributes
(each of the 2n possible literals is an attribute or its negation) is
given by
|Conj(n, k)| = Σ_{i=0}^{k} C(2n, i) = O(n^k)
• In a decision list each conjunction can be attached to either a
Yes or No outcome or can be absent from the decision list
• Hence, there are at most 3^|Conj(n,k)| distinct sets of component
tests
• Each of these sets of tests can be in any order, so
|k-DL(n)| ≤ 3^|Conj(n,k)| · |Conj(n,k)|!
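• These counts are easy to compute directly for small n and k; an illustrative check of the bounds above (math.comb requires Python 3.8+)

```python
import math

def conj_count(n, k):
    """|Conj(n, k)|: conjunctions of at most k of the 2n possible literals."""
    return sum(math.comb(2 * n, i) for i in range(k + 1))

def kdl_upper_bound(n, k):
    """The bound 3**|Conj(n,k)| * |Conj(n,k)|! on |k-DL(n)|."""
    c = conj_count(n, k)
    return 3 ** c * math.factorial(c)

print(conj_count(5, 2))  # 1 + 10 + 45 = 56 conjunctions for n = 5, k = 2
```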
• Combining the above estimates yields
|k-DL(n)| = 2^{O(n^k log2(n^k))}
• We plug this into the formula for sample complexity to show that
the number of examples needed for PAC-learning is
N ≥ (1/ε)(O(n^k log2(n^k)) + ln(1/δ))
• Hence, a k-DL function is PAC-learnable from a number of
examples polynomial in n
• Therefore, any algorithm that returns a consistent decision list will
PAC-learn a k-DL function in a reasonable number of examples,
for small k
• The following greedy algorithm will output a consistent decision
list (if one exists)
Algorithm DL-learner(S)
Input: S, a set of training examples
Output: a decision list, or failure
1. if S = Ø then return the trivial decision list No;
2. choose a test t that matches a subset of examples St ≠ Ø such that
the members of St are either all positive or all negative;
3. if there is no such t then fail;
4. if all examples in St are positive then o ← Yes else o ← No;
5. L ← DL-learner(S \ St);
6. return a decision list with initial test t, outcome o, and the
remaining tests given by L;
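• A runnable sketch of this greedy learner, reusing the (test, outcome) representation from the earlier decision-list sketch; the enumeration of candidate tests as conjunctions of at most k literals is an illustrative choice, not part of the algorithm above

```python
from itertools import combinations

def matches(test, example):
    """True if every attribute = value literal of the conjunction holds in the example."""
    return all(example.get(attr) == val for attr, val in test.items())

def candidate_tests(attributes, values, k=1):
    """All conjunctions of at most k (attribute, value) literals, one value per attribute."""
    literals = [(a, v) for a in attributes for v in values[a]]
    tests = []
    for size in range(1, k + 1):
        for combo in combinations(literals, size):
            if len({a for a, _ in combo}) == size:  # no attribute constrained twice
                tests.append(dict(combo))
    return tests

def dl_learner(examples, tests):
    """Greedy DL-learner: (test, outcome) pairs, or ValueError if no consistent list exists."""
    if not examples:
        return []  # the trivial decision list "No" (the default outcome of dl_predict)
    for test in tests:
        matched = [y for x, y in examples if matches(test, x)]
        if matched and len(set(matched)) == 1:  # non-empty and all positive or all negative
            rest = [(x, y) for x, y in examples if not matches(test, x)]
            return [(test, matched[0])] + dl_learner(rest, tests)
    raise ValueError("no consistent decision list with these tests")

# Example usage on restaurant-style data.
examples = [({"Patrons": "Some"}, True), ({"Patrons": "Full", "Fri/Sat": True}, True),
            ({"Patrons": "Full", "Fri/Sat": False}, False), ({"Patrons": "None"}, False)]
tests = candidate_tests(["Patrons", "Fri/Sat"],
                        {"Patrons": ["None", "Some", "Full"], "Fri/Sat": [True, False]}, k=2)
print(dl_learner(examples, tests))
```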