Intelligent Autonomous Systems
Learning in Bayesian Networks
Based on a lecture by Prof. Geoffrey Hinton, University of Toronto
Janusz A. Starzyk, Wyzsza Szkola Informatyki i Zarzadzania w Rzeszowie
The Bayesian Paradigm
• The Bayesian paradigm assumes that we always have a prior distribution for everything.
– The prior distribution may be very vague.
– When we see some data, we combine our prior distribution with a likelihood term to get a posterior distribution.
– The likelihood term expresses how probable the observed data is given the parameters of the model.
• It favors parameter settings that make the data more likely.
• It fights the prior distribution.
• With enough data, the likelihood terms always win.
Bayes' Theorem

Joint probability in terms of conditional probability:

$p(D)\,p(W|D) = p(D,W) = p(W)\,p(D|W)$

Posterior probability of the weight vector W given the training data D:

$p(W|D) = \frac{p(W)\,p(D|W)}{p(D)}, \qquad p(D) = \sum_W p(W)\,p(D|W)$

Here $p(W)$ is the prior probability of the weight vector W, and $p(D|W)$ is the probability of the data given the weights W, i.e. the likelihood.
Why do we maximize sums of log probabilities?
• We want to maximize the product of the probabilities of the outputs on the training cases.
– Assume that the output errors on different training cases, c, are independent:

$p(D|W) = \prod_c p(d_c|W)$

• Because the log function is monotonic, we can maximize sums of log probabilities instead:

$\log p(D|W) = \sum_c \log p(d_c|W)$
Maximum Likelihood Learning
• Minimizing the sum of squared errors is equivalent to maximizing the log probability of the correct answer under a Gaussian centered at the model's guess.

$y_c = f(\mathrm{input}_c, W)$

where $d_c$ is the correct answer and $y_c$ is the model's estimate of the most probable value.

$p(\mathrm{output}=d_c \mid \mathrm{input}_c, W) = p(d_c \mid y_c) = \frac{1}{\sqrt{2\pi}\,\sigma_D}\, e^{-(d_c-y_c)^2/2\sigma_D^2}$

$-\log p(\mathrm{output}=d_c \mid \mathrm{input}_c, W) = k + \frac{(d_c-y_c)^2}{2\sigma_D^2}$
Maximum Likelihood (ML) Learning
• Finding a set of weights, W, that minimizes the sum of squared errors is exactly the same as finding a W that maximizes the log probability that the model would produce the desired outputs on all the training cases.
– We implicitly assume that zero-mean Gaussian noise is added to the model's actual output.
– We do not need to know the noise level, because we assume it is the same in all cases; it just scales the sum of squared errors.
Bayes' Theorem

$p(D)\,p(W|D) = p(D,W) = p(W)\,p(D|W)$

$p(W|D) = \frac{p(W)\,p(D|W)}{p(D)}, \qquad p(D) = \sum_W p(W)\,p(D|W)$

Here $p(W|D)$ is the posterior probability of the weight vector W given the training data D, $p(W)$ is its prior probability, and $p(D|W)$ is the probability of the data given the weights W (the likelihood). Taking negative logs gives the cost to minimize:

$\mathrm{Cost} = -\log p(W|D) = -\log p(W) - \log p(D|W) + \log p(D)$
Maximum A Posteriori (MAP) Learning
• This trades off the prior probability of the parameters against the probability of the data given the parameters; it looks for the parameters that have the greatest product of the prior term and the likelihood term.
• Minimizing the sum of squared weights is equivalent to maximizing the log probability of the weights under a zero-mean Gaussian (maximizing the prior):

$p(w) = \frac{1}{\sqrt{2\pi}\,\sigma_W}\, e^{-w^2/2\sigma_W^2}$

$-\log p(w) = \frac{w^2}{2\sigma_W^2} + k$

[Figure: the Gaussian prior density p(w), centered at w = 0]
Maximum A Posteriori (MAP) Learning
• Maximizing the posterior probability
$\mathrm{Cost} = -\log p(W|D) = -\log p(W) - \log p(D|W) + \log p(D)$
is equivalent to minimizing the regularized sum-of-squares error function
$E = \sum_i (y_i - d_i)^2$
• with regularization parameter $\lambda = \sigma_D^2 / \sigma_W^2$,
• i.e. to minimizing the cost function
$C = E + \frac{\sigma_D^2}{\sigma_W^2} \sum_i w_i^2$

[Figure: the Gaussian prior density p(w), centered at w = 0]
Questions?
Bayesian Learning
Full version
The Bayesian framework
• The Bayesian framework assumes that we always
have a prior distribution for everything.
– The prior may be very vague.
– When we see some data, we combine our prior distribution
with a likelihood term to get a posterior distribution.
– The likelihood term takes into account how probable the
observed data is given the parameters of the model.
• It favors parameter settings that make the data likely.
• It fights the prior.
• With enough data, the likelihood terms always win.
A coin tossing example
• Suppose we know nothing about coins except that each
tossing event produces a head with some unknown
probability p and a tail with probability 1-p. Our model of
a coin has one parameter, p.
• Suppose we observe 100 tosses and there are 53
heads. What is p?
• The frequentist answer: Pick the value of p that makes
the observation of 53 heads and 47 tails most probable.
$P(D) = p^{53}(1-p)^{47}$  (the probability of a particular observed sequence)

$\frac{dP(D)}{dp} = 53\,p^{52}(1-p)^{47} - 47\,p^{53}(1-p)^{46} = \left[\frac{53}{p} - \frac{47}{1-p}\right] p^{53}(1-p)^{47} = 0 \quad \text{if } p = 0.53$
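We can check this maximum numerically. A minimal sketch in Python (my own illustration, not part of the lecture):

```python
import numpy as np

# Log-likelihood of 53 heads and 47 tails for each candidate value of p.
p = np.linspace(0.001, 0.999, 999)
log_lik = 53 * np.log(p) + 47 * np.log(1 - p)

print(p[np.argmax(log_lik)])  # 0.53, the maximum likelihood estimate
```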
Some problems with picking the parameters
that are most likely to generate the data
• What if we only tossed the coin once and we got
1 head?
– Is p=1 a sensible answer?
• Surely p=0.5 is a much better answer.
• Is it reasonable to give a single answer?
– If we don’t have much data, we are unsure
about p.
– Our computations of probabilities will work
much better if we take this uncertainty into
account.
Using a distribution over parameter values
• Start with a prior distribution over p. In this case we used a uniform distribution.
• Multiply the prior probability of each parameter value by the probability of observing a head given that value.
• Then scale up all of the probability densities so that their integral comes to 1. This gives the posterior distribution.

[Figure: density plots over p in [0, 1]: the uniform prior (area = 1), the unnormalized product after one head, and the renormalized posterior (area = 1), which rises to density 2 at p = 1]
Let's do it again: suppose we get a tail
• Start with a prior distribution over p (here, the posterior from the previous toss).
• Multiply the prior probability of each parameter value by the probability of observing a tail given that value.
• Then renormalize to get the posterior distribution. Look how sensible it is!

[Figure: the prior (area = 1) and the posterior after one head and one tail, a density proportional to p(1 - p) on [0, 1], also with area = 1]
Let's do it another 98 times
• After 53 heads and 47 tails we get a very sensible posterior distribution that has its peak at 0.53 (assuming a uniform prior).

[Figure: the posterior density over p after 100 tosses, sharply peaked at p = 0.53, area = 1]
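This grid-style updating is easy to reproduce numerically. A minimal sketch, assuming a uniform prior and the same 53 heads and 47 tails:

```python
import numpy as np

grid = np.linspace(0.001, 0.999, 999)   # candidate values of p
posterior = np.ones_like(grid)          # uniform prior
posterior /= posterior.sum()

tosses = [1] * 53 + [0] * 47            # 53 heads, 47 tails
for t in tosses:
    likelihood = grid if t == 1 else 1 - grid
    posterior *= likelihood             # multiply prior by likelihood...
    posterior /= posterior.sum()        # ...and renormalize

print(grid[np.argmax(posterior)])       # peak at 0.53, as on the slide
```

Renormalizing after every toss is for clarity only; a single renormalization at the end gives the same posterior.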
Bayes' Theorem

Joint probability in terms of conditional probability:

$p(D)\,p(W|D) = p(D,W) = p(W)\,p(D|W)$

Posterior probability of weight vector W given training data D:

$p(W|D) = \frac{p(W)\,p(D|W)}{p(D)}, \qquad p(D) = \sum_W p(W)\,p(D|W)$

where $p(W)$ is the prior probability of the weight vector W and $p(D|W)$ is the probability of the observed data given W, the likelihood function.
A cheap trick to avoid computing the
posterior probabilities of all weight vectors
• Suppose we just try to find the most probable
weight vector.
– We can do this by starting with a random
weight vector and then adjusting it in the
direction that improves p( W | D ).
• It is easier to work in the log domain. If we want
to minimize a cost we use negative log
probabilities:
$p(W|D) = p(W)\,p(D|W)\,/\,p(D)$

$\mathrm{Cost} = -\log p(W|D) = -\log p(W) - \log p(D|W) + \log p(D)$
Why we maximize sums of log probs
• We want to maximize the product of the probabilities of the outputs on the training cases.
– Assume the output errors on different training cases, c, are independent:

$p(D|W) = \prod_c p(d_c|W)$

• Because the log function is monotonic, we can maximize sums of log probabilities instead:

$\log p(D|W) = \sum_c \log p(d_c|W)$
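Beyond convenience, there is a numerical reason to prefer the log domain: a product of many probabilities underflows floating point, while the sum of logs stays representable. A small illustration of my own:

```python
import math

probs = [0.01] * 200   # 200 independent cases, each with probability 0.01

product = 1.0
for q in probs:
    product *= q
print(product)         # 0.0: the true value 1e-400 underflows double precision

log_sum = sum(math.log(q) for q in probs)
print(log_sum)         # about -921.03, perfectly representable
```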
An even cheaper trick
• Suppose we completely ignore the prior over weight vectors.
– This is equivalent to giving all possible weight vectors the same prior probability density.
• Then all we have to do is maximize

$\log p(D|W) = \sum_c \log p(d_c|W)$

• This is called maximum likelihood learning. It is very widely used for fitting models in statistics.
Supervised Maximum Likelihood Learning
• Minimizing the squared residuals is equivalent to maximizing the log probability of the correct answer under a Gaussian centered at the model's guess.

$y_c = f(\mathrm{input}_c, W)$

where $d_c$ is the correct answer and $y_c$ is the model's estimate of the most probable value.

$p(\mathrm{output}=d_c \mid \mathrm{input}_c, W) = p(d_c \mid y_c) = \frac{1}{\sqrt{2\pi}\,\sigma_D}\, e^{-(d_c-y_c)^2/2\sigma_D^2}$

$-\log p(\mathrm{output}=d_c \mid \mathrm{input}_c, W) = k + \frac{(d_c-y_c)^2}{2\sigma_D^2}$
Supervised Maximum Likelihood (ML) Learning
• Finding a set of weights, W, that minimizes the
squared errors is exactly the same as finding a W
that maximizes the log probability that the model
would produce the desired outputs on all the
training cases.
– We implicitly assume that zero-mean Gaussian
noise is added to the model’s actual output.
– We do not need to know the variance of the
noise because we are assuming it’s the same
in all cases. So it just scales the squared error.
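The equivalence is easy to verify: the negative Gaussian log-likelihood differs from the summed squared error only by a constant offset and a scale, so both are minimized by the same W. A sketch using a hypothetical one-parameter linear model (the data, model, and noise level here are my own assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
d = 2.0 * x + rng.normal(scale=0.3, size=100)  # targets with Gaussian noise

sigma = 0.3  # assumed noise level (its value only scales the cost)

def neg_log_lik(w):
    y = w * x  # the model's guess y_c = f(input_c, W)
    return np.sum(0.5 * np.log(2 * np.pi * sigma**2)
                  + (d - y) ** 2 / (2 * sigma**2))

def squared_error(w):
    return np.sum((d - w * x) ** 2)

ws = np.linspace(1.0, 3.0, 2001)
print(ws[np.argmin([neg_log_lik(w) for w in ws])])    # same minimizer
print(ws[np.argmin([squared_error(w) for w in ws])])  # for both costs
```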
Bayes' Theorem

$p(D)\,p(W|D) = p(D,W) = p(W)\,p(D|W)$

$p(W|D) = \frac{p(W)\,p(D|W)}{p(D)}, \qquad p(D) = \sum_W p(W)\,p(D|W)$

where $p(W|D)$ is the posterior probability of weight vector W given training data D, $p(W)$ is the prior probability of W, and $p(D|W)$ is the likelihood function. In the log domain:

$\mathrm{Cost} = -\log p(W|D) = -\log p(W) - \log p(D|W) + \log p(D)$
Maximum A Posteriori (MAP) Learning
• This trades off the prior probabilities of the parameters against the probability of the data given the parameters. It looks for the parameters that have the greatest product of the prior term and the likelihood term.
• Minimizing the squared weights is equivalent to maximizing the log probability of the weights under a zero-mean Gaussian (maximizing the prior):

$p(w) = \frac{1}{\sqrt{2\pi}\,\sigma_W}\, e^{-w^2/2\sigma_W^2}$

$-\log p(w) = \frac{w^2}{2\sigma_W^2} + k$

[Figure: the Gaussian prior density p(w), centered at w = 0]
Maximum A Posteriori (MAP) Learning
• Maximizing the posterior probability
$\mathrm{Cost} = -\log p(W|D) = -\log p(W) - \log p(D|W) + \log p(D)$
is equivalent to minimizing the regularized sum-of-squares error function
$E = \sum_i (y_i - d_i)^2$
• with a regularization parameter $\lambda = \sigma_D^2 / \sigma_W^2$,
• i.e. minimizing the cost function (see the sketch below)
$C = E + \frac{\sigma_D^2}{\sigma_W^2} \sum_i w_i^2$

[Figure: the Gaussian prior density p(w), centered at w = 0]
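So MAP learning with a Gaussian prior is ordinary L2 weight decay. A minimal sketch, reusing the same kind of hypothetical one-parameter model as above:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=20)
d = 2.0 * x + rng.normal(scale=0.5, size=20)  # a small, noisy data set

sigma_D, sigma_W = 0.5, 1.0        # assumed noise and prior widths
lam = sigma_D**2 / sigma_W**2      # the regularization parameter

def map_cost(w):
    E = np.sum((w * x - d) ** 2)   # sum-of-squares error
    return E + lam * w**2          # plus the squared-weight penalty

ws = np.linspace(-1.0, 4.0, 5001)
print(ws[np.argmin([map_cost(w) for w in ws])])
# The minimizer is pulled slightly toward 0 relative to the ML estimate.
```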
Full Bayesian Learning
• Instead of trying to find the best single setting of the
parameters (as in ML or MAP) compute the full posterior
distribution over parameter settings
– This is extremely computationally intensive for all but the simplest models (it's feasible for a biased coin).
• To make predictions, let each different setting of the
parameters make its own prediction and then combine
all these predictions by weighting each of them by the
posterior probability of that setting of the parameters.
– This is also computationally intensive.
• The full Bayesian approach allows us to use complicated models even when we do not have much data.
Overfitting: A frequentist illusion?
• If you do not have much data, you should use a
simple model, because a complex one will overfit.
– This is true. But only if you assume that fitting a
model means choosing a single best setting of
the parameters.
– If you use the full posterior over parameter
settings, overfitting disappears!
– With little data, you get very vague predictions because many different parameter settings have significant posterior probability.
A classic example of overfitting
• Which model do you believe?
– The complicated model fits the
data better.
– But it is not economical and it
makes silly predictions.
• But what if we start with a reasonable prior over all fifth-order polynomials and use the full posterior distribution?
– Now we get vague and sensible predictions.
• There is no reason why the amount of
data should influence our prior beliefs
about the complexity of the model.
Approximating full Bayesian learning in a
neural network
• If the neural net only has a few parameters we could put
a grid over the parameter space and evaluate p( W | D )
at each grid-point.
– This is expensive, but it does not involve any gradient
descent and there are no local optimum issues.
• After evaluating each grid-point, we use all of them to make predictions on test data.
– This is also expensive, but it works much better than ML learning when the posterior is vague or multimodal (this happens when data is scarce).

$p(d_{\mathrm{test}} \mid \mathrm{input}_{\mathrm{test}}) = \sum_{g \in \mathrm{grid}} p(d_{\mathrm{test}} \mid \mathrm{input}_{\mathrm{test}}, W_g)\, p(W_g \mid D)$
An example of full Bayesian learning
• Allow each of the 6 weights or biases to have the 9 possible values [-2 : 0.5 : 2].
– So there are $9^6$ grid-points in parameter space.
• For each grid-point, compute the probability of the observed outputs of all the training cases.
– This is the likelihood term and is explained on the next slide.
• Multiply the prior for each grid-point $p(W_i)$ by the likelihood term and renormalize to get the posterior probability for each grid-point $p(W_i \mid D)$.
• Make predictions $p(y_{\mathrm{test}} \mid \mathrm{input}, D)$ by using the posterior probabilities of all grid-points to average the predictions $p(y_{\mathrm{test}} \mid \mathrm{input}, W_i)$ made by the different grid-points.

[Figure: a neural net with 2 inputs, 1 output, and 6 parameters (weights and biases)]
Computing the likelihood term for a logistic output unit
• The output of the logistic unit is the probability that the network assigns to the answer 1. It assigns the complementary probability to the answer 0:

$y_c = f(\mathrm{input}_c, W_g)$

$p(\mathrm{output} = d_c \mid \mathrm{input}_c, W_g) = d_c y_c + (1-d_c)(1-y_c)$

which equals $y_c$ if $d_c = 1$ and $1-y_c$ if $d_c = 0$. The likelihood is

$p(\text{all training outputs} \mid W_g) = \prod_c p(\mathrm{output} = d_c \mid \mathrm{input}_c, W_g)$
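Putting the last three slides together, here is a minimal sketch of grid-based full Bayesian learning. To keep it short I use a hypothetical 2-parameter logistic model (one weight per input, no hidden units) instead of the slide's 6-parameter net, and a uniform prior, so the grid has only $9^2$ points; the data and all names are my own:

```python
import itertools
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy training set: 2 inputs, binary targets.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
d = np.array([0., 1., 1., 1.])

values = np.arange(-2.0, 2.5, 0.5)   # the 9 values [-2 : 0.5 : 2]
grid = [np.array(W) for W in itertools.product(values, repeat=2)]

# Likelihood of all training outputs at each grid-point; with a uniform
# prior, the posterior is just the renormalized likelihood.
posterior = np.array(
    [np.prod(d * sigmoid(X @ W) + (1 - d) * (1 - sigmoid(X @ W))) for W in grid])
posterior /= posterior.sum()

# Predict for a test input by posterior-weighted averaging over grid-points.
x_test = np.array([1., 1.])
p_one = sum(p * sigmoid(x_test @ W) for W, p in zip(grid, posterior))
print(p_one)   # p(output = 1 | x_test, D)
```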
What can we do if there are too many
parameters for a grid to be feasible?
• The number of grid points is exponential in the number
of parameters.
– So we cannot deal with more than a few parameters
using a grid.
• If there is enough data to make most parameter vectors very unlikely, then only a tiny fraction of the grid points makes a significant contribution to the predictions.
– Maybe we can just evaluate this tiny fraction.
• It might be good enough to just sample weight vectors according to their posterior probabilities:

$p(y_{\mathrm{test}} \mid \mathrm{input}_{\mathrm{test}}, D) = \sum_i p(W_i \mid D)\; p(y_{\mathrm{test}} \mid \mathrm{input}_{\mathrm{test}}, W_i)$

That is, sample weight vectors $W_i$ with probability $p(W_i \mid D)$ and average their predictions; a short continuation of the earlier sketch follows.
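Continuing the grid sketch above (this reuses the hypothetical `grid`, `posterior`, `sigmoid`, and `x_test` defined there), the exact sum can be approximated by a Monte Carlo average over weight vectors drawn from the posterior:

```python
# Draw grid-points with probability p(W_i | D) and average their predictions.
rng = np.random.default_rng(0)
idx = rng.choice(len(grid), size=100, p=posterior)
p_one_mc = np.mean([sigmoid(x_test @ grid[i]) for i in idx])
print(p_one_mc)   # close to the exact posterior-averaged prediction
```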
One method for sampling weight vectors
• In standard backpropagation we keep moving the weights in the
direction that decreases the cost
– i.e. the direction that increases the log likelihood plus the log prior,
summed over all training cases.
• Suppose we add some Gaussian noise to the weight vector after each
update.
– So the weight vector never settles down.
– It keeps wandering around, but it tends to prefer low cost regions
of the weight space.
• Amazing fact: If we use just the right amount of noise, and if we let the
weight vector wander around for long enough before we take a
sample, we will get a sample from the true posterior over weight
vectors.
– This is called a “Markov Chain Monte Carlo” method and it makes
it feasible to use full Bayesian learning with hundreds or thousands
of parameters.
– There are related MCMC methods that are more complicated but
more efficient (we don’t need to let the weights wander around for
so long before we get samples from the posterior).
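In modern terms, this noisy-update procedure is essentially Langevin-dynamics MCMC. A toy sketch of the idea, using a cost whose true posterior is a standard Gaussian so that the samples can be checked; the step size and schedule are my own choices:

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_cost(w):
    # Gradient of the cost (negative log posterior). Here the posterior is a
    # standard Gaussian, so cost = w**2 / 2 and the gradient is w; in a neural
    # net this would be the backprop gradient of (squared error + weight decay).
    return w

eps = 0.01   # step size; the injected noise must match it ("just the right amount")
w = 0.0
samples = []
for step in range(200_000):
    w += -0.5 * eps * grad_cost(w) + np.sqrt(eps) * rng.normal()
    if step > 50_000 and step % 10 == 0:   # let it wander first, then sample
        samples.append(w)

print(np.mean(samples), np.std(samples))   # near 0 and 1, the true posterior
```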