Bayesian Concept Learning
and
Application to a Simple Language Model
Reference: Machine Learning – A Probabilistic Perspective
by Kevin Murphy
Introduction
• Consider how a child learns the meaning of a word, e.g. “dog”.
• The child’s parents point out positive examples of the concept:
  • e.g. “look at the cute dog!”
  • It is unlikely that they provide negative examples, by saying “look at that non‐dog”.
• We can think of learning the meaning of a word as equivalent to concept learning, which in turn is equivalent to binary classification.
Introduction
• Define f(x) = 1 if x is an example of the concept C, and f(x) = 0 otherwise.
• The goal is to learn the indicator function f, which defines which elements are in the set C.
• Consider an example called the number game.
• I choose some simple arithmetical concept C.
  • e.g. “prime number”
• I then give you a series of randomly chosen positive examples D = {x1, ..., xN} drawn from C, and ask you whether some new test case x̃ belongs to C.
Introduction
• Suppose all numbers are integers between 1 and 100, and I tell you that “16” is a positive example of the concept.
  • What other numbers do you think are positive? 17? 32?
  • Your predictions will be vague with only one example.
• Numbers that are similar in some sense to 16 are more likely. But similar in what way?
  • 17 is similar because it is “close by”; 32 is similar because it is even and a power of 2.
• Some numbers are more likely than others; we can represent this as a probability distribution.
Introduction
• p(x̃ ∈ C | D) is the probability that x̃ belongs to C given the data D, for any x̃ ∈ {1, ..., 100}; this is called the posterior predictive distribution.
• Suppose I now tell you that 8, 2 and 64 are also positive examples.
• You may guess the hidden concept is “powers of two”.
• The predictive distribution is now quite specific, and puts most of its mass on powers of 2.
• If instead I tell you the data is D = {16, 23, 19, 20}, you will get a different predictive distribution (e.g. numbers near 20).
Introduction
Empirical predictive distribution, averaged over 8 humans in the number game. The first two rows illustrate diffuse similarity. The third row illustrates rule‐like behavior (powers of 2). The bottom row illustrates focused similarity (numbers near 20).
Introduction
• Suppose we have a hypothesis space of concepts, H.
  • E.g. odd numbers, even numbers, powers of two, etc.
• The subset of H that is consistent with the data D is called the version space. As we see more examples, the version space shrinks and we become more certain about the concept.
• After seeing D = {16}, there are many consistent rules.
• How do you combine them to predict whether x̃ ∈ C?
Introduction
• After seeing D = {16, 8, 2, 64}, why did you choose the rule “powers of two” and not other rules that are equally consistent with the evidence?
• For example, why is “powers of two” more likely to be chosen than “even numbers”, given that both are consistent with the evidence?
Likelihood
• The key intuition is to avoid suspicious coincidences.
• If the true concept were “even numbers”, how come we only saw numbers that happen to be powers of two?
• Define the extension of a concept as the set of numbers that belong to it.
  • e.g. the extension of h = “even numbers” is {2, 4, 6, ..., 98, 100}.
• Assume the examples are sampled uniformly at random from the extension of the concept.
Likelihood
• The probability of independently sampling N items (with replacement) from h is given by
  p(D | h) = [1 / size(h)]^N = [1 / |h|]^N
• Let h_two = “powers of two” and h_even = “even numbers”. Then size(h_two) = 6, since there are only 6 powers of two less than 100, and size(h_even) = 50, since there are 50 even numbers.
• After N = 4 examples, the likelihood of h_two is (1/6)^4 ≈ 7.7 × 10^-4, and the likelihood of h_even is (1/50)^4 ≈ 1.6 × 10^-7. This is a likelihood ratio of almost 5000:1 in favor of h_two.
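As an illustration, here is a minimal Python sketch of this “size principle” likelihood; the hypothesis extensions and data below are just the examples from the slide.

```python
# A minimal sketch of the "size principle" likelihood in the number game.
# The two hypothesis extensions below are just the examples from the slide.
powers_of_two = {2 ** i for i in range(1, 7)}            # {2, 4, 8, 16, 32, 64}
even_numbers = {n for n in range(1, 101) if n % 2 == 0}  # 50 numbers

def likelihood(data, extension):
    """p(D | h) = (1 / |h|)^N if every example lies in h's extension, else 0."""
    if not all(x in extension for x in data):
        return 0.0
    return (1.0 / len(extension)) ** len(data)

data = [16, 8, 2, 64]
lik_two = likelihood(data, powers_of_two)    # (1/6)^4  ~ 7.7e-4
lik_even = likelihood(data, even_numbers)    # (1/50)^4 ~ 1.6e-7
print(lik_two / lik_even)                    # ~ 4823, i.e. almost 5000:1
```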
Likelihood
• The maximum likelihood estimate, or MLE, is
  ĥ_MLE ≜ argmax_h p(D | h) = argmax_h log p(D | h)
• Consider again D = {16, 8, 2, 64}.
• Since p(D | h_two) ≫ p(D | h_even), the MLE is h_two (powers of 2).
Prior
• Suppose D = {16, 8, 2, 64}. The concept “powers of two except 32” then has higher likelihood than “powers of two”, since 32 is missing from the set of examples.
• However, the hypothesis “powers of two except 32” seems conceptually unnatural.
  • We can capture such intuitions by assigning low prior probability to unnatural concepts.
• Although the subjectivity of the prior is controversial, it is quite useful.
Prior
• We use a simple prior which puts uniform probability on 30 simple arithmetical concepts.
  • E.g. “even numbers”, “prime numbers”, etc.
  • The concepts “even numbers” and “odd numbers” are made more likely a priori.
  • We also include two “unnatural” concepts, “powers of 2, plus 37” and “powers of 2, except 32”, and give them low prior weight.
Posterior
• The posterior is the likelihood times the prior, normalized:
  p(h | D) = p(D | h) p(h) / Σ_{h′ ∈ H} p(D | h′) p(h′)
           = [ p(h) 𝕀(D ∈ h) / |h|^N ] / Σ_{h′ ∈ H} [ p(h′) 𝕀(D ∈ h′) / |h′|^N ]
  where 𝕀(D ∈ h) is 1 iff (if and only if) all the data are in the extension of the hypothesis h.
• The posterior is a combination of prior and likelihood.
• If the prior is uniform, the posterior is proportional to the likelihood.
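A minimal Python sketch of this posterior computation over a toy hypothesis space; the concepts and prior weights below are illustrative assumptions, not the full 30-concept prior from the slides.

```python
# A toy version of the number-game posterior p(h | D); the concepts and prior
# weights here are illustrative assumptions, not the slides' full prior.
hypotheses = {
    "even":                ({n for n in range(1, 101) if n % 2 == 0}, 0.3),
    "odd":                 ({n for n in range(1, 101) if n % 2 == 1}, 0.3),
    "powers of 2":         ({2 ** i for i in range(1, 7)},            0.3),
    "powers of 2, not 32": ({2 ** i for i in range(1, 7)} - {32},     0.1),  # "unnatural"
}

def posterior(data):
    unnorm = {}
    for name, (extension, prior) in hypotheses.items():
        consistent = all(x in extension for x in data)
        lik = (1.0 / len(extension)) ** len(data) if consistent else 0.0
        unnorm[name] = prior * lik
    z = sum(unnorm.values())
    return {name: w / z for name, w in unnorm.items()}

print(posterior([16]))             # mass spread over the consistent concepts
print(posterior([16, 8, 2, 64]))   # mass concentrates on "powers of 2"
```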
Posterior
Prior, likelihood and posterior for D = {16}.
• The “unnatural” concepts “powers of 2, plus 37” and “powers of 2, except 32” have low posterior support, due to the low prior.
• The concept of odd numbers has low posterior support, due to the low likelihood.
Posterior
• The MAP estimate can be written as
  ĥ_MAP ≜ argmax_h p(D | h) p(h) = argmax_h [ log p(D | h) + log p(h) ]
• MLE: the consistent hypothesis with the smallest extension.
• MAP: the consistent hypothesis with the largest product of prior and likelihood.
Posterior
Prior, likelihood and posterior for D = {16, 8, 2, 64}.
• The likelihood is now much more peaked on the “powers of two” concept, and it dominates the posterior.
Posterior
• MLE: “powers of 2, except 32”, the smallest consistent extension.
• MAP: “powers of 2”, since the low prior on the unnatural concept outweighs its slightly higher likelihood.
Posterior
• With enough data, the learner figures out the true concept.
  • This shows the need for a low prior on the unnatural concepts; otherwise we would overfit the data and pick “powers of 2, except 32”.
• When we have enough data, the posterior becomes peaked on a single concept, namely the MAP estimate:
  p(h | D) → δ_{ĥ_MAP}(h)
  where ĥ_MAP = argmax_h p(h | D) is the posterior mode, and δ_x is the Dirac measure, defined by
  δ_x(A) = 1 if x ∈ A, and δ_x(A) = 0 if x ∉ A.
Posterior Predictive Distribution
• The posterior is our internal belief state about the world.
• The way to test whether these beliefs are justified is to use them to predict objectively observable quantities.
• The posterior predictive distribution is given by
  p(x̃ ∈ C | D) = Σ_h p(y = 1 | x̃, h) p(h | D)
• This is a weighted average of the predictions of each individual hypothesis, and is called Bayes model averaging (BMA).
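Continuing the earlier toy sketch (it reuses the hypothetical `hypotheses` dictionary and `posterior` function defined there), Bayes model averaging might look like this:

```python
# Bayes model averaging for the toy number game above (reuses the hypothetical
# `hypotheses` dict and `posterior` function from the earlier sketch).
def posterior_predictive(x_new, data):
    """p(x_new in C | D) = sum_h p(x_new in C | h) p(h | D)."""
    post = posterior(data)
    return sum(post[name] * (x_new in hypotheses[name][0]) for name in hypotheses)

print(posterior_predictive(32, [16, 8, 2, 64]))   # fairly high: "powers of 2" contains 32
print(posterior_predictive(50, [16, 8, 2, 64]))   # near zero: only "even" contains 50
```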
Posterior Predictive Distribution
Posterior over hypotheses, and the corresponding predictive distribution, after seeing D = {16}. A dot means the number is consistent with that hypothesis. The graph on the right shows the weight p(h | D) given to each hypothesis. By taking a weighted sum of the dots, we get p(x̃ ∈ C | D).
Posterior Predictive Distribution
• The dots at the bottom show the predictions of each hypothesis; the vertical curve on the right shows the weight associated with each hypothesis.
  • If we multiply each row by its weight and add them up, we get the distribution at the top.
• When we have a small and/or ambiguous dataset, the posterior is vague, which induces a broad predictive distribution.
Posterior Predictive Distribution
• Once we have “figured things out”, the posterior becomes a delta function centered at the MAP estimate.
• The predictive distribution then becomes
  p(x̃ ∈ C | D) = p(x̃ | ĥ_MAP)
• This is called a plug‐in approximation to the predictive density, and is widely used due to its simplicity. However, it under‐represents our uncertainty, and our predictions will not be as “smooth” as when using BMA.
Posterior Predictive Distribution
• Although MAP learning is simple, it cannot explain the gradual shift from similarity‐based reasoning (uncertain posteriors) to rule‐based reasoning (certain posteriors).
• For example, suppose we observe D = {16}.
• If we use the simple prior, the minimal consistent hypothesis is “all powers of 4”, so only 4 and 16 get a non‐zero probability of being predicted. This is an example of overfitting.
Posterior Predictive Distribution
• Given D = {16, 8, 2, 64}, the MAP hypothesis is “all powers of two”. The plug‐in predictive distribution can only get broader (or stay the same) as we see more data:
  • it starts narrow, but is forced to broaden as it sees more data.
Posterior Predictive Distribution
• In the Bayesian approach, we start broad and then narrow down as we learn more, which makes more intuitive sense.
  • Given D = {16}, there are many hypotheses with non‐negligible posterior support, so the predictive distribution is broad.
  • Given D = {16, 8, 2, 64}, the posterior concentrates its mass on one hypothesis, so the predictive distribution becomes narrower.
• The predictions made by the plug‐in approach and the Bayesian approach differ in the small‐sample regime, although they converge to the same answer as we see more data.
Beta‐binomial Model
Introduction
• The number game involved inferring a distribution over a discrete variable drawn from a finite hypothesis space, given a series of discrete observations.
  • The computations were simple: sum, multiply and divide.
• Now suppose the unknown parameters are continuous, so the hypothesis space is (a subset of) R^K, where K is the number of parameters. This complicates the mathematics, since sums are replaced by integrals. But the basic ideas are the same.
Beta‐binomial Model
Likelihood
• Consider the problem of inferring the probability that a coin shows up heads, given a series of observed coin tosses.
• Suppose X_i ~ Ber(θ), where X_i = 1 represents “heads”, X_i = 0 represents “tails”, and θ ∈ [0, 1] is the rate parameter (probability of heads).
• If the data are iid, the likelihood has the form
  p(D | θ) = θ^N1 (1 − θ)^N0
  where N1 = Σ_i 𝕀(x_i = 1) is the number of heads and N0 = Σ_i 𝕀(x_i = 0) is the number of tails. These two counts are called the sufficient statistics of the data, since they are all we need to know about D to infer θ. (An alternative set of sufficient statistics is N1 and N = N1 + N0.)
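A minimal Python sketch (not from the slides) of this likelihood, written in terms of the sufficient statistics:

```python
# A small sketch of the Bernoulli likelihood written in terms of the
# sufficient statistics N1 (heads) and N0 (tails).
from math import log

def bernoulli_log_likelihood(data, theta):
    """log p(D | theta) = N1 * log(theta) + N0 * log(1 - theta) for binary data."""
    n1 = sum(data)            # number of 1s (heads)
    n0 = len(data) - n1       # number of 0s (tails)
    return n1 * log(theta) + n0 * log(1 - theta)

# Two datasets with the same sufficient statistics give the same likelihood.
print(bernoulli_log_likelihood([1, 0, 0, 1, 1], 0.6))
print(bernoulli_log_likelihood([1, 1, 1, 0, 0], 0.6))
```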
Beta‐binomial Model
Likelihood: Sufficient Statistics
• s(D) is a sufficient statistic for the data D if p(θ | D) = p(θ | s(D)).
• If we use a uniform prior, this is equivalent to saying p(D | θ) ∝ p(s(D) | θ).
• If we have two datasets with the same sufficient statistics, we will infer the same value for θ.
Beta‐binomial Model
Likelihood
• Now suppose the data consist of the count of the number of heads observed in a fixed number N of trials. We have N1 ~ Bin(N, θ), where Bin represents the binomial distribution, which has the pmf
  Bin(k | n, θ) ≜ (n choose k) θ^k (1 − θ)^(n−k)
• Since (n choose k) is a constant independent of θ, the likelihood for the binomial sampling model is the same as the likelihood for the Bernoulli model, up to a constant.
• Any inferences we make about θ will be the same whether we observe the counts, D = (N1, N), or a sequence of trials, D = {x1, ..., xN}.
Beta‐binomial Model
Prior
• We need a prior with support over the interval [0, 1]. It would be convenient if the prior had the same form as the likelihood:
  p(θ) ∝ θ^γ1 (1 − θ)^γ2
  for some prior parameters γ1, γ2.
• We could then evaluate the posterior simply by adding up the exponents:
  p(θ | D) ∝ p(D | θ) p(θ) = θ^(N1+γ1) (1 − θ)^(N0+γ2)
Beta‐binomial Model
Prior
• When the prior and posterior have the same form, the prior is said to be a conjugate prior for the corresponding likelihood.
  • Conjugate priors are widely used because they simplify computation and are easy to interpret.
• For the Bernoulli likelihood, the conjugate prior is the beta distribution:
  Beta(θ | a, b) ∝ θ^(a−1) (1 − θ)^(b−1)
• The parameters of the prior are called hyper‐parameters.
• We can set them in order to encode our prior beliefs.
• E.g. to encode a belief that θ has mean 0.7 and standard deviation 0.2, we set a = 2.975 and b = 1.275.
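A small Python sketch (an addition, not from the slides) showing how such hyper-parameters can be recovered from a desired mean and standard deviation, using the beta distribution's moment formulas:

```python
# Recovering Beta hyper-parameters from a desired mean m and standard
# deviation s, using mean = a/(a+b) and var = ab / ((a+b)^2 (a+b+1)).
def beta_params_from_moments(m, s):
    strength = m * (1 - m) / s ** 2 - 1   # this equals a + b
    return m * strength, (1 - m) * strength

print(beta_params_from_moments(0.7, 0.2))   # ~(2.975, 1.275), as on the slide
```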
Beta Distribution
• The beta distribution has support over the interval [0, 1] and is defined as follows:
  Beta(x | a, b) = x^(a−1) (1 − x)^(b−1) / B(a, b)
  where B(a, b) is the beta function,
  B(a, b) ≜ Γ(a) Γ(b) / Γ(a + b),
  and Γ(x) is the gamma function, Γ(x) ≜ ∫₀^∞ u^(x−1) e^(−u) du.
Beta Distribution
• We require a, b > 0 to ensure the distribution is integrable (i.e. to ensure B(a, b) exists).
• If a = b = 1, we get the uniform distribution.
• If a and b are both less than 1, we get a bimodal distribution with “spikes” at 0 and 1.
• If a and b are both greater than 1, the distribution is unimodal.
Beta Distribution
Some beta distributions.
Beta‐binomial Model
Prior
• If we know “nothing” about θ, except that it lies in the interval [0, 1], we can use a uniform prior, which is a kind of uninformative prior.
• The uniform distribution can be represented by a beta distribution with a = b = 1.
Beta‐binomial Model
Posterior
• If we multiply the likelihood by the beta prior, we get the posterior:
  p(θ | D) ∝ Bin(N1 | N1 + N0, θ) Beta(θ | a, b) ∝ Beta(θ | N1 + a, N0 + b)
• The posterior is obtained by adding the prior hyper‐parameters to the empirical counts.
• For this reason the hyper‐parameters are known as pseudo counts.
• The strength of the prior (the effective sample size of the prior) is the sum of the pseudo counts, a + b; this plays a role analogous to the data set size, N1 + N0 = N.
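A minimal Python sketch of this conjugate update, using the Beta(2, 2) prior and the sufficient statistics N1 = 3, N0 = 17 from the next slide (scipy is assumed to be available):

```python
# Conjugate Beta-Binomial updating with scipy, using the Beta(2,2) prior and
# the counts N1 = 3, N0 = 17 from the next slide.
from scipy.stats import beta

a, b = 2, 2                    # prior hyper-parameters (pseudo counts)
N1, N0 = 3, 17                 # observed heads / tails

a_post, b_post = a + N1, b + N0            # Beta(5, 19) posterior
posterior = beta(a_post, b_post)

print(posterior.mean())                        # (a + N1) / (a + b + N) = 5/24
print((a_post - 1) / (a_post + b_post - 2))    # MAP estimate = 4/22
```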
Beta‐binomial Model
Posterior
Updating a weak Beta(2, 2) prior with a binomial likelihood with sufficient statistics N1 = 3, N0 = 17 yields a Beta(5, 19) posterior.
 The posterior is essentially identical to the likelihood, since the data have overwhelmed the prior.
Beta‐binomial Model
Posterior
Updating a strong Beta(5, 2) prior with a binomial likelihood with sufficient statistics N1 = 11, N0 = 13 yields a Beta(16, 15) posterior.
 The posterior is a “compromise” between the prior and the likelihood.
Beta‐binomial Model
Posterior
• Updating the posterior sequentially is equivalent to updating it in a single batch.
• Suppose we have two data sets D_a and D_b with sufficient statistics N1^a, N0^a and N1^b, N0^b. Let N1 = N1^a + N1^b and N0 = N0^a + N0^b be the sufficient statistics of the combined datasets.
• In batch mode,
  p(θ | D_a, D_b) ∝ Bin(N1 | θ, N1 + N0) Beta(θ | a, b) ∝ Beta(θ | N1 + a, N0 + b)
Beta‐binomial Model
Posterior
• In sequential mode,
  p(θ | D_a, D_b) ∝ p(D_b | θ) p(θ | D_a)
                  ∝ Bin(N1^b | θ, N1^b + N0^b) Beta(θ | N1^a + a, N0^a + b)
                  ∝ Beta(θ | N1^a + N1^b + a, N0^a + N0^b + b)
• This makes Bayesian inference well suited to online learning.
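A quick Python sketch checking that sequential and batch updating agree (the counts below are made up for illustration):

```python
# Checking that sequential updating matches batch updating for the
# Beta-Bernoulli model; the counts are made up for illustration.
def update(a, b, n1, n0):
    """Posterior hyper-parameters after observing n1 heads and n0 tails."""
    return a + n1, b + n0

a, b = 1, 1                                     # uniform Beta(1, 1) prior
batch = update(a, b, 7 + 2, 3 + 8)              # all of Da and Db at once
sequential = update(*update(a, b, 7, 3), 2, 8)  # Da first, then Db
assert batch == sequential == (10, 12)
```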
Beta‐binomial Model
Posterior mean and mode
• The MAP estimate is given by
  θ̂_MAP = (a + N1 − 1) / (a + b + N − 2)
• If we use a uniform prior, the MAP estimate reduces to the MLE, which is the empirical fraction of heads:
  θ̂_MLE = N1 / N
  This can also be derived by applying elementary calculus to maximize the likelihood function θ^N1 (1 − θ)^N0.
• The posterior mean is given by:
  θ̄ = (a + N1) / (a + b + N)
• The difference between the mode and the mean will prove important later.
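The three estimators side by side in a minimal Python sketch (the prior and counts are illustrative):

```python
# The MLE, MAP estimate and posterior mean of the Beta-Bernoulli model,
# side by side; the prior and counts are illustrative.
def estimates(a, b, n1, n0):
    n = n1 + n0
    mle = n1 / n                                  # empirical fraction of heads
    post_mode = (a + n1 - 1) / (a + b + n - 2)    # MAP estimate
    post_mean = (a + n1) / (a + b + n)            # posterior mean
    return mle, post_mode, post_mean

print(estimates(2, 2, 3, 17))   # (0.15, 0.1818..., 0.2083...)
```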
Beta‐binomial Model
Posterior predictive distribution
• So far we have focused on inference of the unknown parameter(s). We now turn to prediction of future observable data.
• Consider predicting the probability of heads in a single future trial under a Beta(a, b) posterior (where a, b denote the updated hyper‐parameters):
  p(x̃ = 1 | D) = ∫ p(x̃ = 1 | θ) p(θ | D) dθ = ∫ θ Beta(θ | a, b) dθ = E[θ | D] = a / (a + b)
• Thus the mean of the posterior predictive distribution is equivalent to plugging in the posterior mean parameters: p(x̃ | D) = Ber(x̃ | E[θ | D]).
Beta‐binomial Model
Overfitting and the black swan paradox
• Suppose instead we plug in the MLE, i.e., we use p(x̃ | D) ≈ Ber(x̃ | θ̂_MLE).
• This approximation can perform quite poorly when the sample size is small.
• For example, suppose we see 3 tails in a row. The MLE is θ̂ = 0/3 = 0, since this makes the observed data as probable as possible.
• We then predict that heads are impossible. This is called the zero count problem (or sparse data problem), and it occurs when estimating counts from small amounts of data.
Beta‐binomial Model
Overfitting and the black swan paradox
• Note that once we partition the data based on certain criteria, e.g. the number of times a specific person has engaged in a specific activity, the sample sizes can become much smaller.
• This problem arises, for example, when we try to perform personalized recommendation of web pages.
• Thus Bayesian methods are useful even in the big data regime.
Beta‐binomial Model
Overfitting and the black swan paradox
• The zero‐count problem is analogous to a problem in philosophy called the black swan paradox.
  • This is based on the ancient Western conception that all swans were white; a black swan was a metaphor for something that could not exist. (Black swans were first discovered in Australia in the 17th century.)
• The “black swan paradox” was used to illustrate the problem of induction, which is the problem of how to draw general conclusions about the future from specific observations of the past.
Beta‐binomial Model
Overfitting and the black swan paradox
• Let us derive a simple Bayesian solution to the problem. We use a uniform prior, a = b = 1. Plugging in the posterior mean gives Laplace’s rule of succession:
  p(x̃ = 1 | D) = (N1 + 1) / (N1 + N0 + 2)
• This justifies the common practice of adding 1 to the empirical counts, normalizing, and then plugging them in, a technique known as add‐one smoothing.
• Plugging in the MAP parameters would not have this smoothing effect, since the mode has the form θ̂ = (N1 + a − 1) / (N + a + b − 2), which becomes the MLE if a = b = 1.
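A minimal Python sketch contrasting the MLE plug-in with Laplace's rule of succession on the "3 tails in a row" example:

```python
# Contrasting the MLE plug-in with Laplace's rule of succession
# (add-one smoothing) on the "3 tails in a row" example.
def mle_plugin(n1, n0):
    return n1 / (n1 + n0)

def laplace_rule(n1, n0):
    # posterior mean under a uniform Beta(1, 1) prior
    return (n1 + 1) / (n1 + n0 + 2)

n1, n0 = 0, 3                      # zero heads, three tails
print(mle_plugin(n1, n0))          # 0.0 -> "heads are impossible"
print(laplace_rule(n1, n0))        # 0.2 -> small but non-zero probability
```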
Dirichlet‐multinomial Model
Introduction
• Previously, we discussed how to infer the probability that a coin comes up heads.
• We now generalize these results to infer the probability that a die with K sides comes up as face k.
Dirichlet‐multinomial Model
Likelihood
• Suppose we observe N dice rolls, D = {x1, ..., xN}, where x_i ∈ {1, ..., K}. If we assume the data are iid, the likelihood has the form
  p(D | θ) = ∏_k θ_k^N_k
  where θ_k is the probability that label k appears and N_k = Σ_i 𝕀(x_i = k) is the number of times event k occurred (the sufficient statistics for this model).
• The likelihood for the multinomial model has the same form, up to an irrelevant constant factor.
Dirichlet‐multinomial Model
Prior
• Since the parameter vector θ lives in the K‐dimensional probability simplex, we need a prior that has support over this simplex. Ideally it would also be conjugate.
   The Dirichlet distribution satisfies both criteria.
• So we use a Dirichlet prior, Dir(θ | α).
Dirichlet‐multinomial Model
Dirichlet Distribution
• A multivariate generalization of the beta distribution is the Dirichlet distribution, which has support over the probability simplex, defined by
  S_K = { x : 0 ≤ x_k ≤ 1, Σ_{k=1}^K x_k = 1 }
• The pdf is defined as:
  Dir(x | α) ≜ (1 / B(α)) ∏_{k=1}^K x_k^(α_k − 1) 𝕀(x ∈ S_K)
• Returning to the dice rolls, the Dirichlet prior is
  Dir(θ | α) = (1 / B(α)) ∏_{k=1}^K θ_k^(α_k − 1) 𝕀(θ ∈ S_K)
Dirichlet‐multinomial Model
Dirichlet Distribution
The Dirichlet distribution when K = 3 defines a distribution over the simplex, which can be represented by a triangular surface. Points on this surface satisfy 0 ≤ θ_k ≤ 1 and Σ_{k=1}^3 θ_k = 1.
Dirichlet‐multinomial Model
Dirichlet Distribution
Plot of the Dirichlet density for a particular setting of α.
Dirichlet‐multinomial Model
Dirichlet Distribution
Plot of the Dirichlet density when α = (0.1, 0.1, 0.1). The comb‐like structure on the edges is a plotting artifact.
Dirichlet‐multinomial Model
Dirichlet Distribution
Samples from a 5‐dimensional symmetric Dirichlet distribution for different parameter values. Here α = (0.1, ..., 0.1); this results in very sparse distributions, with many zeros.
Dirichlet‐multinomial Model
Dirichlet Distribution
Here α = (1, ..., 1); this results in more uniform (and dense) distributions.
Dirichlet‐multinomial Model
Posterior
• Multiplying the likelihood by the prior, we find that the posterior is also Dirichlet:
  p(θ | D) ∝ p(D | θ) p(θ)
           ∝ ∏_k θ_k^N_k θ_k^(α_k − 1) = ∏_k θ_k^(α_k + N_k − 1)
           = Dir(θ | α_1 + N_1, ..., α_K + N_K)
• The posterior is obtained by adding the prior hyper‐parameters (pseudo‐counts) α_k to the empirical counts N_k.
Dirichlet‐multinomial Model
Posterior predictive
• The posterior predictive distribution for a single multinoulli trial is given by:
  p(X = j | D) = ∫ p(X = j | θ) p(θ | D) dθ = ∫ θ_j p(θ_j | D) dθ_j = E[θ_j | D] = (α_j + N_j) / (α_0 + N)
  where θ_{−j} (all the components of θ except θ_j) have been integrated out, and α_0 = Σ_k α_k.
• This expression avoids the zero‐count problem.
  • This form of Bayesian smoothing is even more important in the multinomial case than in the binary case, since the likelihood of data sparsity increases once we partition the data into many categories.
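A minimal Python sketch of this predictive rule, with made-up counts for a three-category example:

```python
# The Dirichlet-multinomial posterior predictive
# p(X = j | D) = (alpha_j + N_j) / (alpha_0 + N), with made-up counts.
def posterior_predictive(counts, alphas):
    total = sum(counts) + sum(alphas)
    return [(n + a) / total for n, a in zip(counts, alphas)]

counts = [4, 0, 1]     # observed counts N_k for a three-category example
alphas = [1, 1, 1]     # uniform Dirichlet prior (add-one smoothing)
print(posterior_predictive(counts, alphas))   # [5/8, 1/8, 2/8]
```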
Simple Language Model
Using Bag of Words
• One application of Bayesian smoothing using the Dirichlet‐multinomial model is language modeling, i.e. predicting which words might occur next in a sequence.
• Assume the i’th word, X_i ∈ {1, ..., K}, is sampled independently from all the other words using a Cat(θ) distribution. This is called the bag of words model.
• Given a past sequence of words, how can we predict which one is likely to come next?
Simple Language Model
Using Bag of Words
• Suppose we observe the following sequence:
  mary had a little lamb, little lamb, little lamb,
  mary had a little lamb, its fleece as white as snow
• Suppose our vocabulary consists of the following words:
  mary  lamb  little  big  fleece  white  black  snow  rain  unk
   1     2     3       4    5       6      7      8     9     10
  Here unk stands for unknown, and represents all other words that do not appear on the list.
• To encode each line of the sequence, we first strip off punctuation and remove any stop words, e.g. “a”, “as”, “the”, etc.
Simple Language Model
Using Bag of Words
• We also perform stemming, which means reducing words to their base form.
  • E.g. stripping the final “s” off plural words, or “ing” off verbs (e.g. running → run).
  • No stemming is needed in this example.
• We then replace each word by its index into the vocabulary:
  1 10 3 2 3 2 3 2
  1 10 3 2 10 5 10 6 8
• Finally, we ignore the word order and count how often each word occurred, resulting in a histogram of word counts.
Simple Language Model
Using Bag of Words
Token:  1     2     3       4    5       6      7      8     9     10
Word:   mary  lamb  little  big  fleece  white  black  snow  rain  unk
Count:  2     4     4       0    1       1      0      1     0     4

• Denote the counts by N_j. If we use a Dir(θ | α) prior for θ, the posterior predictive is:
  p(X̃ = j | D) = (α_j + N_j) / (Σ_k α_k + Σ_k N_k)
• Setting α_j = 1 for all j gives
  p(X̃ = j | D) = (1 + N_j) / (10 + 17)
              = (3/27, 5/27, 5/27, 1/27, 2/27, 2/27, 1/27, 2/27, 1/27, 5/27)
• The modes of the predictive distribution are X = 2 (“lamb”) and X = 10 (“unk”). Note that the words “big”, “black” and “rain” are predicted to occur with non‐zero probability in the future, even though they have never been seen before.
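To close, a minimal end-to-end Python sketch of this bag-of-words model with add-one (Dirichlet) smoothing, reproducing the counts and predictive probabilities above:

```python
# End-to-end bag-of-words model with add-one (Dirichlet) smoothing,
# reproducing the counts and predictive probabilities on this slide.
from collections import Counter

vocab = ["mary", "lamb", "little", "big", "fleece",
         "white", "black", "snow", "rain", "unk"]

# The two lines of the rhyme after preprocessing, as 1-based vocabulary indices.
tokens = [1, 10, 3, 2, 3, 2, 3, 2,
          1, 10, 3, 2, 10, 5, 10, 6, 8]

counts = Counter(tokens)
N = [counts.get(j, 0) for j in range(1, len(vocab) + 1)]   # [2,4,4,0,1,1,0,1,0,4]

alpha = [1] * len(vocab)                                   # Dir(1, ..., 1) prior
total = sum(N) + sum(alpha)                                # 17 + 10 = 27
predictive = [(n + a) / total for n, a in zip(N, alpha)]

for word, p in zip(vocab, predictive):
    print(f"{word:>7}: {p:.3f}")   # lamb, little and unk all get 5/27 ~ 0.185
```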