Natural Language Processing
Introduction to Probability
Joakim Nivre
Uppsala University
Department of Linguistics and Philology
[email protected]

Probability and Statistics

"Once upon a time, there was a ..."

- Can you guess the next word?
- Hard in general, because language is not deterministic
- But some words are more likely than others
- We can model uncertainty using probability theory
- We can use statistics to ground our models in empirical data

The Mathematical Notion of Probability

- The probability of A, P(A), is a real number between 0 and 1:
  1. If P(A) = 0, then A is impossible (never happens)
  2. If P(A) = 1, then A is necessary (always happens)
  3. If 0 < P(A) < 1, then A is possible (may happen)
- A is an event in a sample space Ω
- Sample space = all possible outcomes of an "experiment"
- Event = a subset of the sample space
- Events can be described as a variable taking a certain value
  1. {w ∈ Ω | w is a noun} ⇔ PoS = noun
  2. {s ∈ Ω | s consists of 8 words} ⇔ #Words = 8

Logical Operations on Events

- Often we are interested in combinations of two or more events
- This can be represented using set-theoretic operations
- Assume a sample space Ω and two events A and B:
  1. Complement Ā (also A′) = all elements of Ω that are not in A
  2. Subset A ⊆ B = all elements of A are also elements of B
  3. Union A ∪ B = all elements of Ω that are in A or B
  4. Intersection A ∩ B = all elements of Ω that are in A and B
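
The set-theoretic operations above map directly onto Python's built-in set type. A minimal sketch, using a small hypothetical sample space of hand-tagged words (the words and tags are invented for illustration):

```python
# Hypothetical sample space: a handful of words, tagged by hand.
omega = {"dog", "cat", "run", "eat", "blue", "happy"}
nouns = {"dog", "cat"}          # event A: the word is a noun
verbs = {"run", "eat"}          # event B: the word is a verb

complement_a = omega - nouns    # Ā: everything that is not a noun
union = nouns | verbs           # A ∪ B: noun or verb
intersection = nouns & verbs    # A ∩ B: noun and verb (empty here)
is_subset = nouns <= omega      # A ⊆ Ω always holds

print(complement_a, union, intersection, is_subset)
```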

Venn Diagrams

[Venn diagram figures omitted in this transcript]

Axioms of Probability

- P(A) = the probability of event A
- Axioms:
  1. P(A) ≥ 0
  2. P(Ω) = 1
  3. If A and B are disjoint, then P(A ∪ B) = P(A) + P(B)

Probability of an Event

- If A is an event and {x_1, ..., x_n} its individual outcomes, then

    P(A) = Σ_{i=1}^{n} P(x_i)

- Assume all 3-letter strings are equally probable
- What is the probability of a string of three vowels?
  1. There are 26 letters, of which 6 are vowels
  2. There are N = 26³ 3-letter strings
  3. There are n = 6³ 3-vowel strings
  4. Each outcome (string) is equally likely with P(x_i) = 1/N
  5. So, a string of three vowels has probability P(A) = n/N = 6³/26³ ≈ 0.012
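
The counting argument can be verified by brute-force enumeration. A minimal sketch, assuming the six vowels are a, e, i, o, u, y as counted on the slide:

```python
from itertools import product
from string import ascii_lowercase

vowels = set("aeiouy")  # the 6 vowels assumed in the slide's count

# Enumerate all 26^3 equally probable 3-letter strings.
strings = list(product(ascii_lowercase, repeat=3))
vowel_strings = [s for s in strings if set(s) <= vowels]

print(len(vowel_strings) / len(strings))  # 6**3 / 26**3 ≈ 0.0123
```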

Rules of Probability

- Theorems:
  1. If A and Ā are complementary events, then P(Ā) = 1 − P(A)
  2. P(∅) = 0 for any sample space Ω
  3. If A ⊆ B, then P(A) ≤ P(B)
  4. For any event A, 0 ≤ P(A) ≤ 1

Addition Rule

- Axiom 3 allows us to add probabilities of disjoint events
- What about events that are not disjoint?
- Theorem: If A and B are events in Ω, then

    P(A ∪ B) = P(A) + P(B) − P(A ∩ B)

- A = "has glasses", B = "is blond"
- P(A) + P(B) counts blondes with glasses twice
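
A quick numeric check of the addition rule on a small hypothetical population (the ten people and their attributes are invented for illustration), showing how subtracting P(A ∩ B) corrects for double counting:

```python
# Hypothetical population of 10 people, identified by index.
omega = set(range(10))
glasses = {0, 1, 2, 3}        # event A: wears glasses
blond = {2, 3, 4, 5, 6}       # event B: is blond

def p(event):
    # Uniform probability: relative size of the event within omega.
    return len(event) / len(omega)

lhs = p(glasses | blond)
rhs = p(glasses) + p(blond) - p(glasses & blond)
print(lhs, rhs)  # both 0.7: the overlap {2, 3} is counted only once
```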

Quiz 1

- Assume that the probability of winning in a lottery is 0.01
- What is the probability of not winning?
  1. 0.01
  2. 0.99
  3. Impossible to tell

Quiz 2

- Assume that A and B are events in a sample space Ω
- Which of the following could possibly hold:
  1. P(A ∪ B) < P(A ∩ B)
  2. P(A ∪ B) = P(A ∩ B)
  3. P(A ∪ B) > P(A ∩ B)

Natural Language Processing
Joint, Conditional and Marginal Probability
Joakim Nivre
Uppsala University
Department of Linguistics and Philology
[email protected]

Conditional Probability

- Given events A and B in Ω, with P(B) > 0, the conditional probability of A given B is:

    P(A|B) =def P(A ∩ B) / P(B)

- P(A ∩ B) or P(A, B) is the joint probability of A and B
- The probability that a person is rich and famous – joint
- The probability that a person is rich if they are famous – conditional
- The probability that a person is famous if they are rich – conditional

Conditional Probability

- P(A) = size of A relative to Ω
- P(A, B) = size of A ∩ B relative to Ω
- P(A|B) = size of A ∩ B relative to B

Example

- We sample word bigrams (pairs) from a large text T
- Sample space and events:
  - Ω = {(w_1, w_2) ∈ T} = the set of bigrams in T
  - A = {(w_1, w_2) ∈ T | w_1 = run} = bigrams starting with run
  - B = {(w_1, w_2) ∈ T | w_2 = amok} = bigrams ending with amok
- Probabilities:
  - P(run_1) = P(A) = 10⁻³
  - P(amok_2) = P(B) = 10⁻⁶
  - P(run_1, amok_2) = P(A, B) = 10⁻⁷
- Probability of amok following run? Of run preceding amok?
  - P(run before amok) = P(A|B) = 10⁻⁷ / 10⁻⁶ = 0.1
  - P(amok after run) = P(B|A) = 10⁻⁷ / 10⁻³ = 0.0001
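
Plugging the probabilities from the slide into the definition of conditional probability gives both answers directly. A minimal sketch:

```python
p_run_first = 1e-3      # P(A): bigram starts with "run"
p_amok_second = 1e-6    # P(B): bigram ends with "amok"
p_run_amok = 1e-7       # P(A, B): the bigram "run amok"

p_amok_after_run = p_run_amok / p_run_first     # P(B|A) = 0.0001
p_run_before_amok = p_run_amok / p_amok_second  # P(A|B) = 0.1
print(p_amok_after_run, p_run_before_amok)
```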

Multiplication Rule for Joint Probability

- Given events A and B in Ω, with P(B) > 0:

    P(A, B) = P(B)P(A|B)

- Since A ∩ B = B ∩ A, we also have:

    P(A, B) = P(A)P(B|A)

- The multiplication rule is also known as the chain rule

Quiz 1

- The probability of winning the Nobel Prize if you have a PhD in Physics is 1 in a million [P(A|B) = 0.000001]
- Only 1 in 10,000 people have a PhD in Physics [P(B) = 0.0001]
- What is the probability of a person both having a PhD in Physics and winning the Nobel Prize? [P(A, B) = ?]
  1. Smaller than 1 in a million
  2. Greater than 1 in a million
  3. Impossible to tell

Marginal Probability

- Marginalization, or the law of total probability
- If events B_1, ..., B_k constitute a partition of the sample space Ω (and P(B_i) > 0 for all i), then for any event A in Ω:

    P(A) = Σ_{i=1}^{k} P(A, B_i) = Σ_{i=1}^{k} P(A|B_i)P(B_i)

- Partition = pairwise disjoint and B_1 ∪ · · · ∪ B_k = Ω

Joint, Marginal and Conditional

- Joint probabilities for rain and wind:

                 no wind   some wind   strong wind   storm
    no rain        0.10       0.20         0.05       0.01
    light rain     0.05       0.10         0.15       0.04
    heavy rain     0.05       0.10         0.10       0.05

- Marginalize to get simple probabilities:
  - P(no wind) = 0.1 + 0.05 + 0.05 = 0.2
  - P(light rain) = 0.05 + 0.1 + 0.15 + 0.04 = 0.34
- Combine to get conditional probabilities:
  - P(no wind|light rain) = 0.05 / 0.34 ≈ 0.147
  - P(light rain|no wind) = 0.05 / 0.2 = 0.25
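
The same joint table can be manipulated in Python: marginalization is a row or column sum, and conditioning divides a joint entry by a marginal. A sketch reusing the numbers from the slide:

```python
# Joint probabilities P(rain, wind) from the table above.
joint = {
    ("no rain", "no wind"): 0.10, ("no rain", "some wind"): 0.20,
    ("no rain", "strong wind"): 0.05, ("no rain", "storm"): 0.01,
    ("light rain", "no wind"): 0.05, ("light rain", "some wind"): 0.10,
    ("light rain", "strong wind"): 0.15, ("light rain", "storm"): 0.04,
    ("heavy rain", "no wind"): 0.05, ("heavy rain", "some wind"): 0.10,
    ("heavy rain", "strong wind"): 0.10, ("heavy rain", "storm"): 0.05,
}

def p_wind(w):
    # Marginal P(wind = w): sum the column over all rain values.
    return sum(p for (r, wind), p in joint.items() if wind == w)

def p_rain(r):
    # Marginal P(rain = r): sum the row over all wind values.
    return sum(p for (rain, w), p in joint.items() if rain == r)

print(p_wind("no wind"))                                         # 0.2
print(p_rain("light rain"))                                      # 0.34
print(joint[("light rain", "no wind")] / p_rain("light rain"))   # ≈ 0.147
print(joint[("light rain", "no wind")] / p_wind("no wind"))      # 0.25
```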

Bayes Law

- Given events A and B in sample space Ω:

    P(A|B) = P(A)P(B|A) / P(B)

- Follows from the definition using the chain rule
- Allows us to "invert" conditional probabilities
- Denominator can be computed using marginalization:

    P(B) = Σ_{i=1}^{k} P(B, A_i) = Σ_{i=1}^{k} P(B|A_i)P(A_i)

- Special case of partition: P(A), P(Ā)
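
A minimal sketch of "inverting" a conditional probability with Bayes law, using the binary partition {A, Ā} to compute the denominator; the numbers here are invented for illustration and are not from the slides:

```python
# Hypothetical numbers: P(A) is a prior, P(B|A) and P(B|Ā) are likelihoods.
p_a = 0.2
p_b_given_a = 0.9
p_b_given_not_a = 0.3

# Marginalize over the partition {A, Ā} to get P(B).
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes law: invert the conditional.
p_a_given_b = p_a * p_b_given_a / p_b
print(p_a_given_b)  # 0.18 / 0.42 ≈ 0.43
```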

Independence

- Two events A and B are independent if and only if:

    P(A, B) = P(A)P(B)

- Equivalently:

    P(A) = P(A|B)
    P(B) = P(B|A)

- Example:
  - P(run_1) = P(A) = 10⁻³
  - P(amok_2) = P(B) = 10⁻⁶
  - P(run_1, amok_2) = P(A, B) = 10⁻⁷
  - A and B are not independent
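
Checking the bigram example numerically: under independence the joint probability would be P(A)P(B) = 10⁻⁹, but the observed joint is 10⁻⁷, about 100 times larger. A minimal sketch:

```python
p_a = 1e-3   # P(run in first position)
p_b = 1e-6   # P(amok in second position)
p_ab = 1e-7  # P(run, amok)

print(p_a * p_b)           # 1e-9: the joint probability under independence
print(p_ab / (p_a * p_b))  # ≈ 100: "run amok" is far more frequent than chance
```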

Quiz 2

- Research has shown that people with disease D exhibit symptom S with 0.9 probability
- A doctor finds that a patient has symptom S
- What can we conclude about the probability that the patient has disease D?
  1. The probability is 0.1
  2. The probability is 0.9
  3. Nothing

Natural Language Processing
Statistical Inference
Joakim Nivre
Uppsala University
Department of Linguistics and Philology
[email protected]

Statistical Inference

- Inference from a finite set of observations (a sample) to a larger set of unobserved instances (a population or model)
- Two main kinds of statistical inference:
  1. Estimation
  2. Hypothesis testing
- In natural language processing:
  - Estimation – learn model parameters (probability distributions)
  - Hypothesis tests – assess statistical significance of test results

Random Variables

- A random variable is a function X that partitions the sample space Ω by mapping outcomes to a value space Ω_X
- The probability function can be extended to variables:

    P(X = x) = P({ω ∈ Ω | X(ω) = x})

- Examples:
  1. The part-of-speech of a word X : Ω → {noun, verb, adj, ...}
  2. The number of words in a sentence Y : Ω → {1, 2, 3, ...}
- When we are not interested in particular values, we write P(X)

Expectation

- The expectation E[X] of a (discrete) numerical variable X is:

    E[X] = Σ_{x ∈ Ω_X} x · P(X = x)

- Example: The expectation of the sum Y of two dice:

    E[Y] = Σ_{y=2}^{12} y · P(Y = y) = 252/36 = 7
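
The dice example can be checked by enumerating all 36 equally likely outcomes. A minimal sketch:

```python
from itertools import product

# Enumerate the 36 equally likely outcomes of rolling two dice.
outcomes = [d1 + d2 for d1, d2 in product(range(1, 7), repeat=2)]

# E[Y] = sum over values y of y * P(Y = y); with equally likely outcomes
# this is just the average of the 36 sums.
expectation = sum(outcomes) / len(outcomes)
print(expectation)  # 7.0 (= 252/36)
```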

Entropy

- The entropy H[X] of a discrete random variable X is:

    H[X] = E[−log₂ P(X)] = −Σ_{x ∈ Ω_X} P(X = x) log₂ P(X = x)

- Entropy can be seen as the expected amount of information (in bits), or as the difficulty of predicting the variable
- Sum of two dice: −Σ_{y=2}^{12} P(Y = y) log₂ P(Y = y) ≈ 3.27
- 11-sided die (2–12): −Σ_{z=2}^{12} (1/11) log₂ (1/11) ≈ 3.46
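
Both entropy values can be reproduced with a few lines of Python. A minimal sketch:

```python
from collections import Counter
from itertools import product
from math import log2

# Distribution of the sum of two dice.
counts = Counter(d1 + d2 for d1, d2 in product(range(1, 7), repeat=2))
probs = [c / 36 for c in counts.values()]
h_sum = -sum(p * log2(p) for p in probs)

# Uniform 11-sided die with faces 2..12.
h_uniform = -sum((1 / 11) * log2(1 / 11) for _ in range(11))

print(h_sum, h_uniform)  # ≈ 3.27 vs ≈ 3.46: the uniform die is harder to predict
```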

Quiz 1

- Let X be a random variable that maps (English) words to the number of characters they contain
- For example, X(run) = 3 and X(amok) = 4
- Which of the following statements do you think are true:
  1. P(X = 0) = 0
  2. P(X = 5) < P(X = 50)
  3. E[X] < 50

Statistical Samples

- A random sample of a variable X is a vector (X_1, ..., X_N) of independent variables X_i with the same distribution as X
- It is said to be i.i.d. = independent and identically distributed
- In practice, it is often hard to guarantee this
  - Observations may not be independent (not i.)
  - Distribution may be biased (not i.d.)
- What is the intended population?
  - A Harry Potter novel is a good sample of J.K. Rowling, or fantasy fiction, but not of scientific prose
  - This is relevant for domain adaptation in NLP

Estimation

- Given a random sample of X, we can define sample variables, such as the sample mean:

    X̄ = (1/N) Σ_{i=1}^{N} X_i

- Sample variables can be used to estimate model parameters (population variables)
  1. Point estimation: use variable X̄ to estimate parameter θ
  2. Interval estimation: use variables X_min and X_max to construct an interval such that P(X_min < θ < X_max) = p, where p is the confidence level adopted

Maximum Likelihood Estimation (MLE)

- Likelihood of parameters θ given sample x_1, ..., x_N:

    L(θ|x_1, ..., x_N) = P(x_1, ..., x_N|θ) = ∏_{i=1}^{N} P(x_i|θ)

- Maximum likelihood estimation – choose θ to maximize L:

    max_θ L(θ|x_1, ..., x_N)

- Basic idea:
  - A good sample should have a high probability of occurring
  - Thus, choose the estimate that maximizes sample probability

Examples

- Sample mean is an MLE of expectation:

    Ê[X] = X̄

  - For example, estimate expected sentence length in a certain type of text by mean sentence length in a representative sample
- Relative frequency is an MLE of probability:

    P̂(X = x) = f(x) / N

  - For example, estimate the probability of a word being a noun by the relative frequency of nouns in a suitable corpus
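
A minimal sketch of relative-frequency estimation, using a tiny invented sample of part-of-speech tags (the tags are hypothetical, for illustration only):

```python
from collections import Counter

# Hypothetical sample of part-of-speech tags for 10 observed words.
tags = ["noun", "verb", "noun", "det", "noun",
        "adj", "verb", "noun", "det", "noun"]

counts = Counter(tags)
n = len(tags)

# MLE: relative frequency f(x) / N for each tag value.
p_hat = {tag: c / n for tag, c in counts.items()}
print(p_hat["noun"])  # 0.5: 5 nouns out of 10 words
```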

MLE for Different Distributions

- Joint distribution of X and Y:

    P̂_MLE(X = x, Y = y) = f(x, y) / N

- Marginal distribution of X:

    P̂_MLE(X = x) = Σ_{y ∈ Ω_Y} P̂_MLE(X = x, Y = y)

- Conditional distribution of X given Y:

    P̂_MLE(X = x|Y = y) = P̂_MLE(X = x, Y = y) / P̂_MLE(Y = y)
                        = P̂_MLE(X = x, Y = y) / Σ_{x ∈ Ω_X} P̂_MLE(X = x, Y = y)
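
The same recipe extends to joint, marginal, and conditional distributions estimated from paired counts. A minimal sketch on a tiny invented sample of (word, tag) pairs:

```python
from collections import Counter

# Hypothetical paired observations (X = word, Y = tag), N = 6.
pairs = [("run", "verb"), ("run", "noun"), ("run", "verb"),
         ("amok", "adv"), ("time", "noun"), ("time", "noun")]

n = len(pairs)
joint_counts = Counter(pairs)

def p_joint(x, y):
    # Joint MLE: f(x, y) / N
    return joint_counts[(x, y)] / n

def p_marginal_x(x):
    # Marginal MLE: sum the joint over all values of Y.
    return sum(c for (xi, _), c in joint_counts.items() if xi == x) / n

def p_cond_x_given_y(x, y):
    # Conditional MLE: joint divided by the marginal of Y.
    p_y = sum(c for (_, yi), c in joint_counts.items() if yi == y) / n
    return p_joint(x, y) / p_y

print(p_joint("run", "verb"))            # 2/6 ≈ 0.333
print(p_marginal_x("run"))               # 3/6 = 0.5
print(p_cond_x_given_y("time", "noun"))  # 2/3 ≈ 0.667
```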

Quiz 2

- Consider the following sample of English words:
  {once, upon, a, time, there, was, a, frog}
- What is the MLE of word length (number of characters) based on this sample?
  1. 4
  2. 8
  3. 3.25