Binomial distribution
Number of heads when a coin with heads bias p ∈ [0, 1] is tossed n times:
binomial distribution
S ∼ Bin(n, p) .
Probability mass function: for any k ∈ {0, 1, 2, . . . , n},
P(S = k) = \binom{n}{k} p^k (1 − p)^{n−k} .
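As a sanity check, the pmf can be evaluated directly. A minimal sketch, assuming nothing beyond the formula above; the values of n and p are arbitrary illustrations:

```python
# Minimal sketch of the Bin(n, p) pmf; n and p here are arbitrary examples.
from math import comb

def binom_pmf(n: int, p: float, k: int) -> float:
    """P(S = k) for S ~ Bin(n, p): C(n, k) * p^k * (1 - p)^(n - k)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 20, 0.5
# The probabilities over k = 0, ..., n sum to 1.
total = sum(binom_pmf(n, p, k) for k in range(n + 1))
```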
[Figure: probability mass function Pr[S = k] of a binomial distribution, plotted against k.]

Special case: Bernoulli distribution

The outcome of a coin toss with heads bias p ∈ [0, 1]:

X ∼ Bern(p) = Bin(1, p),
P(X = 1) = p ,   P(X = 0) = 1 − p .

Mean:
E(X) = P(X = 0) · 0 + P(X = 1) · 1 = p .

Variance:
var(X) = E[(X − E(X))²] = p(1 − p) .

(Standard deviation is √var(X); more convenient to use than E|X − E(X)|.)

Binomial = sums of i.i.d. Bernoullis

Let X1, X2, . . . , Xn be i.i.d. Bern(p) random variables, and let S ∼ Bin(n, p).
Then S has the same distribution as X1 + X2 + · · · + Xn.

Mean: By linearity of expectation,
E(S) = E( ∑_{i=1}^n X_i ) = ∑_{i=1}^n E(X_i) = np .

Variance: Since X1, X2, . . . , Xn are independent,
var(S) = var( ∑_{i=1}^n X_i ) = ∑_{i=1}^n var(X_i) = np(1 − p) .
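The mean and variance formulas can be checked exactly against the pmf. A sketch; n and p are illustrative choices:

```python
# Check E(S) = n*p and var(S) = n*p*(1-p) exactly from the pmf.
from math import comb

def binom_pmf(n: int, p: float, k: int) -> float:
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 50, 0.3
mean = sum(k * binom_pmf(n, p, k) for k in range(n + 1))
var = sum((k - mean) ** 2 * binom_pmf(n, p, k) for k in range(n + 1))
# mean -> n*p = 15.0, var -> n*p*(1-p) = 10.5 (up to floating-point error)
```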
Test error rate

Let f̂ : X → Y be a classifier, and suppose you have i.i.d. test data T
(that are independent of f̂); let n := |T |.

True error rate (with (X, Y) ∼ P):
err(f̂) = P(f̂(X) ≠ Y) .

Test error rate:
err(f̂, T) = (1/n) ∑_{(x,y)∈T} 1{f̂(x) ≠ y} .

The random variables {1{f̂(x) ≠ y}}_{(x,y)∈T} are independent and identically
distributed as Bern(err(f̂)).

Distribution of test error rate:
n · err(f̂, T) ∼ Bin(n, err(f̂)) .

Deviations from the mean

Question: What are the “typical” values (i.e., non-tail event) of S ∼ Bin(n, p)?

[Figure: probability mass function Pr[S = k] plotted against k, with the tails of the distribution highlighted.]

How do we quantify the probability mass in the tails?
Chernoff bound: large deviations

Let S ∼ Bin(n, p), and define
RE(a‖b) := a ln(a/b) + (1 − a) ln((1 − a)/(1 − b)) ≥ 0   (= 0 iff a = b) ,
the relative entropy between Bernoulli distributions with heads biases a and b.
(Measures how different the distributions are.)

Upper tail bound: For any u > p,
P(S ≥ n · u) ≤ exp(−RE(u‖p) · n) .

Lower tail bound: For any ℓ < p,
P(S ≤ n · ℓ) ≤ exp(−RE(ℓ‖p) · n) .

Both are exponentially small in n.

Illustration of large deviations

Consider S ∼ Bin(n, 1/3) and u = 1/3 + 0.05 ≈ 0.383. What is P(S ≥ n · u)?
Here exp(−RE(u‖p)) ≈ 0.995, so the upper tail bound gives P(S ≥ n · u) ≤ 0.995^n.

Large deviations from the mean p · n (e.g., (u − p) · n) are exponentially unlikely.
Moderate deviations

How large are “typical” deviations?

“Fact”: S ∼ Bin(n, p) is “typically” within a few standard deviations of its mean.

[Figure: probability mass function of Bin(1000, 1/3); np ≈ 333.333, 2√(np(1 − p)) ≈ 29.8142.]

To derive the “fact”, we can again use the Chernoff bound
P(S ≥ n · u) ≤ exp(−RE(u‖p) · n) .
How small can u be before the bound exceeds some fixed δ ∈ (0, 1)?

By calculus, for u > p,
RE(u‖p) ≥ (u − p)² / (2u) .
Therefore, for u > p,
P(S ≥ n · u) ≤ exp(−RE(u‖p) · n) ≤ exp(−(u − p)²/(2u) · n) .

By algebra, the RHS is δ when
n · u = n · p + √(2np ln(1/δ)) + 2 ln(1/δ) .
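Both the quadratic lower bound on RE and the resulting δ-level threshold can be checked numerically. A sketch; p, n, and δ are illustrative choices:

```python
# Numerically check RE(u||p) >= (u - p)^2 / (2u) for u > p, and that the
# Chernoff bound is at most delta at the threshold n*u = n*p + sqrt(2*n*p*ln(1/delta)) + 2*ln(1/delta).
from math import exp, log

def RE(a, b):
    return a * log(a / b) + (1 - a) * log((1 - a) / (1 - b))

p = 0.3
grid = [0.31 + 0.01 * i for i in range(60)]   # u ranging over (p, 1)
quad_ok = all(RE(u, p) >= (u - p) ** 2 / (2 * u) for u in grid)

n, delta = 1000, 0.05
L = log(1 / delta)
u = p + (2 * p * L / n) ** 0.5 + 2 * L / n    # the threshold from the slide, divided by n
bound = exp(-RE(u, p) * n)                    # should be <= delta
```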
Moderate deviations

A similar argument works for the lower tail. By calculus, for ℓ < p ≤ 1/2,
RE(ℓ‖p) ≥ (p − ℓ)² / (2p) .
Therefore, for ℓ < p ≤ 1/2,
P(S ≤ n · ℓ) ≤ exp(−RE(ℓ‖p) · n) ≤ exp(−(p − ℓ)²/(2p) · n) .
By algebra, the RHS is δ when
n · ℓ = n · p − √(2np ln(1/δ)) .

Combining the upper and lower tail bounds via the union bound
(P(A ∪ B) ≤ P(A) + P(B)): for p ≤ 1/2,
P( S ∈ [ np − √(2np ln(1/δ)),  np + √(2np ln(1/δ)) + 2 ln(1/δ) ] ) ≥ 1 − 2δ .

[Figure: pmf with the central interval carrying mass ≥ 1 − 2δ and each tail carrying mass ≤ δ.]
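The combined two-sided interval can be checked exactly by summing the pmf over the interval. A sketch; n, p, and δ are illustrative choices with p ≤ 1/2:

```python
# Check the two-sided "moderate deviations" interval exactly from the pmf.
from math import comb, log

def binom_pmf(n, p, k):
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p, delta = 200, 0.4, 0.1                  # illustrative choices with p <= 1/2
L = log(1 / delta)
lo = n * p - (2 * n * p * L) ** 0.5
hi = n * p + (2 * n * p * L) ** 0.5 + 2 * L
mass = sum(binom_pmf(n, p, k) for k in range(n + 1) if lo <= k <= hi)
# mass should be at least 1 - 2*delta = 0.8
```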
Estimating a coin bias

Another interpretation: estimating a heads bias p ≤ 1/2 from an i.i.d. sample
X1, X2, . . . , Xn with
p̂ := (X1 + X2 + · · · + Xn) / n .

Since n · p̂ ∼ Bin(n, p), the previous bound gives
P( p − √(2p ln(1/δ)/n) ≤ p̂ ≤ p + √(2p ln(1/δ)/n) + 2 ln(1/δ)/n ) ≥ 1 − 2δ ;
i.e., the estimate p̂ is usually reasonably close to the truth p.

How close? Depends on:
▶ whether you’re asking about how far above p or how far below p
  (the upper and lower tails are somewhat asymmetric);
▶ the sample size n;
▶ the true heads bias p itself;
▶ the “confidence” parameter δ.

This suggests a rough idea of the resolution at which you can distinguish
classifiers’ error rates, based on the size of the test set.

Key takeaways

1. Large (Ω(n)) and “typical” (O(√n)) deviations for Bin(n, p).
2. Use of the Chernoff bound to reason about error rates.
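As a closing check on the coin-bias interval from “Estimating a coin bias”, a seeded simulation of p̂ works well. A sketch; n, p, δ, and the trial count are arbitrary choices:

```python
# Simulate the estimator p-hat and measure how often it lands in the
# two-sided interval; coverage should be at least 1 - 2*delta.
import random
from math import log

random.seed(0)
n, p, delta = 500, 0.25, 0.1
L = log(1 / delta)
lo = p - (2 * p * L / n) ** 0.5
hi = p + (2 * p * L / n) ** 0.5 + 2 * L / n

trials = 2000
inside = 0
for _ in range(trials):
    phat = sum(random.random() < p for _ in range(n)) / n
    inside += lo <= phat <= hi
coverage = inside / trials   # should be >= 1 - 2*delta = 0.8
```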