Binomial distribution
Number of heads when a coin with heads bias p ∈ [0, 1] is tossed n times:
S ∼ Bin(n, p)   (the binomial distribution).
Probability mass function: for any k ∈ {0, 1, 2, . . . , n},
P(S = k) = (n choose k) · p^k (1 − p)^(n−k) .
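As a quick sanity check (my own illustration, not part of the original slides), here is a minimal Python sketch that evaluates this probability mass function, assuming SciPy is available; the explicit formula is compared against scipy.stats.binom.pmf. The values n = 50, p = 0.2 are arbitrary examples.

```python
from math import comb
from scipy.stats import binom

n, p = 50, 0.2  # example values, not taken from the slides

for k in (0, 5, 10, 20):
    # explicit formula: (n choose k) * p^k * (1 - p)^(n - k)
    manual = comb(n, k) * p**k * (1 - p) ** (n - k)
    # same quantity via scipy.stats.binom
    library = binom.pmf(k, n, p)
    print(k, manual, library)  # the two values should agree up to rounding
```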
Binomial distribution
[Figure: probability mass function Pr[S = k] of a binomial distribution, plotted against k (0 to 50).]
Special case: Bernoulli distribution
The outcome of a coin toss with heads bias p ∈ [0, 1]:
P(X = 1) = p ,    P(X = 0) = 1 − p .
This is the Bernoulli distribution: X ∼ Bern(p) = Bin(1, p).
Mean:
E(X) = P(X = 0) · 0 + P(X = 1) · 1 = p .
Variance:
var(X) = E[(X − E(X))²] = p(1 − p) .
(The standard deviation is √var(X); more convenient to use than E|X − E(X)|.)
Binomial = sums of i.i.d. Bernoullis
Let X1 , X2 , . . . , Xn be i.i.d. Bern(p) random variables, and let S ∼ Bin(n, p).
Then S has the same distribution as X1 + X2 + · · · + Xn .
Mean: By linearity of expectation,
E(S) = E(X1 + · · · + Xn) = E(X1) + · · · + E(Xn) = np .
Variance: Since X1 , X2 , . . . , Xn are independent,
var(S) = var(X1 + · · · + Xn) = var(X1) + · · · + var(Xn) = np(1 − p) .
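A small simulation (my own illustration, with arbitrary example values n = 100, p = 0.3) of the sum-of-Bernoullis view: sums of n i.i.d. Bern(p) draws have empirical mean and variance close to np and np(1 − p).

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, trials = 100, 0.3, 200_000  # example values, chosen for illustration

# each row is one experiment: n i.i.d. Bernoulli(p) coin tosses
tosses = rng.random((trials, n)) < p
S = tosses.sum(axis=1)            # S has the Bin(n, p) distribution

print("empirical mean:", S.mean(), " vs  np       =", n * p)
print("empirical var: ", S.var(),  " vs  np(1-p)  =", n * p * (1 - p))
```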
Test error rate
Let f̂ : X → Y be a classifier, and suppose you have i.i.d. test data T
(that are independent of f̂); let n := |T |.
True error rate (with (X, Y) ∼ P):
err(f̂) = P(f̂(X) ≠ Y) .
Test error rate:
err(f̂, T) = (1/n) Σ_{(x,y)∈T} 1{f̂(x) ≠ y} .
The random variables {1{f̂(x) ≠ y}}_{(x,y)∈T} are independent and identically
distributed as Bern(err(f̂)).
Distribution of test error rate:
n · err(f̂, T) ∼ Bin(n, err(f̂)) .
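To make these quantities concrete, here is a hedged sketch (the names test_error_rate, predict, X_test, y_test are illustrative placeholders, not from the slides) of computing the test error rate as an average of 0/1 losses; the corresponding count of mistakes is the Bin(n, err(f̂)) random variable above.

```python
import numpy as np

def test_error_rate(predict, X_test, y_test):
    """Empirical error rate of a fixed classifier on held-out test data.

    `predict` maps an array of inputs to predicted labels; (X_test, y_test)
    play the role of the i.i.d. test set T, with n = len(y_test).
    """
    mistakes = np.asarray(predict(X_test)) != np.asarray(y_test)  # the 0/1 losses 1{f̂(x) != y}
    return mistakes.mean()  # err(f̂, T); note n * err(f̂, T) ~ Bin(n, err(f̂))

# Toy example: a "classifier" that always predicts label 0
X_test = np.arange(10)
y_test = np.array([0, 0, 1, 0, 1, 0, 0, 0, 1, 0])
print(test_error_rate(lambda X: np.zeros_like(X), X_test, y_test))  # -> 0.3
```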
Deviations from the mean
Question: What are the “typical” values (i.e., non-tail event) of S ∼ Bin(n, p)?
[Figure: probability mass function Pr[S = k] of a binomial distribution, plotted against k (0 to 90).]
How do we quantify the probability mass in the tails?
Chernoff bound: large deviations
Let S ∼ Bin(n, p), and define
RE(a‖b) := a ln(a/b) + (1 − a) ln((1 − a)/(1 − b)) ≥ 0   (= 0 iff a = b) ,
the relative entropy between Bernoulli distributions with heads biases a and b.
(Measures how different the distributions are.)
Upper tail bound: For any u > p,
P(S ≥ n · u) ≤ exp(−RE(u‖p) · n) .
Lower tail bound: For any ℓ < p,
P(S ≤ n · ℓ) ≤ exp(−RE(ℓ‖p) · n) .
Both exponentially small in n.
Large deviations from mean p · n (e.g., (u − p) · n) are exponentially unlikely.
Illustration of large deviations
Consider S ∼ Bin(n, 1/3) and u = 1/3 + 0.05 ≈ 0.383.
What is P(S ≥ n · u)?
Chernoff bound: P(S ≥ n · u) ≤ exp(−RE(u‖p) · n), where exp(−RE(u‖p)) ≈ 0.995.
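A small numeric check of this illustration (my own, assuming SciPy is available): it computes RE(u‖p) for u ≈ 0.383 and p = 1/3, confirms exp(−RE(u‖p)) ≈ 0.995, and compares the bound exp(−RE(u‖p) · n) with the exact upper tail P(S ≥ n · u) for a few illustrative values of n.

```python
import math
from scipy.stats import binom

def RE(a, b):
    """Relative entropy RE(a || b) between Bern(a) and Bern(b)."""
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

p = 1 / 3
u = 1 / 3 + 0.05
print("exp(-RE(u||p)) =", math.exp(-RE(u, p)))  # about 0.995

for n in (100, 1000, 10000):
    threshold = math.ceil(n * u)
    exact = binom.sf(threshold - 1, n, p)   # exact P(S >= n*u)
    bound = math.exp(-RE(u, p) * n)         # Chernoff upper tail bound
    print(n, exact, bound)                  # the bound always dominates the exact tail
```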
Moderate deviations
How large are “typical” deviations?
“Fact”: S ∼ Bin(n, p) is “typically” within a few standard deviations of its mean.
[Figure: probability mass function Pr[S = k] of Bin(1000, 1/3); np ≈ 333.333 and 2·√(np(1 − p)) ≈ 29.8142.]
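As a quick check of the “fact” for the plotted example (my own computation, assuming SciPy): the probability that Bin(1000, 1/3) lands within two standard deviations of its mean is already large.

```python
import math
from scipy.stats import binom

n, p = 1000, 1 / 3
mean = n * p
sd = math.sqrt(n * p * (1 - p))  # about 14.9, so 2*sd is about 29.8

# P(|S - np| <= 2*sd), from the exact binomial CDF
lo, hi = math.floor(mean - 2 * sd), math.ceil(mean + 2 * sd)
prob = binom.cdf(hi, n, p) - binom.cdf(lo - 1, n, p)
print(prob)  # roughly 0.95: most of the mass is within two standard deviations
```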
Moderate deviations
To derive the “fact”, we can again use the Chernoff bound
P(S ≥ n · u) ≤ exp(−RE(u‖p) · n) .
How small can u be before the bound exceeds some fixed δ ∈ (0, 1)?
By calculus, for u > p,
RE(u‖p) ≥ (u − p)² / (2u) .
Therefore, for u > p,
P(S ≥ n · u) ≤ exp(−RE(u‖p) · n) ≤ exp(−((u − p)² / (2u)) · n) .
By algebra, the RHS is at most δ when
n · u = n · p + √(2np ln(1/δ)) + 2 ln(1/δ) .
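A numeric sanity check of this threshold (my own, assuming SciPy, with example values n = 1000, p = 1/3, δ = 0.05): at u = p + √(2p ln(1/δ)/n) + 2 ln(1/δ)/n, the simplified bound exp(−((u − p)²/(2u)) · n) is at most δ, and the exact tail P(S ≥ n · u) is smaller still.

```python
import math
from scipy.stats import binom

n, p, delta = 1000, 1 / 3, 0.05  # example values
t = math.log(1 / delta)

# threshold from the slide: n*u = n*p + sqrt(2*n*p*t) + 2*t
u = p + math.sqrt(2 * p * t / n) + 2 * t / n

simplified_bound = math.exp(-((u - p) ** 2 / (2 * u)) * n)
exact_tail = binom.sf(math.ceil(n * u) - 1, n, p)  # exact P(S >= n*u)

print("u:", u)
print("simplified bound:", simplified_bound, "(should be <= delta =", delta, ")")
print("exact tail:      ", exact_tail)
```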
Moderate deviations
Similar argument for lower tail.
By calculus, for ℓ < p ≤ 1/2,
RE(ℓ‖p) ≥ (p − ℓ)² / (2p) .
Therefore, for ℓ < p ≤ 1/2,
P(S ≤ n · ℓ) ≤ exp(−RE(ℓ‖p) · n) ≤ exp(−((p − ℓ)² / (2p)) · n) .
By algebra, the RHS is δ when
n · ℓ = n · p − √(2np ln(1/δ)) .
Moderate deviations
Combining upper and lower tail bounds: for p ≤ 1/2,
P( S ∈ [ np − √(2np ln(1/δ)), np + √(2np ln(1/δ)) + 2 ln(1/δ) ] ) ≥ 1 − 2δ .
[Figure: the pmf of S, with mass at least 1 − 2δ in the central interval and at most δ in each tail.]
Union bound: P(A ∪ B) ≤ P(A) + P(B) .
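A hedged numeric check of this two-sided statement (my own, assuming SciPy, with example values n = 1000, p = 1/3, δ = 0.05): the exact binomial probability of the interval is at least 1 − 2δ.

```python
import math
from scipy.stats import binom

n, p, delta = 1000, 1 / 3, 0.05   # example values with p <= 1/2
t = math.log(1 / delta)

lower = n * p - math.sqrt(2 * n * p * t)
upper = n * p + math.sqrt(2 * n * p * t) + 2 * t

# exact P(lower <= S <= upper) from the binomial CDF
coverage = binom.cdf(math.floor(upper), n, p) - binom.cdf(math.ceil(lower) - 1, n, p)
print(coverage, ">=", 1 - 2 * delta)  # coverage should be at least 1 - 2*delta
```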
Estimating a coin bias
Another interpretation: estimating heads bias p ≤ 1/2 from i.i.d. sample
X1 , X2 , . . . , Xn with
p̂ := (X1 + X2 + · · · + Xn) / n .
So
P( p − √(2p ln(1/δ)/n) ≤ p̂ ≤ p + √(2p ln(1/δ)/n) + 2 ln(1/δ)/n ) ≥ 1 − 2δ ;
i.e., the estimate p̂ is usually reasonably close to the truth p.
How close? Depends on:
▸ whether you’re asking about how far above p or how far below p
  (upper and lower tails are somewhat asymmetric);
▸ the sample size n;
▸ the true heads bias p itself;
▸ the “confidence” parameter δ.
Suggests rough idea of the resolution at which you can distinguish classifiers’
error rates, based on size of test set.
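To make the “resolution” point concrete, here is a minimal sketch (my own, with illustrative values for p, δ, and the test-set sizes) that evaluates the width of the two-sided deviation band for p̂; differences in error rates much smaller than this width are hard to distinguish from test-set noise.

```python
import math

def deviation_band(p, n, delta):
    """Upper and lower deviations of p_hat around p, from the slide's bound."""
    t = math.log(1 / delta)
    below = math.sqrt(2 * p * t / n)              # how far p_hat can fall below p
    above = math.sqrt(2 * p * t / n) + 2 * t / n  # how far p_hat can land above p
    return below, above

p, delta = 0.10, 0.05  # e.g., a classifier with 10% true error rate
for n in (100, 1000, 10000):
    below, above = deviation_band(p, n, delta)
    print(f"n={n:6d}: p_hat in [p - {below:.4f}, p + {above:.4f}] w.p. >= {1 - 2 * delta}")
```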
Key takeaways
1. Large (Ω(n)) and “typical” (O(√n)) deviations for Bin(n, p).
2. Use of Chernoff bound to reason about error rates.