Binomial distribution

Number of heads when a coin with heads bias p ∈ [0, 1] is tossed n times: binomial distribution, S ∼ Bin(n, p). Probability mass function: for any k ∈ {0, 1, 2, ..., n},

    P(S = k) = (n choose k) · p^k · (1 − p)^(n−k).

[Figure: probability mass function of a binomial distribution, Pr[S = k] versus k.]

Special case: Bernoulli distribution

Bernoulli distribution: X ∼ Bern(p) = Bin(1, p), the outcome of a single coin toss with heads bias p ∈ [0, 1]:

    P(X = 1) = p,    P(X = 0) = 1 − p.

Mean: E(X) = P(X = 0) · 0 + P(X = 1) · 1 = p.

Variance: var(X) = E[(X − E(X))²] = p(1 − p).

(The standard deviation is √var(X); it is more convenient to work with than E|X − E(X)|.)

Binomial = sums of i.i.d. Bernoullis

Let X_1, X_2, ..., X_n be i.i.d. Bern(p) random variables, and let S ∼ Bin(n, p). Then S has the same distribution as X_1 + X_2 + · · · + X_n.

Mean: by linearity of expectation,

    E(S) = E(Σ_{i=1}^{n} X_i) = Σ_{i=1}^{n} E(X_i) = np.

Variance: since X_1, X_2, ..., X_n are independent,

    var(S) = var(Σ_{i=1}^{n} X_i) = Σ_{i=1}^{n} var(X_i) = np(1 − p).

Test error rate

Let f̂ : X → Y be a classifier, and suppose you have i.i.d. test data T (independent of f̂); let n := |T|.

True error rate (with (X, Y) ∼ P):

    err(f̂) = P(f̂(X) ≠ Y).

Test error rate:

    err(f̂, T) = (1/n) · Σ_{(x,y)∈T} 1{f̂(x) ≠ y}.

The random variables {1{f̂(x) ≠ y}}_{(x,y)∈T} are independent and identically distributed as Bern(err(f̂)).

Distribution of the test error rate: n · err(f̂, T) ∼ Bin(n, err(f̂)).

Deviations from the mean

Question: what are the "typical" (i.e., non-tail) values of S ∼ Bin(n, p)?

[Figure: binomial probability mass function Pr[S = k] versus k, with the bulk of the mass near the mean and thin tails on either side.]

How do we quantify the probability mass in the tails?

Chernoff bound: large deviations

Let S ∼ Bin(n, p), and define

    RE(a‖b) := a ln(a/b) + (1 − a) ln((1 − a)/(1 − b)) ≥ 0    (= 0 iff a = b),

the relative entropy between Bernoulli distributions with heads biases a and b. (It measures how different the two distributions are.)

Upper tail bound: for any u > p,

    P(S ≥ n·u) ≤ exp(−RE(u‖p) · n).

Lower tail bound: for any ℓ < p,

    P(S ≤ n·ℓ) ≤ exp(−RE(ℓ‖p) · n).

Both bounds are exponentially small in n: large deviations from the mean p·n (e.g., by (u − p)·n) are exponentially unlikely.

Illustration of large deviations

Consider S ∼ Bin(n, 1/3) and u = 1/3 + 0.05 ≈ 0.383. What is P(S ≥ n·u)? Here exp(−RE(u‖p)) ≈ 0.995, so the upper tail bound gives P(S ≥ n·u) ≤ 0.995^n, which shrinks exponentially as n grows.

Moderate deviations

How large are "typical" deviations?

"Fact": S ∼ Bin(n, p) is "typically" within a few standard deviations of its mean.

[Figure: probability mass function of Bin(1000, 1/3); np ≈ 333.333 and 2√(np(1 − p)) ≈ 29.8142.]

To derive this "fact", we can again use the Chernoff bound

    P(S ≥ n·u) ≤ exp(−RE(u‖p) · n).

How small can u be before the bound exceeds some fixed δ ∈ (0, 1)?

By calculus, for u > p,

    RE(u‖p) ≥ (u − p)² / (2u).

Therefore, for u > p,

    P(S ≥ n·u) ≤ exp(−RE(u‖p) · n) ≤ exp(−(u − p)²/(2u) · n).

By algebra, the right-hand side is at most δ when

    n·u = n·p + √(2np ln(1/δ)) + 2 ln(1/δ).

A similar argument handles the lower tail. By calculus, for ℓ < p ≤ 1/2,

    RE(ℓ‖p) ≥ (p − ℓ)² / (2p).

Therefore, for ℓ < p ≤ 1/2,

    P(S ≤ n·ℓ) ≤ exp(−RE(ℓ‖p) · n) ≤ exp(−(p − ℓ)²/(2p) · n).

By algebra, the right-hand side is δ when

    n·ℓ = n·p − √(2np ln(1/δ)).

Combining the upper and lower tail bounds via the union bound P(A ∪ B) ≤ P(A) + P(B): for p ≤ 1/2,

    P( S ∈ [ np − √(2np ln(1/δ)),  np + √(2np ln(1/δ)) + 2 ln(1/δ) ] ) ≥ 1 − 2δ.

[Figure: the interval around np captures probability mass at least 1 − 2δ, with at most δ in each tail.]
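These tail bounds are easy to check numerically. Below is a minimal sketch in plain Python (standard library only); the function names re_bernoulli, exact_upper_tail, and chernoff_upper_tail are my own labels, not from the slides. It computes RE(u‖p), the exact upper tail P(S ≥ n·u), and the Chernoff bound exp(−RE(u‖p) · n) for the p = 1/3, u = 1/3 + 0.05 illustration above, so you can watch both quantities shrink exponentially with n.

```python
import math

def re_bernoulli(a: float, b: float) -> float:
    """Relative entropy RE(a||b) between Bern(a) and Bern(b)."""
    # Convention 0*ln(0) = 0 handles the a in {0, 1} edge cases.
    term1 = 0.0 if a == 0 else a * math.log(a / b)
    term2 = 0.0 if a == 1 else (1 - a) * math.log((1 - a) / (1 - b))
    return term1 + term2

def log_binom_pmf(n: int, k: int, p: float) -> float:
    """log P(S = k) for S ~ Bin(n, p), via log-gamma for numerical stability."""
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + k * math.log(p) + (n - k) * math.log(1 - p)

def exact_upper_tail(n: int, p: float, u: float) -> float:
    """Exact P(S >= n*u), summing the pmf over k >= ceil(n*u)."""
    k0 = math.ceil(n * u)
    return sum(math.exp(log_binom_pmf(n, k, p)) for k in range(k0, n + 1))

def chernoff_upper_tail(n: int, p: float, u: float) -> float:
    """Chernoff bound P(S >= n*u) <= exp(-RE(u||p) * n), valid for u > p."""
    return math.exp(-re_bernoulli(u, p) * n)

# Reproduce the illustration: p = 1/3, u = 1/3 + 0.05, for a few values of n.
p, u = 1 / 3, 1 / 3 + 0.05
for n in (100, 1000, 10000):
    print(f"n={n:6d}  exact={exact_upper_tail(n, p, u):.3e}  "
          f"chernoff={chernoff_upper_tail(n, p, u):.3e}")
```

The Chernoff bound is typically loose by a polynomial-in-n factor, but it captures the correct exponential rate of decay.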
Estimating a coin bias

Another interpretation: estimating a heads bias p ≤ 1/2 from an i.i.d. sample X_1, X_2, ..., X_n via

    p̂ := (X_1 + X_2 + · · · + X_n) / n.

So (dividing the interval above by n),

    P( p − √(2p ln(1/δ)/n) ≤ p̂ ≤ p + √(2p ln(1/δ)/n) + 2 ln(1/δ)/n ) ≥ 1 − 2δ;

i.e., the estimate p̂ is usually reasonably close to the truth p. How close? It depends on:

- whether you are asking how far above p or how far below p (the upper and lower tails are somewhat asymmetric);
- the sample size n;
- the true heads bias p itself;
- the "confidence" parameter δ.

This suggests a rough idea of the resolution at which you can distinguish classifiers' error rates, based on the size of the test set (a numerical sketch follows the takeaways below).

Key takeaways

1. Large (Ω(n)) and "typical" (O(√n)) deviations for Bin(n, p).
2. Use of the Chernoff bound to reason about error rates.
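To make the resolution point concrete, here is a small sketch (plain Python; the name deviation_bounds, the 10% true error rate, and the test-set sizes are illustrative assumptions, not from the notes) that evaluates the interval above for a few test-set sizes.

```python
import math

def deviation_bounds(p: float, n: int, delta: float) -> tuple[float, float]:
    """Half-widths of the (1 - 2*delta)-probability interval for p_hat around p,
    using the moderate-deviation bounds above (valid for p <= 1/2).
    Returns (below, above): p_hat >= p - below and p_hat <= p + above each
    fail with probability at most delta."""
    below = math.sqrt(2 * p * math.log(1 / delta) / n)
    above = below + 2 * math.log(1 / delta) / n
    return below, above

# Hypothetical example: a classifier with true error rate around 10%,
# evaluated on test sets of different sizes, with delta = 0.05 per tail.
p, delta = 0.10, 0.05
for n in (100, 1000, 10000):
    below, above = deviation_bounds(p, n, delta)
    print(f"n={n:6d}  err_hat in [{p - below:.3f}, {p + above:.3f}] "
          f"with probability >= {1 - 2 * delta:.2f}")
```

At n = 1000, for example, the interval extends roughly a few percentage points on either side of the true rate, so two classifiers whose true error rates differ by less than that are hard to tell apart with a test set of that size.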