CS B553: ALGORITHMS FOR OPTIMIZATION AND LEARNING
Parameter Learning (From data to distributions)

AGENDA
- Learning probability distributions from example data
- Generative vs. discriminative models
- Maximum likelihood estimation (MLE)
- Bayesian estimation

MOTIVATION
- Past lectures studied how to infer characteristics of a distribution, given a fully specified Bayes net
- Next few lectures: where does the Bayes net come from?
- Setting for this lecture:
  - Given a set of examples drawn from a distribution
  - Each example is complete (fully observable)
  - The BN structure is known, but the CPTs are unknown

DENSITY ESTIMATION
- Given a dataset D = {d[1], ..., d[M]} drawn from an underlying distribution P*
- Find a distribution that matches P* as closely as possible
- High-level issues:
  - There is usually not enough data to get an accurate picture of P*, which forces us to approximate
  - Even if we did have P*, how do we measure closeness? How do we maximize closeness?
- Two approaches: learning problems become either optimization problems or Bayesian inference problems

KULLBACK-LEIBLER DIVERGENCE
- Definition: given two probability distributions P and Q over X, the KL divergence (or relative entropy) from P to Q is
    D(P||Q) = E_{x~P}[ log(P(x)/Q(x)) ] = Sum_x P(x) log(P(x)/Q(x))
- Properties:
  - D(P||Q) >= 0
  - D(P||Q) = 0 iff P = Q "almost everywhere"
  - Not a true "metric" (it is non-symmetric)

APPLYING KL DIVERGENCE TO LEARNING
- Approach: given the underlying distribution P*, find P (within a class of distributions) so that the KL divergence is minimized:
    argmin_P D(P*||P) = argmin_P E_{x~P*}[ log P*(x) - log P(x) ]
                      = argmin_P ( E_{x~P*}[ log P*(x) ] - E_{x~P*}[ log P(x) ] )
                      = argmax_P E_{x~P*}[ log P(x) ]
- If we approximate P* with draws from D, we get
    argmax_P (1/|D|) Sum_i log P(d[i])
- Minimizing KL divergence to the empirical distribution is the same as maximizing the empirical log-likelihood

ANOTHER APPROACH: DISCRIMINATIVE LEARNING
Do we really want to model P*?
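The equivalence derived above (minimizing KL divergence to the empirical distribution equals maximizing the empirical log-likelihood) can be checked numerically. This sketch is not from the lecture; the dataset and variable names are made-up illustrations for a Bernoulli model.

```python
# Sketch (illustrative, not from the slides): verify numerically that the q
# minimizing KL divergence to the empirical distribution is the q maximizing
# the average log-likelihood. The data values are made up.
import math

data = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]      # 7 "cherry" draws out of 10
c, N = sum(data), len(data)
p_emp = {1: c / N, 0: (N - c) / N}          # empirical distribution

def kl_to_empirical(q):
    # D(P_emp || Q) where Q is Bernoulli(q)
    model = {1: q, 0: 1.0 - q}
    return sum(p * math.log(p / model[x]) for x, p in p_emp.items() if p > 0)

def avg_log_likelihood(q):
    model = {1: q, 0: 1.0 - q}
    return sum(math.log(model[x]) for x in data) / N

grid = [i / 1000 for i in range(1, 1000)]   # avoid q = 0 and q = 1
q_min_kl = min(grid, key=kl_to_empirical)
q_max_ll = max(grid, key=avg_log_likelihood)
assert q_min_kl == q_max_ll                 # same optimizer
print(q_min_kl)                             # 0.7, i.e. c/N
```

The two objectives differ only by the constant E_{x~P_emp}[log P_emp(x)], which does not depend on q, so they share the same optimizer.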
We may be more concerned with predicting the values of some subset of variables
- E.g., for a Bayes net CPT, we want P(Y|Pa_Y) but may not care about the distribution of Pa_Y
- Generative model: estimate P(X,Y)
- Discriminative model: estimate P(Y|X), ignore P(X)

TRAINING DISCRIMINATIVE MODELS
- Define a loss function l(y,x,P) that is given the ground truth y, x
- It measures the difference between the prediction P(Y|x) and the ground truth
- Examples:
  - Classification error: I[y != argmax_y' P(y'|x)]
  - Conditional log-likelihood: -log P(y|x)
- Strategy: minimize empirical loss

DISCRIMINATIVE VS. GENERATIVE
- Discriminative models:
  - Don't model the input distribution, so they may have more expressive power for the same level of complexity
  - May learn more accurate predictive models from a training dataset of the same size
  - Directly transcribe top-down evaluation of CPTs
- Generative models:
  - More flexible, because they don't require a priori selection of the dependent variable Y
  - Bottom-up inference is easier
- Both are useful in different situations

WHAT CLASS OF PROBABILITY MODELS?
- For small discrete distributions, just use a tabular representation
  - Very efficient learning techniques
- For large discrete distributions or continuous ones, the choice of probability model is crucial
- Increasing complexity =>
  - Can represent complex distributions more accurately
  - Need more data to learn well (risk of overfitting)
  - More expensive to learn and to perform inference

LEARNING COIN FLIPS
- Let the unknown fraction of cherries be q (hypothesis)
- Probability of drawing a cherry is q
- Suppose draws are independent and identically distributed (i.i.d.)
- Observe that c out of N draws are cherries (data)
- Intuition: c/N might be a good hypothesis (or it might not, depending on the draw!)
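The coin-flip intuition can be tested by evaluating the likelihood q^c (1-q)^(N-c) on a grid and locating its peak. This is a small sketch with assumed example counts (2 cherries in 5 draws), not code from the lecture.

```python
# Sketch (assumed counts, not from the slides): evaluate the likelihood
# L(q) = q^c * (1-q)^(N-c) on a grid of q values and locate its peak.
c, N = 2, 5                                  # 2 cherries observed in 5 draws

def likelihood(q):
    return q**c * (1.0 - q)**(N - c)

grid = [i / 1000 for i in range(1001)]
q_mle = max(grid, key=likelihood)
print(q_mle)                                 # 0.4, i.e. c/N
```

The grid peak lands on the empirical fraction c/N, matching the closed-form MLE derived below.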
MAXIMUM LIKELIHOOD
- Likelihood of data d = {d1, ..., dN} given q:
    P(d|q) = Prod_j P(dj|q)          (i.i.d. assumption)
           = q^c (1-q)^(N-c)         (gather the c cherry terms together, then the N-c lime terms)
[Figure: plots of the likelihood P(data|q) vs. q for observed counts of 1/1, 2/2, 2/3, 2/4, 2/5, 10/20, and 50/100 cherries]
- The peaks of the likelihood function seem to hover around the fraction of cherries...
- Sharpness indicates some notion of certainty...

MAXIMUM LIKELIHOOD
- Let P(d|q) be the likelihood function
- The quantity argmax_q P(d|q) is known as the maximum likelihood estimate (MLE)
- Take the log-likelihood:
    l(q) = log P(d|q) = log [ q^c (1-q)^(N-c) ]
         = log [ q^c ] + log [ (1-q)^(N-c) ]
         = c log q + (N-c) log (1-q)
- Setting dl/dq(q) = 0 gives the maximum likelihood estimate:
    dl/dq(q) = c/q - (N-c)/(1-q)
- At the MLE, c/q - (N-c)/(1-q) = 0 => q = c/N
- c and N are known as sufficient statistics for the parameter q: no other values give additional information about q

OTHER MLE RESULTS
- Categorical distributions (non-binary discrete variables): take the fraction of counts for each value (histogram)
- Continuous Gaussian distributions:
  - Mean = average of the data
  - Standard deviation = standard deviation of the data

MAXIMUM LIKELIHOOD FOR BN
- For any BN, the ML parameters of any CPT can be derived from the fraction of observed values in the data, conditioned on matching parent values
- Example: Earthquake/Burglar/Alarm network, N = 1000 samples
  - E observed 500 times => P(E) = 0.5
  - B observed 200 times => P(B) = 0.2
  - Alarm counts: A|E,B: 19/20; A|B: 188/200; A|E: 170/500; A|neither: 1/380
  - Resulting CPT:
      E  B  P(A|E,B)
      T  T  0.95
      F  T  0.95
      T  F  0.34
      F  F  0.003

PROOF
- Let the BN have structure G over variables X1, ..., Xn and parameters q
- Given dataset D:
    L(q; D) = Prod_m P_G(d[m]; q)
            = Prod_m Prod_i P_G(xi[m] | pa_Xi[m]; q)

FITTING CPTS
- Each ML entry P(xi | pa_Xi) is given by examining the counts of (xi, pa_Xi) in D and normalizing across the rows of the CPT
- Note that for large k = |Pa_Xi|, very few data points will share the values of pa_Xi!
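The count-and-normalize procedure for fitting a CPT can be sketched as follows. This is an illustration, not code from the lecture; the tiny dataset and the `p_a_given` helper are made up.

```python
# Sketch (illustrative, not from the slides): fit the CPT entry P(A | E, B)
# by counting matching records and normalizing. The dataset is made up.
from collections import Counter

# Each record is one complete observation: (e, b, a) as booleans
data = [
    (True,  True,  True), (True,  True,  True), (True,  True,  False),
    (True,  False, True), (True,  False, False), (True,  False, False),
    (False, False, False), (False, False, False), (False, False, True),
    (False, True,  True),
]

counts = Counter()            # counts of full configurations (e, b, a)
parent_counts = Counter()     # counts of parent configurations (e, b)
for e, b, a in data:
    counts[(e, b, a)] += 1
    parent_counts[(e, b)] += 1

def p_a_given(e, b):
    # ML estimate: fraction of records matching the parent values with A = true
    return counts[(e, b, True)] / parent_counts[(e, b)]

print(p_a_given(True, True))    # 2/3
print(p_a_given(True, False))   # 1/3
```

Note that with only 10 records, each parent configuration is already down to 2-3 matching data points, which previews the data-fragmentation problem discussed next.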
Only O(|D|/2^k) data points on average, and some parent configurations may be even rarer
- Large domains |Val(Xi)| can also be a problem
- This is known as data fragmentation

PROOF (CONTINUED)
    L(q; D) = Prod_m P_G(d[m]; q)
            = Prod_m Prod_i P_G(xi[m] | pa_Xi[m]; q)
            = Prod_i [ Prod_m P_G(xi[m] | pa_Xi[m]; q) ]
- Prod_m P_G(xi[m] | pa_Xi[m]; q) is the likelihood of the local CPT of Xi: L(q_Xi; D)
- Each CPT depends on a disjoint set of parameters q_Xi => maximizing L(q; D) over all parameters q is equivalent to maximizing L(q_Xi; D) over each individual q_Xi

AN ALTERNATIVE APPROACH: BAYESIAN ESTIMATION
- P(q|d) = 1/Z P(d|q) P(q) is the posterior: the distribution over hypotheses given the data
- P(d|q) is the likelihood
- P(q) is the hypothesis prior
[Figure: plate model with parameter node q and observed data nodes d[1], d[2], ..., d[M]]

ASSUMPTION: UNIFORM PRIOR, BERNOULLI DISTRIBUTION
- Assume P(q) is uniform
- P(q|d) = 1/Z P(d|q) = 1/Z q^c (1-q)^(N-c)
- What's P(Y|D), the probability that the next draw Y is a cherry?
    P(Y|D) = Integral_0^1 P(Y|q) P(q|D) dq = Integral_0^1 q (1/Z) q^c (1-q)^(N-c) dq
- Using Integral_0^1 q^a (1-q)^b dq = a! b! / (a+b+1)!:
    => Z = c! (N-c)! / (N+1)!
    => P(Y|D) = (1/Z) (c+1)! (N-c)! / (N+2)! = (c+1) / (N+2)
- Can think of this as a "correction" to c/N using "virtual counts"

NONUNIFORM PRIORS
- P(q|d) is proportional to P(d|q) P(q) = q^c (1-q)^(N-c) P(q)
- Define, for all q, the probability P(q) that you believe in hypothesis q

BETA DISTRIBUTION
- Beta_{a,b}(q) = g q^(a-1) (1-q)^(b-1)
- a, b are hyperparameters > 0; g is a normalization constant
- a = b = 1 gives the uniform distribution

POSTERIOR WITH BETA PRIOR
- Posterior is proportional to q^c (1-q)^(N-c) Beta_{a,b}(q) = g q^(c+a-1) (1-q)^(N-c+b-1) = Beta_{a+c, b+N-c}(q)
- Prediction = posterior mean: E[q] = (c+a)/(N+a+b)
- What does this mean?
- The prior specifies "virtual counts" of a-1 heads and b-1 tails
- See heads, increment a; see tails, increment b
- The effect of the prior diminishes with more data

CHOOSING A PRIOR
- Part of the design process; must be chosen according to your intuition
- Uninformed belief => a = b = 1; strong belief => a, b high

EXTENSIONS OF BETA PRIORS
- Parameters of categorical distributions: Dirichlet prior
  - The mathematical expression is more complex, but in practice it still takes the form of "virtual counts"
- Mean and standard deviation of Gaussian distributions: Gamma prior
- Conjugate priors preserve the representation of the prior and posterior distributions, but they do not necessarily exist for general distributions

DIRICHLET PRIOR
- Categorical variable with |Val(X)| = k and P(X=i) = qi
- Parameter space q1, ..., qk with qi >= 0, Sum_i qi = 1
- Maximum likelihood estimate given counts c1, ..., ck in the data D: qi_ML = ci/N
- The Dirichlet prior Dirichlet(a1, ..., ak) is
    P(q1, ..., qk) = (1/Z) q1^(a1-1) × ... × qk^(ak-1)
- Its mean is (a1/aT, ..., ak/aT) with aT = Sum_i ai
- The posterior P(q|D) is Dirichlet(a1+c1, ..., ak+ck)

RECAP
- Learning => optimization problem (maximum likelihood)
- Learning => inference problem (Bayesian estimation)
- Learning the parameters of Bayesian networks
- Conjugate priors
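The virtual-count view of Beta priors above reduces Bayesian updating to simple bookkeeping. A minimal sketch, with assumed example counts (2 cherries in 5 draws); the function name is an illustrative choice:

```python
# Sketch (illustrative, not from the slides): the posterior for a Bernoulli
# parameter with a Beta(a,b) prior after seeing c successes in N draws is
# Beta(a+c, b+N-c), so the predictive mean is just arithmetic on counts.
def beta_posterior_mean(a, b, c, N):
    # E[q] under Beta(a+c, b+N-c), as derived in the slides
    return (c + a) / (N + a + b)

# Uniform prior (a = b = 1) reproduces the (c+1)/(N+2) rule
# (Laplace's rule of succession)
print(beta_posterior_mean(1, 1, 2, 5))    # (2+1)/(5+2) = 3/7
# A stronger prior (higher a, b) pulls the estimate toward a/(a+b) = 0.5
print(beta_posterior_mean(10, 10, 2, 5))  # 12/25 = 0.48
```

With more data (large N), both estimates converge to the ML answer c/N, matching the observation that the effect of the prior diminishes.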