CS B553: ALGORITHMS FOR OPTIMIZATION AND LEARNING
Parameter Learning (From data to distributions)

AGENDA
- Learning probability distributions from example data
- Generative vs. discriminative models
- Maximum likelihood estimation (MLE)
- Bayesian estimation

MOTIVATION
- Past lectures studied how to infer characteristics of a distribution, given a fully specified Bayes net
- Next few lectures: where does the Bayes net come from?
- Setting for this lecture:
  - Given a set of examples drawn from a distribution
  - Each example is complete (fully observable)
  - The BN structure is known, but the CPTs are unknown

DENSITY ESTIMATION
- Given a dataset D = {d[1], ..., d[M]} drawn from an underlying distribution P*
- Find a distribution that matches P* as closely as possible
- High-level issues:
  - There is usually not enough data to get an accurate picture of P*, which forces us to approximate
  - Even if we did have P*, how do we measure closeness? How do we maximize closeness?
- Two approaches: learning problems become either optimization problems or Bayesian inference problems

KULLBACK-LEIBLER DIVERGENCE
- Definition: given two probability distributions P and Q over X, the KL divergence (or relative entropy) from P to Q is
    D(P||Q) = E_{x~P}[ log(P(x)/Q(x)) ] = Sum_x P(x) log(P(x)/Q(x))
- Properties:
  - D(P||Q) >= 0
  - D(P||Q) = 0 iff P = Q "almost everywhere"
  - Not a true "metric" (it is non-symmetric)

APPLYING KL DIVERGENCE TO LEARNING
- Approach: given the underlying distribution P*, find P (within a class of distributions) so that the KL divergence is minimized:
    argmin_P D(P*||P) = argmin_P E_{x~P*}[ log P*(x) - log P(x) ]
                      = argmin_P ( E_{x~P*}[ log P*(x) ] - E_{x~P*}[ log P(x) ] )
                      = argmax_P E_{x~P*}[ log P(x) ]
- If we approximate P* with draws from D, we get
    argmax_P (1/|D|) Sum_i log P(d[i])
- Minimizing KL divergence to the empirical distribution is the same as maximizing the empirical log-likelihood

ANOTHER APPROACH: DISCRIMINATIVE LEARNING
Do we really want to model P*?
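The equivalence derived above (minimizing KL divergence to the empirical distribution equals maximizing the empirical log-likelihood) can be checked numerically. This sketch is not from the lecture; the dataset and variable names are made-up illustrations for a Bernoulli model.

```python
# Sketch (illustrative, not from the slides): verify numerically that the q
# minimizing KL divergence to the empirical distribution is the q maximizing
# the average log-likelihood. The data values are made up.
import math

data = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]      # 7 "cherry" draws out of 10
c, N = sum(data), len(data)
p_emp = {1: c / N, 0: (N - c) / N}          # empirical distribution

def kl_to_empirical(q):
    # D(P_emp || Q) where Q is Bernoulli(q)
    model = {1: q, 0: 1.0 - q}
    return sum(p * math.log(p / model[x]) for x, p in p_emp.items() if p > 0)

def avg_log_likelihood(q):
    model = {1: q, 0: 1.0 - q}
    return sum(math.log(model[x]) for x in data) / N

grid = [i / 1000 for i in range(1, 1000)]   # avoid q = 0 and q = 1
q_min_kl = min(grid, key=kl_to_empirical)
q_max_ll = max(grid, key=avg_log_likelihood)
assert q_min_kl == q_max_ll                 # same optimizer
print(q_min_kl)                             # 0.7, i.e. c/N
```

The two objectives differ only by the constant E_{x~P_emp}[log P_emp(x)], which does not depend on q, so they share the same optimizer.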
We may be more concerned with predicting the values of some subset of variables
- E.g., for a Bayes net CPT, we want P(Y|Pa_Y) but may not care about the distribution of Pa_Y
- Generative model: estimate P(X,Y)
- Discriminative model: estimate P(Y|X), ignore P(X)

TRAINING DISCRIMINATIVE MODELS
- Define a loss function l(y,x,P) that is given the ground truth y, x
- It measures the difference between the prediction P(Y|x) and the ground truth
- Examples:
  - Classification error: I[y != argmax_y' P(y'|x)]
  - Conditional log-likelihood: -log P(y|x)
- Strategy: minimize empirical loss

DISCRIMINATIVE VS. GENERATIVE
- Discriminative models:
  - Don't model the input distribution, so they may have more expressive power for the same level of complexity
  - May learn more accurate predictive models from a training dataset of the same size
  - Directly transcribe top-down evaluation of CPTs
- Generative models:
  - More flexible, because they don't require a priori selection of the dependent variable Y
  - Bottom-up inference is easier
- Both are useful in different situations

WHAT CLASS OF PROBABILITY MODELS?
- For small discrete distributions, just use a tabular representation
  - Very efficient learning techniques
- For large discrete distributions or continuous ones, the choice of probability model is crucial
- Increasing complexity =>
  - Can represent complex distributions more accurately
  - Need more data to learn well (risk of overfitting)
  - More expensive to learn and to perform inference

LEARNING COIN FLIPS
- Let the unknown fraction of cherries be q (hypothesis)
- Probability of drawing a cherry is q
- Suppose draws are independent and identically distributed (i.i.d.)
- Observe that c out of N draws are cherries (data)
- Intuition: c/N might be a good hypothesis (or it might not, depending on the draw!)
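The coin-flip intuition can be tested by evaluating the likelihood q^c (1-q)^(N-c) on a grid and locating its peak. This is a small sketch with assumed example counts (2 cherries in 5 draws), not code from the lecture.

```python
# Sketch (assumed counts, not from the slides): evaluate the likelihood
# L(q) = q^c * (1-q)^(N-c) on a grid of q values and locate its peak.
c, N = 2, 5                                  # 2 cherries observed in 5 draws

def likelihood(q):
    return q**c * (1.0 - q)**(N - c)

grid = [i / 1000 for i in range(1001)]
q_mle = max(grid, key=likelihood)
print(q_mle)                                 # 0.4, i.e. c/N
```

The grid peak lands on the empirical fraction c/N, matching the closed-form MLE derived below.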
MAXIMUM LIKELIHOOD
- Likelihood of data d = {d1, ..., dN} given q:
    P(d|q) = Prod_j P(dj|q)          (i.i.d. assumption)
           = q^c (1-q)^(N-c)         (gather the c cherry terms together, then the N-c lime terms)
[Figure: plots of the likelihood P(data|q) vs. q for observed counts of 1/1, 2/2, 2/3, 2/4, 2/5, 10/20, and 50/100 cherries]
- The peaks of the likelihood function seem to hover around the fraction of cherries...
- Sharpness indicates some notion of certainty...

MAXIMUM LIKELIHOOD
- Let P(d|q) be the likelihood function
- The quantity argmax_q P(d|q) is known as the maximum likelihood estimate (MLE)
- Take the log-likelihood:
    l(q) = log P(d|q) = log [ q^c (1-q)^(N-c) ]
         = log [ q^c ] + log [ (1-q)^(N-c) ]
         = c log q + (N-c) log (1-q)
- Setting dl/dq(q) = 0 gives the maximum likelihood estimate:
    dl/dq(q) = c/q - (N-c)/(1-q)
- At the MLE, c/q - (N-c)/(1-q) = 0 => q = c/N
- c and N are known as sufficient statistics for the parameter q: no other values give additional information about q

OTHER MLE RESULTS
- Categorical distributions (non-binary discrete variables): take the fraction of counts for each value (histogram)
- Continuous Gaussian distributions:
  - Mean = average of the data
  - Standard deviation = standard deviation of the data

MAXIMUM LIKELIHOOD FOR BN
- For any BN, the ML parameters of any CPT can be derived from the fraction of observed values in the data, conditioned on matching parent values
- Example: Earthquake/Burglar/Alarm network, N = 1000 samples
  - E observed 500 times => P(E) = 0.5
  - B observed 200 times => P(B) = 0.2
  - Alarm counts: A|E,B: 19/20; A|B: 188/200; A|E: 170/500; A|neither: 1/380
  - Resulting CPT:
      E  B  P(A|E,B)
      T  T  0.95
      F  T  0.95
      T  F  0.34
      F  F  0.003

PROOF
- Let the BN have structure G over variables X1, ..., Xn and parameters q
- Given dataset D:
    L(q; D) = Prod_m P_G(d[m]; q)
            = Prod_m Prod_i P_G(xi[m] | pa_Xi[m]; q)

FITTING CPTS
- Each ML entry P(xi | pa_Xi) is given by examining the counts of (xi, pa_Xi) in D and normalizing across the rows of the CPT
- Note that for large k = |Pa_Xi|, very few data points will share the values of pa_Xi!
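The count-and-normalize procedure for fitting a CPT can be sketched as follows. This is an illustration, not code from the lecture; the tiny dataset and the `p_a_given` helper are made up.

```python
# Sketch (illustrative, not from the slides): fit the CPT entry P(A | E, B)
# by counting matching records and normalizing. The dataset is made up.
from collections import Counter

# Each record is one complete observation: (e, b, a) as booleans
data = [
    (True,  True,  True), (True,  True,  True), (True,  True,  False),
    (True,  False, True), (True,  False, False), (True,  False, False),
    (False, False, False), (False, False, False), (False, False, True),
    (False, True,  True),
]

counts = Counter()            # counts of full configurations (e, b, a)
parent_counts = Counter()     # counts of parent configurations (e, b)
for e, b, a in data:
    counts[(e, b, a)] += 1
    parent_counts[(e, b)] += 1

def p_a_given(e, b):
    # ML estimate: fraction of records matching the parent values with A = true
    return counts[(e, b, True)] / parent_counts[(e, b)]

print(p_a_given(True, True))    # 2/3
print(p_a_given(True, False))   # 1/3
```

Note that with only 10 records, each parent configuration is already down to 2-3 matching data points, which previews the data-fragmentation problem discussed next.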
Only O(|D|/2^k) data points on average, and some parent configurations may be even rarer
- Large domains |Val(Xi)| can also be a problem
- This is known as data fragmentation

PROOF (CONTINUED)
    L(q; D) = Prod_m P_G(d[m]; q)
            = Prod_m Prod_i P_G(xi[m] | pa_Xi[m]; q)
            = Prod_i [ Prod_m P_G(xi[m] | pa_Xi[m]; q) ]
- Prod_m P_G(xi[m] | pa_Xi[m]; q) is the likelihood of the local CPT of Xi: L(q_Xi; D)
- Each CPT depends on a disjoint set of parameters q_Xi => maximizing L(q; D) over all parameters q is equivalent to maximizing L(q_Xi; D) over each individual q_Xi

AN ALTERNATIVE APPROACH: BAYESIAN ESTIMATION
- P(q|d) = 1/Z P(d|q) P(q) is the posterior: the distribution over hypotheses given the data
- P(d|q) is the likelihood
- P(q) is the hypothesis prior
[Figure: plate model with parameter node q and observed data nodes d[1], d[2], ..., d[M]]

ASSUMPTION: UNIFORM PRIOR, BERNOULLI DISTRIBUTION
- Assume P(q) is uniform
- P(q|d) = 1/Z P(d|q) = 1/Z q^c (1-q)^(N-c)
- What's P(Y|D), the probability that the next draw Y is a cherry?
    P(Y|D) = Integral_0^1 P(Y|q) P(q|D) dq = Integral_0^1 q (1/Z) q^c (1-q)^(N-c) dq
- Using Integral_0^1 q^a (1-q)^b dq = a! b! / (a+b+1)!:
    => Z = c! (N-c)! / (N+1)!
    => P(Y|D) = (1/Z) (c+1)! (N-c)! / (N+2)! = (c+1) / (N+2)
- Can think of this as a "correction" to c/N using "virtual counts"

NONUNIFORM PRIORS
- P(q|d) is proportional to P(d|q) P(q) = q^c (1-q)^(N-c) P(q)
- Define, for all q, the probability P(q) that you believe in hypothesis q

BETA DISTRIBUTION
- Beta_{a,b}(q) = g q^(a-1) (1-q)^(b-1)
- a, b are hyperparameters > 0; g is a normalization constant
- a = b = 1 gives the uniform distribution

POSTERIOR WITH BETA PRIOR
- Posterior is proportional to q^c (1-q)^(N-c) Beta_{a,b}(q) = g q^(c+a-1) (1-q)^(N-c+b-1) = Beta_{a+c, b+N-c}(q)
- Prediction = posterior mean: E[q] = (c+a)/(N+a+b)
- What does this mean?
- The prior specifies "virtual counts" of a-1 heads and b-1 tails
- See heads, increment a; see tails, increment b
- The effect of the prior diminishes with more data

CHOOSING A PRIOR
- Part of the design process; must be chosen according to your intuition
- Uninformed belief => a = b = 1; strong belief => a, b high

EXTENSIONS OF BETA PRIORS
- Parameters of categorical distributions: Dirichlet prior
  - The mathematical expression is more complex, but in practice it still takes the form of "virtual counts"
- Mean and standard deviation of Gaussian distributions: Gamma prior
- Conjugate priors preserve the representation of the prior and posterior distributions, but they do not necessarily exist for general distributions

DIRICHLET PRIOR
- Categorical variable with |Val(X)| = k and P(X=i) = qi
- Parameter space q1, ..., qk with qi >= 0, Sum_i qi = 1
- Maximum likelihood estimate given counts c1, ..., ck in the data D: qi_ML = ci/N
- The Dirichlet prior Dirichlet(a1, ..., ak) is
    P(q1, ..., qk) = (1/Z) q1^(a1-1) × ... × qk^(ak-1)
- Its mean is (a1/aT, ..., ak/aT) with aT = Sum_i ai
- The posterior P(q|D) is Dirichlet(a1+c1, ..., ak+ck)

RECAP
- Learning => optimization problem (maximum likelihood)
- Learning => inference problem (Bayesian estimation)
- Learning the parameters of Bayesian networks
- Conjugate priors
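The virtual-count view of Beta priors above reduces Bayesian updating to simple bookkeeping. A minimal sketch, with assumed example counts (2 cherries in 5 draws); the function name is an illustrative choice:

```python
# Sketch (illustrative, not from the slides): the posterior for a Bernoulli
# parameter with a Beta(a,b) prior after seeing c successes in N draws is
# Beta(a+c, b+N-c), so the predictive mean is just arithmetic on counts.
def beta_posterior_mean(a, b, c, N):
    # E[q] under Beta(a+c, b+N-c), as derived in the slides
    return (c + a) / (N + a + b)

# Uniform prior (a = b = 1) reproduces the (c+1)/(N+2) rule
# (Laplace's rule of succession)
print(beta_posterior_mean(1, 1, 2, 5))    # (2+1)/(5+2) = 3/7
# A stronger prior (higher a, b) pulls the estimate toward a/(a+b) = 0.5
print(beta_posterior_mean(10, 10, 2, 5))  # 12/25 = 0.48
```

With more data (large N), both estimates converge to the ML answer c/N, matching the observation that the effect of the prior diminishes.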