Download Powerpoint

PGM: Tirgul 10 Parameter Learning and Priors . Why learning? Knowledge acquisition bottleneck  Knowledge acquisition is an expensive process  Often we don’t have an expert Data is cheap  Vast amounts of data becomes available to us  Learning allows us to build systems based on the data 2 Learning Bayesian networks B E Data + Prior information Inducer R A C E B P(A | E,B) e b .9 .1 e b .7 .3 e b .8 .2 e b .99 .01 3 Known Structure -- Complete Data E, B, A <Y,N,N> <Y,Y,Y> <N,N,Y> <N,Y,Y> . . <N,Y,Y> Inducer E B P(A | E,B) e b ? ? e b ? ? e b ? ? e b ? ? B E A A E B P(A | E,B) e b .9 .1 e b .7 .3 e b .8 .2 e b .99 .01  Network  B E structure is specified Inducer needs to estimate parameters  Data does not contain missing values 4 Unknown Structure -- Complete Data E, B, A <Y,N,N> <Y,Y,Y> <N,N,Y> <N,Y,Y> . . <N,Y,Y> Inducer E B P(A | E,B) e b ? ? e b ? ? e b ? ? e b ? ? B E A A E B P(A | E,B) e b .9 .1 e b .7 .3 e b .8 .2 e b .99 .01  Network  B E structure is not specified Inducer needs to select arcs & estimate parameters  Data does not contain missing values 5 Known Structure -- Incomplete Data E, B, A <Y,N,N> <Y,?,Y> <N,N,Y> <N,Y,?> . . <?,Y,Y> E B P(A | E,B) e b ? ? e b ? ? e b ? ? e b ? ? B E Inducer A B E A E B P(A | E,B) e b .9 .1 e b .7 .3 e b .8 .2 e b .99 .01  Network structure is specified  Data contains missing values  We consider assignments to missing values 6 Known Structure / Complete Data  Given  a network structure G And choice of parametric family for P(Xi|Pai)  Learn parameters for network Goal  Construct a network that is “closest” to probability that generated the data 7 Example: Binomial Experiment (Statistics 101) Head Tail  When tossed, it can land in one of two positions: Head or Tail  We denote by  the (unknown) probability P(H). Estimation task:  Given a sequence of toss samples x[1], x[2], …, x[M] we want to estimate the probabilities P(H)=  and P(T) = 1 -  8 Statistical Parameter Fitting  Consider instances x[1], x[2], …, x[M] such that    The set of values that x can take is known Each is sampled from the same distribution Each sampled independently of the rest i.i.d. samples task is to find a parameter  so that the data can be summarized by a probability P(x[j]|  ).  The   Depends on the given family of probability distributions: multinomial, Gaussian, Poisson, etc. For now, focus on multinomial distributions 9 The Likelihood Function  How good is a particular ? It depends on how likely it is to generate the observed data L( : D )  P (D | )   P (x [m ] | ) m likelihood for the sequence H,T, T, H, H is L( :D)  The L( : D )    (1  )  (1  )     0 0.2 0.4  0.6 0.8 1 10 Sufficient Statistics  To compute the likelihood in the thumbtack example we only require NH and NT (the number of heads and the number of tails) L( : D )  NH  (1  )NT  NH and NT are sufficient statistics for the binomial distribution 11 Sufficient Statistics A sufficient statistic is a function of the data that summarizes the relevant information for the likelihood  Formally, s(D) is a sufficient statistics if for any two datasets D and D’  s(D) = s(D’ )  L( |D) = L( |D’) Datasets Statistics 12 Maximum Likelihood Estimation MLE Principle: Choose parameters that maximize the likelihood function  This is one of the most commonly used estimators in statistics  Intuitively appealing 13 Example: MLE in Binomial Data  Applying the MLE principle we get NH ˆ  NH  NT (Which coincides with what one would expect) (NH,NT ) = (3,2) MLE estimate is 3/5 = 0.6 L( :D) Example: 0 0.2 0.4 0.6 0.8 1 14 Learning Parameters for a Bayesian Network  Training data has the form: B E A  E [1] B [1] A[1] C [1]         D         E [ M ] B [ M ] A [ M ] C [ M ]   C 15 Learning Parameters for a Bayesian Network  Since we assume i.i.d. samples, likelihood function is L( : D )   P (E [m], B [m], A[m],C [m] : ) B E A C m 16 Learning Parameters for a Bayesian Network  By B E definition of network, we get A L( : D )   P (E [m ], B [m ], A[m ], C [m ] : ) C m P (E [m ] : )  m P (B [m ] : ) P (A[m ] | B [m ], E [m ] : ) P (C [m ] | A[m ] : )  E [1] B [1] A[1] C [1]                  E [ M ] B [ M ] A [ M ] C [ M ]   17 Learning Parameters for a Bayesian Network  Rewriting B E terms, we get A L( : D )   P (E [m ], B [m], A[m ], C [m ] : ) C m   P (E [m ] : ) m  P (B [m] : ) m  P (A[m] | B [m], E [m] : ) m  P (C [m] | A[m] : ) m  E [1] B [1] A[1] C [1]                  E [ M ] B [ M ] A [ M ] C [ M ]   18 General Bayesian Networks Generalizing for any Bayesian network: L( : D )   P (x 1 [m ], , xn [m ] : ) i.i.d. samples m    P (xi [m ] | Pai [m ] : i ) m i i m Network factorization    P (xi [m ] | Pai [m ] : i )   Li (i : D ) i  The likelihood decomposes according to the structure of the network. 19 General Bayesian Networks (Cont.) Decomposition  Independent Estimation Problems If the parameters for each family are not related, then they can be estimated independently of each other. 20 From Binomial to Multinomial example, suppose X can have the values 1,2,…,K  We want to learn the parameters  1,  2. …,  K Sufficient statistics:  N1, N2, …, NK - the number of times each outcome is observed K L( : D )   k Nk Likelihood function:  For k 1 MLE: Nk ˆ k   N  21 Likelihood for Multinomial Networks we assume that P(Xi | Pai ) is multinomial, we get further decomposition:  When Li (i : D )   P (xi [m ] | Pai [m ] : i ) m   P (x [m] | pa paii m ,Paii [ m ]  paii i i : i )   P (xi | pai : i )N ( xii , paii ) paii xii pai xi   xi |pai N ( xi , pai ) 22 Likelihood for Multinomial Networks we assume that P(Xi | Pai ) is multinomial, we get further decomposition:  When Li (i : D )   xi |pai pai N ( xi , pai ) xi each value pai of the parents of Xi we get an independent multinomial problem  For N (xi , pai ) ˆ  The MLE is xi |pai  N ( pai ) 23 Maximum Likelihood Estimation Consistency  Estimate converges to best possible value as the number of examples grow  To make this formal, we need to introduce some definitions 24 KL-Divergence P and Q be two distributions over X  A measure of distance between P and Q is the Kullback-Leibler Divergence  Let P (x ) KL(P ||Q )   P (x ) log Q (x ) x  KL(P||Q) = 1 (when logs are in base 2) =  The probability P assigns to an instance is, on average, half the probability Q assigns to it  KL(P||Q)  0  KL(P||Q) = 0 iff are P and Q equal 25 Consistency  Let  P(X| ) be a parametric family We need to make various regularity condition we won’t go into now P*(X) be the distribution that generates the data ˆD  Let be the MLE estimate given a dataset D  Let Thm  As N , ˆ   * D where  *  arg min KL(P * (X ) || P (X :  ) with probability 1 26 Consistency -- Geometric Interpretation Distributions that can represented by P(X| ) P* P(X| * ) Space of probability distribution 27 Is MLE all we need?  Suppose   that after 10 observations, ML estimates P(H) = 0.7 for the thumbtack Would you bet on heads for the next toss? Suppose now that after 10 observations, ML estimates P(H) = 0.7 for a coin Would you place the same bet? 28 Bayesian Inference Frequentist Approach: Assumes there is an unknown but fixed parameter   Estimates  with some confidence  Prediction by using the estimated parameter value  Bayesian Approach: Represents uncertainty about the unknown parameter  Uses probability to quantify this uncertainty:  Unknown parameters as random variables  Prediction follows from the rules of probability:  Expectation over the unknown parameters  29 Bayesian Inference (cont.)  We can represent our uncertainty about the sampling process using a Bayesian network  X[1] X[2] X[m] Observed data X[m+1] Query values of X are independent given   The conditional probabilities, P(x[m] |  ), are the parameters in the model  Prediction is now inference in this network  The 30 Bayesian Inference (cont.) Prediction as inference in this network P (x [M  1] | x [1], , x [M ]) X[1]  X[2] X[m] X[m+1]   P (x [M  1] | , x [1], , x [M ])P ( | x [1], , x [M ])d   P (x [M  1] | )P ( | x [1], , x [M ])d where Likelihood Prior P (x [1],  x [M ] | )P () P ( | x [1],  x [M ])  P (x [1],  x [M ]) Posterior Probability of data 31 Example: Binomial Data Revisited uniform for  in [0,1]  P( ) = 1  Then P( |D) is proportional to the likelihood L( :D)  Prior: P ( | x [1],x [M])  P (x [1],x [M] |  )  P ( ) (NH,NT ) = (4,1)  MLE for P(X = H ) is 4/5 = 0.8  Bayesian prediction is 0 P (x [M  1]  H | D )    P ( | D )d  0.2 0.4 5  0.7142  7 0.6 0.8 1 32 Bayesian Inference and MLE  In our example, MLE and Bayesian prediction differ But… If prior is well-behaved  Does not assign 0 density to any “feasible” parameter value Then: both MLE and Bayesian prediction converge to the same value  Both are consistent 33 Dirichlet Priors  Recall that the likelihood function is K L (  : D )    k Nk k 1 Dirichlet prior with hyperparameters 1,…,K is defined as A K P ()   k k 1 for legal  1,…,  K k 1 Then the posterior has the same form, with hyperparameters 1+N 1,…,K +N K K K k 1 k 1 K P ( | D )  P ()P (D | )   k k 1  k Nk   k k Nk 1 k 1 34 Dirichlet Priors (cont.)  We can compute the prediction on a new event in closed form:  If P() is Dirichlet with hyperparameters 1,…,K then k P (X [1]  k )   k P ()d      Since the posterior is also Dirichlet, we get  k  Nk P (X [M  1]  k | D )   k P ( | D )d   (   N  )  35 Dirichlet Priors -- Example 5 Dirichlet(1,1) Dirichlet(2,2) Dirichlet(0.5,0.5) Dirichlet(5,5) 4.5 4 3.5 3 2.5 2 1.5 1 0.5 0 0 0.2 0.4 0.6 0.8 1 36 Prior Knowledge hyperparameters 1,…,K can be thought of as “imaginary” counts from our prior experience  The  Equivalent sample size = 1+…+K  The larger the equivalent sample size the more confident we are in our prior 37 Effect of Priors Prediction of P(X=H ) after seeing data with NH = 0.25•NT for different sample sizes 0.55 0.6 0.5 Different strength H + T Fixed ratio H / T 0.45 0.5 0.4 0.4 0.35 Fixed strength H + T Different ratio H / T 0.3 0.3 0.2 0.25 0.1 0.2 0.15 0 20 40 60 80 100 0 0 20 40 60 80 100 38 Effect of Priors (cont.)  In real data, Bayesian estimates are less sensitive to noise in the data P(X = 1|D) 0.7 MLE Dirichlet(.5,.5) Dirichlet(1,1) Dirichlet(5,5) Dirichlet(10,10) 0.6 0.5 0.4 0.3 0.2 0.1 N 5 10 15 20 25 30 35 40 45 50 1 Toss Result 0 N 39 Conjugate Families  The property that the posterior distribution follows the same parametric form as the prior distribution is called conjugacy  Dirichlet prior is a conjugate family for the multinomial likelihood  Conjugate families are useful since:  For many distributions we can represent them with hyperparameters  They allow for sequential update within the same representation  In many cases we have closed-form solution for prediction 40 Bayesian Networks and Bayesian Prediction Y|X X X m X[1] X[2] X[M] X[M+1] Y[1] Y[2] Y[M] Y[M+1] X[m] Y|X Y[m] Plate notation Observed data Query  Priors for each parameter group are independent  Data instances are independent given the unknown parameters 41 Bayesian Networks and Bayesian Prediction (Cont.) Y|X X X m X[1] X[2] X[M] X[M+1] Y[1] Y[2] Y[M] Y[M+1] X[m] Y|X Y[m] Plate notation Observed data Query  We can also “read” from the network: Complete data  posteriors on parameters are independent 42 Bayesian Prediction(cont.)  Since posteriors on parameters for each family are independent, we can compute them separately  Posteriors for parameters within families are also independent: X m X[m] Y[m] Y|X X Refined model m X[m] Y|X=0 Y|X=1 Y[m] data  independent posteriors on Y|X=0 and  Y|X=1  Complete 43 Bayesian Prediction(cont.)  Given these observations, we can compute the posterior for each multinomial  Xi | pai independently  The posterior is Dirichlet with parameters (Xi=1|pai)+N (Xi=1|pai),…, (Xi=k|pai)+N (Xi=k|pai)  The predictive distribution is then represented by the parameters ~  x |pa i i  (xi , pai )  N (xi , pai )   ( pai )  N ( pai ) 44 Assessing Priors for Bayesian Networks We need the(xi,pai) for each node xj can use initial parameters 0 as prior information  Need also an equivalent sample size parameter M0  Then, we let (xi,pai) = M0P(xi,pai|0)  We  This allows to update a network using new data 45 Learning Parameters: Case Study (cont.) Experiment:  Sample a stream of instances from the alarm network  Learn parameters using  MLE estimator  Bayesian estimator with uniform prior with different strengths 46 Learning Parameters: Case Study (cont.) MLE Bayes w/ Uniform Prior, M'=5 Bayes w/ Uniform Prior, M'=10 Bayes w/ Uniform Prior, M'=20 Bayes w/ Uniform Prior, M'=50 1.4 KL Divergence 1.2 1 0.8 0.6 0.4 0.2 0 0 500 1000 1500 2000 2500 3000 3500 4000 4500 5000 M 47 Learning Parameters: Summary  Estimation relies on sufficient statistics  For multinomial these are of the form N (xi,pai)  Parameter estimation N (xi , pai ) ˆ x |pa  i i N ( pai ) MLE (xi , pai )  N (xi , pai ) ~ x |pa  i i ( pai )  N ( pai ) Bayesian (Dirichlet)  Bayesian methods also require choice of priors  Both MLE and Bayesian are asymptotically equivalent and consistent  Both can be implemented in an on-line manner by accumulating sufficient statistics 48

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Powerpoint