Efficient Learning in High Dimensions with Trees and Mixtures
Marina Meila
Carnegie Mellon University

Multidimensional data
· Multidimensional (noisy) data
· Learning tasks - intelligent data analysis
  · categorization (clustering)
  · classification
  · novelty detection
  · probabilistic reasoning
· Data is changing and growing; tasks change
  => need to make learning automatic and efficient

Combining probability and algorithms
· Automatic -> probability and statistics
· Efficient -> algorithms
· This talk: the tree statistical model

Talk overview
· Introduction: statistical models
· Perspective: generative models and decision tasks
· The tree model
· Mixtures of trees
· Learning
· Experiments
· Accelerated learning
· Bayesian learning

A multivariate domain
· Data: records (Patient1, Patient2, ...) over the variables Smoker, Bronchitis, Lung cancer, Cough, X ray
· Statistical model
· Queries
  · diagnose a new patient: Lung cancer? given Smoker, Cough, X ray
  · is smoking related to lung cancer?
  · understand the "laws" of the domain

Probabilistic approach
· Smoker, Bronchitis, ... are (discrete) random variables
· Statistical model = joint distribution
  P(Smoker, Bronchitis, Lung cancer, Cough, X ray)
  summarizes knowledge about the domain
· Queries
  · inference, e.g. P(Lung cancer = true | Smoker = true, Cough = false)
  · structure of the model
    · discovering relationships
    · categorization

Probability table representation

  v1 v2      00    01    11    10
  v3 = 0    .01   .14   .22   .01
  v3 = 1    .23   .03   .33   .03

· Query: P(v1=0 | v2=1) = P(v1=0, v2=1) / P(v2=1) = (.14 + .03) / (.14 + .03 + .22 + .33) ≈ .24
· Curse of dimensionality: if v1, v2, ..., vn are binary variables, P(v1, v2, ..., vn) is a table with 2^n entries!
  · How to represent?
  · How to query?
  · How to learn from data?
  · Structure?

Graphical models
· Structure
  · vertices = variables
  · edges = "direct dependencies"
· Parametrization
  · by local probability tables
[Figure: example graphical model over Galaxy type, spectrum, dust, distance, size, Z (redshift), observed spectrum, observed size, photometric measurement]
· compact parametric representation
· efficient computation
· learning parameters by a simple formula
· learning structure is NP-hard

The tree statistical model
· Structure: a tree (graph with no cycles)
· Parameters: probability tables associated to the edges
[Figure: a 5-node example tree with tables such as T_3, T_34 and, in the directed view, T_{4|3}]
· T(x) = ∏_{uv∈E} T_uv(x_u, x_v) / ∏_{v∈V} T_v(x_v)^(deg v - 1)
· equivalently (directed form): T(x) = ∏_{uv∈E} T_{v|u}(x_v | x_u)
· T(x) factors over the tree edges

Examples
· Splice junction domain: junction type and sequence positions -7 ... +8
· Premature babies' Broncho-Pulmonary Disease (BPD): PulmHemorrh, Coag, HyperNa, Acidosis, Gestation, Thrombocyt, Weight, Hypertension, Temperature, BPD, Neutropenia, Suspect, Lipid

Trees - basic operations
  T(x) = ∏_{uv∈E} T_uv(x_u, x_v) / ∏_{v∈V} T_v(x_v)^(deg v - 1),   |V| = n
· Querying the model
  · computing the likelihood T(x) ~ n
  · conditioning T_{V-A|A} (junction tree algorithm) ~ n
  · marginalization T_uv for arbitrary u, v ~ n
  · sampling ~ n
· Estimating the model
  · fitting to a given distribution ~ n^2
  · learning from data ~ n^2 N_data
· The tree is a simple model

The mixture of trees
  Q(x) = Σ_{k=1}^{m} λ_k T^k(x)
  h = "hidden" variable, P(h = k) = λ_k, k = 1, 2, ..., m
· NOT a graphical model
· computational efficiency preserved (Meila '97)
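A minimal Python sketch (not from the talk) of the two formulas above: the tree factorization T(x) = ∏_{uv∈E} T_uv(x_u, x_v) / ∏_v T_v(x_v)^(deg v - 1) and the mixture Q(x) = Σ_k λ_k T^k(x). The names `edges`, `pair_marg`, `node_marg` and the tiny 3-variable chain are invented for illustration.

```python
import numpy as np

# Sketch: evaluate a tree-factored distribution from its edge and node marginal
# tables, and a mixture of such trees. All names and numbers are illustrative.

def tree_prob(x, edges, pair_marg, node_marg):
    """Probability of configuration x under a tree with the given marginal tables."""
    deg = {}
    p = 1.0
    for u, v in edges:
        p *= pair_marg[(u, v)][x[u], x[v]]       # factor T_uv(x_u, x_v)
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    for v, d in deg.items():
        p /= node_marg[v][x[v]] ** (d - 1)       # divide by T_v(x_v)^(deg v - 1)
    return p

def mixture_prob(x, trees, weights):
    """Mixture of trees: Q(x) = sum_k lambda_k T^k(x)."""
    return sum(w * tree_prob(x, *t) for w, t in zip(weights, trees))

# Toy example: a chain 0 - 1 - 2 over binary variables, with consistent marginals.
T01 = np.array([[0.30, 0.20], [0.10, 0.40]])     # joint table of (x0, x1)
T12 = np.array([[0.25, 0.15], [0.20, 0.40]])     # joint table of (x1, x2)
node = {0: T01.sum(1), 1: T01.sum(0), 2: T12.sum(0)}
x = {0: 1, 1: 0, 2: 1}
print(tree_prob(x, [(0, 1), (1, 2)], {(0, 1): T01, (1, 2): T12}, node))
# a one-component mixture gives the same value
print(mixture_prob(x, [([(0, 1), (1, 2)], {(0, 1): T01, (1, 2): T12}, node)], [1.0]))
```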
Learning - problem formulation
· Maximum Likelihood learning
  · given a data set D = { x^1, ..., x^N }
  · find the model that best predicts the data: T^opt = argmax_T T(D)
· Fitting a tree to a distribution
  · given a data set D = { x^1, ..., x^N } and a distribution P that weights each data point
  · find T^opt = argmin_T KL(P || T)
  · KL is the Kullback-Leibler divergence
  · includes Maximum Likelihood learning as a special case

Fitting a tree to a distribution (Chow & Liu '68)
  T^opt = argmin_T KL(P || T)
· optimization over structure + parameters
· sufficient statistics
  · probability tables P_uv = N_uv / N for all u, v
  · mutual informations I_uv = Σ_{x_u, x_v} P_uv(x_u, x_v) log [ P_uv(x_u, x_v) / (P_u(x_u) P_v(x_v)) ]

Fitting a tree to a distribution - solution
· Structure: E^opt = argmax_E Σ_{uv∈E} I_uv
  · found by the Maximum Weight Spanning Tree algorithm with edge weights I_uv
· Parameters: copy the marginals of P, i.e. T_uv = P_uv for uv ∈ E
(A small code sketch of this procedure follows the Remarks below.)

Learning mixtures by the EM algorithm (Meila & Jordan '97)
· E step: which x^i come from T^k?  ->  distribution P^k(x)
· M step: fit T^k to its set of points:  min KL(P^k || T^k)
· initialize randomly
· converges to a local maximum of the likelihood

Remarks
· Learning a tree
  · the solution is globally optimal over structures and parameters
  · tractable: running time ~ n^2 N
· Learning a mixture by the EM algorithm
  · both the E and M steps are exact and tractable
  · running time: E step ~ mnN, M step ~ mn^2 N
  · assumes m known
  · converges to a local optimum
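A compact Python sketch of the Chow & Liu procedure above (not the talk's code): estimate pairwise mutual informations from a toy binary data set, then keep the edges of a maximum weight spanning tree. The function names and the toy data are invented, and a naive O(n^3) Prim step stands in for a tuned MWST implementation.

```python
import numpy as np

# Sketch of the Chow & Liu fit: empirical mutual informations + maximum weight
# spanning tree. Dense binary data only; names and data are made up for illustration.

def mutual_information_matrix(X, eps=1e-12):
    """I[u, v] between binary columns u and v of the N x n data matrix X."""
    N, n = X.shape
    I = np.zeros((n, n))
    for u in range(n):
        for v in range(u + 1, n):
            # 2x2 joint frequency table P_uv and its marginals P_u, P_v
            Puv = np.histogram2d(X[:, u], X[:, v], bins=2,
                                 range=[[-0.5, 1.5], [-0.5, 1.5]])[0] / N
            Pu = Puv.sum(axis=1, keepdims=True)
            Pv = Puv.sum(axis=0, keepdims=True)
            terms = Puv * np.log(np.maximum(Puv, eps) / np.maximum(Pu * Pv, eps))
            I[u, v] = I[v, u] = terms.sum()
    return I

def max_weight_spanning_tree(W):
    """Naive Prim's algorithm on the symmetric weight matrix W; returns tree edges."""
    n = W.shape[0]
    in_tree, edges = [0], []
    while len(in_tree) < n:
        best = None
        for u in in_tree:
            for v in range(n):
                if v not in in_tree and (best is None or W[u, v] > W[best]):
                    best = (u, v)
        edges.append(best)
        in_tree.append(best[1])
    return edges

# Toy usage: 500 samples of 6 binary variables, with variable 1 copied from variable 0.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 6))
X[:, 1] = X[:, 0]
I = mutual_information_matrix(X)
print(max_weight_spanning_tree(I))   # tree structure; the parameters are then T_uv = P_uv
```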
Finding structure - the bars problem
[Figure: bars data, n = 25, and the learned structure]
· Structure recovery: 19 out of 20 trials
· Hidden variable accuracy: 0.85 +/- 0.08 (ambiguous), 0.95 +/- 0.01 (unambiguous)
· Data likelihood [bits/data point]: true model 8.58, learned model 9.82 +/- 0.95

Experiments - density estimation
· Digits and digit pairs: N_train = 6000, N_valid = 2000, N_test = 5000
  · digits: n = 64 variables (m = 16 trees)
  · digit pairs: n = 128 variables (m = 32 trees)

DNA splice junction classification
· n = 61 variables
· class = Intron/Exon, Exon/Intron, Neither
[Figure: classification results for supervised methods (DELVE), Tree, TANB, NB]

Discovering structure
· Tree adjacency matrix vs. known biology (Watson, "The molecular biology of the gene", '87)

  IE junction (Intron -> Exon)
  position   15   16  ...  25   26   27   28   29   30   31
  Tree       -    CT  ...  CT   CT   -    CT   A    G    G
  True       CT   CT  ...  CT   CT   -    CT   A    G    G

  EI junction (Exon -> Intron)
  position   28   29   30   31   32   33   34   35   36
  Tree       CA   A    G    G    T    AG   A    G
  True       CA   A    G    G    T    AG   A    G    T

Irrelevant variables
· 61 original variables + 60 "noise" variables
[Figure: learned structure on the original vs. the augmented variable set]

Accelerated tree learning (Meila '99)
· Running time of the tree learning algorithm ~ n^2 N
· Quadratic running time may be too slow. Example: document classification
  · document = data point --> N = 10^3 - 10^4
  · word = variable --> n = 10^3 - 10^4
  · sparse data --> s = # words in a document, with s << n, N
· Can sparsity be exploited to create faster algorithms?

Sparsity
· assume a special value "0" that occurs frequently
· sparsity s = # non-zero variables in each data point, s << n, N
· Idea: "do not represent / count zeros"
  · store each data point as a linked list of its non-zero variables (length s)

Presort mutual informations
· Theorem (Meila '99): if v, v' are variables that do not co-occur with u (i.e. N_uv = N_uv' = 0), then N_v > N_v' ==> I_uv > I_uv'
· Consequences
  · sort the N_v => all edges uv with N_uv = 0 are implicitly sorted by I_uv
  · these edges need not be represented explicitly
  · construct a "black box" that outputs the next largest edge

The black box data structure
· for each variable v: a list of the u with N_uv > 0, sorted by I_uv, plus a (virtual) list of the u with N_uv = 0, sorted by N_v
· an F-heap of size ~ n delivers the next edge uv
· Total running time: n log n + s^2 N + nK log n   (standard algorithm: n^2 N)

Experiments - sparse binary data
· N = 10,000; s = 5, 10, 15, 100
[Figure: running time, standard vs. accelerated algorithm]

Remarks
· realistic assumption
· exact algorithm, provably efficient time bounds
· degrades slowly to the standard algorithm if the data is not sparse
· general: non-integer counts, multi-valued discrete variables

Bayesian learning of trees (Meila & Jaakkola '00)
· Problem
  · given a prior distribution over trees P_0(T) and data D = { x^1, ..., x^N }
  · find the posterior distribution P(T|D)
· Advantages
  · incorporates prior knowledge
  · regularization
· Solution: Bayes' formula
  P(T|D) = (1/Z) P_0(T) ∏_{i=1..N} T(x^i)
· practically hard
  · the distribution over structure E and parameters θ_E is hard to represent
  · computing Z is intractable in general
  · exception: conjugate priors

Decomposable priors
· want priors that factor over the tree edges: P_0(T) = ∏_{uv∈E} f(u, v, θ_{u|v})
· prior for structure E:  P_0(E) ∝ ∏_{uv∈E} β_uv
· prior for tree parameters:  P_0(θ_E) = ∏_{uv∈E} D(θ_{u|v}; N'_uv)
  · (hyper-)Dirichlet with hyper-parameters N'_uv(x_u, x_v) for all u, v ∈ V
· the posterior is also Dirichlet, with hyper-parameters N_uv(x_u, x_v) + N'_uv(x_u, x_v) for all u, v ∈ V

Decomposable posterior
· Posterior distribution: P(T|D) ∝ ∏_{uv∈E} W_uv, with W_uv = β_uv D(θ_{u|v}; N'_uv + N_uv)
  · factored over edges, same form as the prior
· Remains to compute the normalization constant
  · discrete part: graph theory; continuous part: Meila & Jaakkola '99

The Matrix Tree theorem
· If P_0(E) = (1/Z) ∏_{uv∈E} β_uv with β_uv ≥ 0, and M(β) is the matrix with off-diagonal entries M_uv = -β_uv and diagonal entries M_vv = Σ_{v'} β_vv', with the row and column of one vertex removed,
· then Z = det M(β)
  (a small numerical check of this identity appears at the end of these notes)

Remarks on the decomposable prior
· it is a conjugate prior for the tree distribution
· it is tractable
  · defined by ~ n^2 parameters
  · computed exactly in ~ n^3 operations
  · posterior obtained in ~ n^2 N + n^3 operations
  · derivatives w.r.t. parameters, averaging, ... ~ n^3
· Mixtures of trees with decomposable priors
  · MAP estimation with the EM algorithm is tractable
· Other applications
  · ensembles of trees
  · maximum entropy distributions on trees

So far ...
· Trees and mixtures of trees are structured statistical models
· Algorithmic techniques enable efficient learning
  · mixture of trees
  · accelerated algorithm
  · matrix tree theorem & Bayesian learning
· Examples of usage
  · structure learning
  · compression
  · classification

Generative models and discrimination
· Trees are generative models
  · descriptive
  · can perform many tasks, suboptimally
· Maximum Entropy discrimination (Jaakkola, Meila, Jebara '99)
  · optimize for specific tasks
  · use generative models
  · combine simple models into ensembles
  · complexity control by an information theoretic principle
· Discrimination tasks
  · detecting novelty
  · diagnosis
  · classification

Bridging the gap
[Figure: tasks linked to descriptive learning and discriminative learning]

Future ...
· Tasks have structure
  · multi-way classification
  · multiple indexing of documents
  · gene expression data
  · hierarchical, sequential decisions
· Learn structured decision tasks
  · sharing information between tasks (transfer)
  · modeling dependencies between decisions
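As referenced on the Matrix Tree theorem slide, here is a small self-contained numerical check (not part of the original talk): for a made-up 4-node symmetric weight matrix β, the brute-force sum over spanning trees of ∏_{uv∈E} β_uv matches the determinant of the reduced Laplacian built from β.

```python
import numpy as np
from itertools import combinations

# Numerical check of the weighted Matrix Tree theorem (sketch; the 4-node weight
# matrix `beta` is made up). Z = sum over spanning trees E of prod_{uv in E} beta_uv
# should equal the determinant of the reduced Laplacian M(beta).

n = 4
rng = np.random.default_rng(0)
beta = np.triu(rng.uniform(0.1, 1.0, size=(n, n)), 1)
beta = beta + beta.T                                   # symmetric weights, zero diagonal

def is_spanning_tree(edges):
    """True if the n-1 given edges connect all n vertices without a cycle."""
    parent = list(range(n))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru == rv:
            return False                               # cycle found
        parent[ru] = rv
    return True

all_edges = list(combinations(range(n), 2))
Z_brute = sum(np.prod([beta[u, v] for u, v in tree])
              for tree in combinations(all_edges, n - 1) if is_spanning_tree(tree))

# Matrix Tree theorem: drop the row and column of vertex 0 from the Laplacian.
L = np.diag(beta.sum(axis=1)) - beta
Z_det = np.linalg.det(L[1:, 1:])

print(Z_brute, Z_det)                                  # the two values agree
```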