COMP 538 Introduction to Bayesian Networks
Lecture 16: Wrap-Up
Nevin L. Zhang, HKUST

Slide 2: Recap

- Latent class models
  - Clustering
  - Clustering criterion: conditional independence
  - Drawback: assumption too strong
- Hierarchical latent class (HLC) models
  - Identifiability issues: regularity, equivalence
  - Hill climbing algorithm

Slide 3: Today

- Phylogenetic (evolution) trees
  - Closely related to HLC models
  - An example of viewing existing models in the framework of BN
    - Another example: HMM
  - Interesting because
    - It eases understanding
    - Techniques in one field can be applied to another
      - Structural EM for phylogenetic trees
      - Dynamic BNs for speech understanding
    - It encourages development of general-purpose algorithms
- Bayesian networks for classification
  - Hand waving only

Slide 4: Phylogenetic Tree Outline

- Introduction to phylogenetic trees
- Probabilistic models of evolution
- Tree reconstruction

Slide 5: Phylogenetic Trees

- Assumption
  - All organisms on Earth have a common ancestor.
  - This implies that any set of species is related.
- Phylogeny
  - The relationship between any set of species.
- Phylogenetic tree
  - Usually, the relationship can be represented by a tree, which is called a phylogenetic (evolution) tree.
    - This is not always true.

Slide 6: Phylogenetic Trees

[Figure: phylogenetic tree drawn along a time axis. Leaves: giant panda, lesser panda, moose, goshawk, duck, vulture, alligator. Current-day species are at the bottom.]

Slide 7: Phylogenetic Trees

- Taxa (sequences) identify species.
- Edge lengths represent evolution time.
- Assumption: bifurcating tree topology.

[Figure: tree drawn along a time axis with nodes labeled by the sequences AAGGCCT, AAGACTT, AGCACTT, AAGGCAT, AGGGCAT, AGCACAA, TAGACTT, TAGCCCA, AGCGCTT.]

Slide 8: Probabilistic Models of Evolution

- Characterize the relationship between taxa using substitution probabilities:
  - P(x | y, t): probability that ancestral sequence y evolves into sequence x along an edge of length t
  - P(X7), P(X5 | X7, t5), P(X6 | X7, t6), P(S1 | X5, t1), P(S2 | X5, t2), ….

[Figure: tree with nodes x5, x6, x7, leaves s1, s2, s3, s4, and edge lengths t1 to t6.]

Slide 9: Probabilistic Models of Evolution

- What should P(x | y, t) be?
- Two assumptions of commonly used models:
  - There are only substitutions, no insertions or deletions (the sequences are aligned).
    - One-to-one correspondence between sites in different sequences.
  - Each site evolves independently and identically:

    $P(x \mid y, t) = \prod_{i=1}^{m} P(x(i) \mid y(i), t)$

    where m is the sequence length.

[Figure: tree over the sequences AAGGCCT, AAGACTT, AGCACTT, AAGGCAT, AGGGCAT, TAGACTT, TAGCCCA, AGCACAA, AGCGCTT.]

Slide 10: Probabilistic Models of Evolution

- What should P(x(i) | y(i), t) be?
- Jukes-Cantor (character evolution) model [1969]
  - Rate of substitution a (constant or parameter?)

         A     C     G     T
    A   r_t   s_t   s_t   s_t
    C   s_t   r_t   s_t   s_t
    G   s_t   s_t   r_t   s_t
    T   s_t   s_t   s_t   r_t

    $r_t = \frac{1}{4}\left(1 + 3e^{-4at}\right)$

    $s_t = \frac{1}{4}\left(1 - e^{-4at}\right)$

  - Limit values when t = 0 or t = infinity?
  - Multiplicativity (lack of memory):

    $P(c \mid a, t_1 + t_2) = \sum_{b} P(b \mid a, t_1)\, P(c \mid b, t_2)$

Slide 11: Tree Reconstruction

- Given: a collection of current-day taxa
  - AGGGCAT, TAGCCCA, TAGACTT, AGCACAA, AGCGCTT
- Find: a tree
  - Tree topology: T
  - Edge lengths: t
- Maximum likelihood
  - Find the tree that maximizes P(data | tree).

[Figure: the five taxa AGGGCAT, AGCACAA, TAGACTT, TAGCCCA, AGCGCTT.]

Slide 12: Tree Reconstruction

- When restricted to one particular site, a phylogenetic tree is an HLC model where
  - The structure is a binary tree and all variables share the same state space.
  - The conditional probabilities come from the character evolution model, parameterized by edge lengths instead of the usual parameterization.
  - The model is the same for different sites.

[Figure: tree over the sequences AAGGCCT, AAGACTT, AGCACTT, AGGGCAT, AGCACAA, TAGACTT, TAGCCCA, AGCGCTT.]

Slide 13: Tree Reconstruction

- Current-day taxa: AGGGCAT, TAGCCCA, TAGACTT, AGCACAA, AGCGCTT
- Samples for the HLC model
  - One sample per site. The samples are i.i.d.
  - 1st site: (A, T, T, A, A)
  - 2nd site: (G, A, A, G, G)
  - 3rd site: (G, G, G, C, C)
  - …

[Figure: tree over the sequences AAGGCCT, AAGACTT, AGCACTT, AGGGCAT, AGCACAA, TAGACTT, TAGCCCA, AGCGCTT.]

Slide 14: Tree Reconstruction

- Finding the ML phylogenetic tree is equivalent to finding the ML HLC model.
- Model space:
  - Model structures: binary trees where all variables share the same state space, which is known.
  - Parameterization: one parameter for each edge. (In general, P(x | y) has (|x| - 1)|y| free parameters.)

Slide 15: Bayesian Networks for Classification

- The problem:
  - Given data:

    A1   A2   …    An   C
    0    1    1    0    T
    1    0    1    1    F
    ..   ..   ..   ..   ..

  - Find a mapping
    - (A1, A2, …, An) -> C
- Possible solutions
  - ANN
  - Decision tree (Quinlan)
  - …

Slide 16: Bayesian Networks for Classification

- Naïve Bayes model
  - From data, learn
    - P(C), P(Ai | C)
  - Classification
    - arg max_c P(C = c | A1 = a1, …, An = an)
  - Very good in practice

Slide 17: Bayesian Networks for Classification

- Drawback of NB:
  - It assumes the attributes are mutually independent given the class variable.
  - The assumption is often violated, leading to double counting.
- Fixes:
  - General BN classifiers
  - Tree augmented Naïve Bayes (TAN) models
  - Hierarchical NB
  - …

Slide 18: Bayesian Networks for Classification

- General BN classifier
  - Treat the class variable as just another variable.
  - Learn a BN.
  - Classify the next instance based on the values of the variables in the Markov blanket of the class variable.
  - Pretty bad, because it does not utilize all available information.

Slide 19: Bayesian Networks for Classification

- TAN model
  - Friedman, N., Geiger, D., and Goldszmidt, M. (1997). Bayesian network classifiers. Machine Learning, 29:131-163.
  - Capture dependence among attributes using a tree structure.
  - During learning,
    - First learn a tree among the attributes, using the Chow-Liu algorithm.
    - Then add the class variable and estimate the parameters.
  - Classification
    - arg max_c P(C = c | A1 = a1, …, An = an)

Slide 20: Bayesian Networks for Classification

- Hierarchical Naïve Bayes models
  - N. L. Zhang, T. D. Nielsen, and F. V. Jensen (2002). Latent variable discovery in classification models. Artificial Intelligence in Medicine, to appear.
  - Capture dependence among attributes using latent variables.
  - Detect interesting latent structures besides classification.
  - Currently, slow.
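
The Naïve Bayes recipe from the classification slides (learn P(C) and P(Ai | C) from data, then classify with arg max_c P(C = c | A1 = a1, …, An = an)) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not code from the lecture: the names `fit_nb`, `posterior_nb`, and `predict_nb` are hypothetical, the attributes are assumed binary, Laplace smoothing with constant `alpha` is an added choice, and the toy data set is invented (only its first two rows mirror the example table).

```python
import math
from collections import Counter

def fit_nb(rows, labels, alpha=1.0):
    """Estimate P(C) and P(Ai | C) for binary attributes (Laplace smoothing is an assumption)."""
    n_attrs = len(rows[0])
    class_counts = Counter(labels)
    # value_counts[c][i][v] = number of training rows of class c with attribute i equal to v
    value_counts = {c: [Counter() for _ in range(n_attrs)] for c in class_counts}
    for row, c in zip(rows, labels):
        for i, v in enumerate(row):
            value_counts[c][i][v] += 1
    prior = {c: class_counts[c] / len(labels) for c in class_counts}

    def cond(i, v, c):
        # Smoothed estimate of P(Ai = v | C = c); the 2 is the number of attribute values.
        return (value_counts[c][i][v] + alpha) / (class_counts[c] + 2 * alpha)

    return prior, cond

def posterior_nb(model, row):
    """Return {c: P(C = c | row)} using the Naive Bayes factorization."""
    prior, cond = model
    log_joint = {
        c: math.log(prior[c]) + sum(math.log(cond(i, v, c)) for i, v in enumerate(row))
        for c in prior
    }
    top = max(log_joint.values())                      # subtract the max for numerical stability
    norm = sum(math.exp(s - top) for s in log_joint.values())
    return {c: math.exp(s - top) / norm for c, s in log_joint.items()}

def predict_nb(model, row):
    """arg max_c P(C = c | A1 = a1, ..., An = an)."""
    post = posterior_nb(model, row)
    return max(post, key=post.get)

# Invented toy data: four binary attributes and a class label in {"T", "F"}.
rows = [(0, 1, 1, 0), (1, 0, 1, 1), (0, 1, 0, 0), (1, 0, 0, 1)]
labels = ["T", "F", "T", "F"]
model = fit_nb(rows, labels)
print(predict_nb(model, (0, 1, 1, 0)), posterior_nb(model, (0, 1, 1, 0)))
```

Working in log space avoids underflow when the number of attributes is large. The double counting mentioned among the drawbacks of NB shows up here as the same evidence being multiplied in once per correlated attribute.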

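Returning to the phylogeny part of the lecture, the Jukes-Cantor substitution probabilities and the per-site likelihood P(data | tree) can be checked numerically. The sketch below is a minimal illustration, not the lecture's implementation. Its assumptions: the function names (`jc_prob`, `seq_prob`, `conditional`, `site_likelihood`) are hypothetical, the root distribution P(X7) is taken to be uniform over A, C, G, T, a tree is encoded as nested tuples of (child, edge length) pairs with observed nucleotides at the leaves, and all edge lengths and leaf states in the example are invented.

```python
import math

STATES = "ACGT"

def jc_prob(x, y, t, a=1.0):
    """P(x | y, t) for one site under Jukes-Cantor with substitution rate a."""
    decay = math.exp(-4.0 * a * t)
    if x == y:
        return 0.25 * (1.0 + 3.0 * decay)   # r_t: no net change
    return 0.25 * (1.0 - decay)             # s_t: change to one specific other state

def seq_prob(x, y, t, a=1.0):
    """P(x | y, t) for aligned sequences: product over independently evolving sites."""
    if len(x) != len(y):
        raise ValueError("sequences must be aligned (same length)")
    p = 1.0
    for xi, yi in zip(x, y):
        p *= jc_prob(xi, yi, t, a)
    return p

def conditional(node, a=1.0):
    """Return {state: P(observed leaves below node | node = state)}."""
    if isinstance(node, str):               # leaf: an observed nucleotide
        return {s: 1.0 if s == node else 0.0 for s in STATES}
    result = {s: 1.0 for s in STATES}
    for child, t in node:                   # internal node: tuple of (child, edge length)
        below = conditional(child, a)
        for s in STATES:
            # Sum out the child's state c, given that this node is in state s.
            result[s] *= sum(jc_prob(c, s, t, a) * below[c] for c in STATES)
    return result

def site_likelihood(root, a=1.0):
    """P(site data | tree), assuming a uniform distribution over root states."""
    cond = conditional(root, a)
    return sum(0.25 * cond[s] for s in STATES)

# Invented example with the shape x7 -> (x5, x6), x5 -> (s1, s2), x6 -> (s3, s4).
x5 = (("A", 0.10), ("T", 0.20))
x6 = (("T", 0.15), ("A", 0.05))
x7 = ((x5, 0.30), (x6, 0.25))
print(site_likelihood(x7))
print(seq_prob("AGGGCAT", "AGGGCAT", 0.0))  # t = 0 leaves the sequence unchanged
```

Because sites are assumed i.i.d., the likelihood of a full alignment is the product of `site_likelihood` over its columns, which is the quantity a maximum-likelihood tree search would compare across topologies and edge lengths.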