COMP 538 Introduction to Bayesian Networks
Lecture 16: Wrap-Up
Nevin L. Zhang, HKUST

Recap
  Latent class models
    Clustering
    Clustering criterion: conditional independence
    Drawback: assumption too strong
  Hierarchical latent class (HLC) models
    Identifiability issues: regularity, equivalence
    Hill climbing algorithm

Today
  Phylogenetic (evolution) trees
    Closely related to HLC models
    An example of viewing existing models in the framework of BN
      – Another example: HMM
    Interesting because
      – Eases understanding
      – Techniques in one field can be applied to another
          Structural EM for phylogenetic trees
          Dynamic BNs for speech understanding
      – Development of general purpose algorithms
  Bayesian networks for classification
  Hand waving only

Phylogenetic Tree Outline
  Introduction to phylogenetic trees
  Probabilistic models of evolution
  Tree reconstruction

Phylogenetic Trees
  Assumption
    All organisms on Earth have a common ancestor.
    This implies that any set of species is related.
  Phylogeny
    The relationship between any set of species.
  Phylogenetic tree
    Usually, the relationship can be represented by a tree, which is called a phylogenetic (evolution) tree (this is not always true).

Phylogenetic Trees
  [Figure: phylogenetic trees drawn along a time axis, with current-day species at the bottom: giant panda, lesser panda, moose, goshawk, duck, vulture, alligator.]

Phylogenetic Trees
  Taxa (sequences) identify species
  Edge lengths represent evolution time
  Assumption: bifurcating tree topology
  [Figure: bifurcating tree along a time axis with node sequences AAGGCCT, AAGACTT, AGCACTT, AAGGCAT, AGGGCAT, AGCACAA, TAGACTT, TAGCCCA, AGCGCTT.]

Probabilistic Models of Evolution
  Characterize relationship between taxa using substitution probability:
    – P(x | y, t): probability that ancestral sequence y evolves into sequence x along an edge of length t
  [Figure: tree with root x7, internal nodes x5 and x6 (edges t5, t6), and leaves s1, s2, s3, s4 (edges t1, t2, t3, t4).]
    – P(X7), P(X5|X7, t5), P(X6|X7, t6), P(S1|X5, t1), P(S2|X5, t2), …

Probabilistic Models of Evolution
  What should P(x|y, t) be?
  Two assumptions of commonly used models:
    There are only substitutions, no insertions/deletions (aligned)
      – One-to-one correspondence between sites in different sequences
    Each site evolves independently and identically
      $P(x \mid y, t) = \prod_{i=1}^{m} P(x(i) \mid y(i), t)$, where m is the sequence length
  [Figure: tree with node sequences AAGGCCT, AAGACTT, AGCACTT, AAGGCAT, AGGGCAT, TAGACTT, TAGCCCA, AGCACAA, AGCGCTT.]

Probabilistic Models of Evolution
  What should P(x(i)|y(i), t) be?
  Jukes-Cantor (Character Evolution) Model [1969]
    – Rate of substitution a (Constant or parameter?)

           A     C     G     T
      A   r_t   s_t   s_t   s_t
      C   s_t   r_t   s_t   s_t
      G   s_t   s_t   r_t   s_t
      T   s_t   s_t   s_t   r_t

    $r_t = \frac{1}{4}\left(1 + 3e^{-4at}\right)$
    $s_t = \frac{1}{4}\left(1 - e^{-4at}\right)$
  Limit values when t = 0 or t = infinity?
  Multiplicativity (lack of memory)
    $P(c \mid a, t_1 + t_2) = \sum_{b} P(b \mid a, t_1)\, P(c \mid b, t_2)$

Tree Reconstruction
  Given: collection of current-day taxa
    AGGGCAT, TAGCCCA, TAGACTT, AGCACAA, AGCGCTT
  Find: tree
    Tree topology: T
    Edge lengths: t
  Maximum likelihood
    Find tree to maximize P(data | tree)
  [Figure: tree with the taxa AGGGCAT, AGCACAA, TAGACTT, TAGCCCA, AGCGCTT at its leaves.]

Tree Reconstruction
  When restricted to one particular site, a phylogenetic tree is an HLC model where
    The structure is a binary tree and variables share the same state space.
    The conditional probabilities are from the character evolution model, parameterized by edge lengths instead of the usual parameterization.
    The model is the same for different sites.
  [Figure: tree with node sequences AAGGCCT, AAGACTT, AGCACTT, AGGGCAT, AGCACAA, TAGACTT, TAGCCCA, AGCGCTT.]
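
The Jukes-Cantor probabilities and the objective P(data | tree) described above can be checked numerically. What follows is a minimal sketch under stated assumptions, not the lecture's own implementation. The function names (jc_prob, site_partials, log_likelihood), the nested-list tree encoding, the edge lengths, the rate a = 1, the choice of four of the slide's sequences as leaves, and the uniform root distribution P(X7) = 1/4 are all assumptions made for illustration. The sketch sums out the internal nodes bottom-up (the standard pruning recursion) and multiplies per-site likelihoods, which relies on the assumption that sites evolve independently and identically.

```python
import math

BASES = "ACGT"

def jc_prob(x, y, t, a=1.0):
    """P(x | y, t) for one site under Jukes-Cantor with substitution rate a."""
    e = math.exp(-4.0 * a * t)
    if x == y:
        return 0.25 * (1.0 + 3.0 * e)   # r_t: the base is unchanged
    return 0.25 * (1.0 - e)             # s_t: one specific substitution

def site_partials(node, site, a=1.0):
    """Return {b: P(observed leaves below node | node = b)} for one site.

    A leaf is a sequence string; an internal node is a list of
    (child, edge_length) pairs (an assumed encoding).
    """
    if isinstance(node, str):
        return {b: 1.0 if node[site] == b else 0.0 for b in BASES}
    partial = {b: 1.0 for b in BASES}
    for child, t in node:
        below = site_partials(child, site, a)
        for b in BASES:
            partial[b] *= sum(jc_prob(c, b, t, a) * below[c] for c in BASES)
    return partial

def log_likelihood(tree, n_sites, a=1.0):
    """log P(data | tree), assuming a uniform distribution at the root."""
    total = 0.0
    for site in range(n_sites):
        root = site_partials(tree, site, a)
        total += math.log(sum(0.25 * root[b] for b in BASES))
    return total

# Hypothetical tree with the shape root x7 -> (x5, x6), two leaves under each.
x5 = [("AGGGCAT", 0.10), ("AGCACAA", 0.20)]
x6 = [("TAGACTT", 0.10), ("TAGCCCA", 0.15)]
tree = [(x5, 0.30), (x6, 0.25)]

# Limit values: t = 0 gives the identity, large t gives 1/4 everywhere.
print(jc_prob("A", "A", 0.0), jc_prob("C", "A", 0.0))
print(jc_prob("A", "A", 50.0), jc_prob("C", "A", 50.0))

# Multiplicativity: summing over the intermediate base b.
lhs = jc_prob("C", "A", 0.2 + 0.5)
rhs = sum(jc_prob(b, "A", 0.2) * jc_prob("C", b, 0.5) for b in BASES)
print(abs(lhs - rhs) < 1e-12)

print(log_likelihood(tree, n_sites=7))
```

The sketch only evaluates one fixed tree. Tree reconstruction would search over topologies and edge lengths for the tree that maximizes this value.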

Tree Reconstruction
  Current-day taxa: AGGGCAT, TAGCCCA, TAGACTT, AGCACAA, AGCGCTT
  Samples for HLC model. One sample per site. The samples are i.i.d.
    1st site: (A, T, T, A, A), 2nd site: (G, A, A, G, G), 3rd site: (G, G, G, C, C), …
  [Figure: tree with node sequences AAGGCCT, AAGACTT, AGCACTT, AGGGCAT, AGCACAA, TAGACTT, TAGCCCA, AGCGCTT.]

Tree Reconstruction
  Finding ML phylogenetic tree == Finding ML HLC model
  Model space:
    Model structures: binary tree where all variables share the same state space, which is known.
    Parameterization: one parameter for each edge. (In general, P(x|y) has (|x| - 1)|y| free parameters.)

Bayesian Networks for Classification
  The problem:
    Given data:

      A1   A2   …    An   C
      0    1    1    0    T
      1    0    1    1    F
      ..   ..   ..   ..   ..

    Find mapping
      – (A1, A2, …, An) ↦ C
  Possible solutions
    ANN
    Decision tree (Quinlan)
    …

Bayesian Networks for Classification
  Naïve Bayes model
    From data, learn
      – P(C), P(Ai|C)
    Classification
      – arg max_c P(C=c|A1=a1, …, An=an)
    Very good in practice

Bayesian Networks for Classification
  Drawback of NB:
    Assumes attributes are mutually independent given the class variable
    Often violated, leading to double counting.
  Fixes:
    General BN classifiers
    Tree augmented Naïve Bayes (TAN) models
    Hierarchical NB
    …

Bayesian Networks for Classification
  General BN classifier
    Treat the class variable as just another variable
    Learn a BN.
    Classify the next instance based on values of variables in the Markov blanket of the class variable.
    Pretty bad because it does not utilize all available information

Bayesian Networks for Classification
  TAN model
    Friedman, N., Geiger, D., and Goldszmidt, M. (1997). Bayesian network classifiers. Machine Learning, 29:131-163.
    Capture dependence among attributes using a tree structure.
    During learning:
      – First learn a tree among attributes: use the Chow-Liu algorithm
      – Add the class variable and estimate parameters
    Classification
      – arg max_c P(C=c|A1=a1, …, An=an)

Bayesian Networks for Classification
  Hierarchical Naïve Bayes models
    N. L. Zhang, T. D. Nielsen, and F. V. Jensen (2002). Latent variable discovery in classification models. Artificial Intelligence in Medicine, to appear.
    Capture dependence among attributes using latent variables
    Detect interesting latent structures besides classification
    Currently, slow
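
The tree-learning step of the TAN procedure above can be sketched as follows. This is a minimal sketch under stated assumptions, not the cited authors' code. The function names (mutual_information, chow_liu_edges) and the toy data are hypothetical, attributes are assumed discrete, and edge weights are plain empirical mutual information as in the original Chow-Liu algorithm. Friedman et al.'s TAN weights edges by mutual information conditioned on the class instead; one way to obtain that would be to compute the weights separately within each class and average them by class frequency.

```python
import math
from collections import Counter
from itertools import combinations

def mutual_information(data, i, j):
    """Empirical mutual information I(A_i; A_j) in nats."""
    n = len(data)
    count_i = Counter(row[i] for row in data)
    count_j = Counter(row[j] for row in data)
    count_ij = Counter((row[i], row[j]) for row in data)
    # p(x, y) * log(p(x, y) / (p(x) p(y))), written with raw counts.
    return sum((c / n) * math.log(c * n / (count_i[x] * count_j[y]))
               for (x, y), c in count_ij.items())

def chow_liu_edges(data):
    """Edges of a maximum-weight spanning tree over the attributes.

    Uses Prim's algorithm: grow the tree from attribute 0, each time adding
    the heaviest edge that connects a new attribute to the tree.
    """
    n_attrs = len(data[0])
    weight = {(i, j): mutual_information(data, i, j)
              for i, j in combinations(range(n_attrs), 2)}
    in_tree, edges = {0}, []
    while len(in_tree) < n_attrs:
        crossing = [(i, j) for (i, j) in weight
                    if (i in in_tree) != (j in in_tree)]
        best = max(crossing, key=weight.get)
        edges.append(best)
        in_tree.update(best)
    return edges

# Hypothetical data: attribute 1 copies attribute 0, attribute 2 is unrelated.
toy = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
print(chow_liu_edges(toy))
```

In a full TAN learner the undirected edges would then be directed away from a chosen root attribute, the class variable added as a parent of every attribute, and the conditional probability tables estimated from counts.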