* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download ppt
Gene desert wikipedia , lookup
Koinophilia wikipedia , lookup
Genome (book) wikipedia , lookup
Primary transcript wikipedia , lookup
Human genome wikipedia , lookup
Population genetics wikipedia , lookup
Therapeutic gene modulation wikipedia , lookup
Adaptive evolution in the human genome wikipedia , lookup
Designer baby wikipedia , lookup
Gene expression profiling wikipedia , lookup
Viral phylodynamics wikipedia , lookup
Non-coding DNA wikipedia , lookup
Cre-Lox recombination wikipedia , lookup
Gene expression programming wikipedia , lookup
Minimal genome wikipedia , lookup
Point mutation wikipedia , lookup
Pathogenomics wikipedia , lookup
Artificial gene synthesis wikipedia , lookup
Metagenomics wikipedia , lookup
Microevolution wikipedia , lookup
Site-specific recombinase technology wikipedia , lookup
5 Open Problems in Bioinformatics •Pedigrees from Genomes •Comparative Genomics of Alternative Splicing •Viral Annotation •Evolving Turing Patterns •Protein Structure Evolution From genomes to pedigrees Coalescent Rebombination process Pedigree process Seqeunce/Individual Boundary Three Processes 1. Recombination 3. The Mutational Process From Yun Song 2. Choosing Parents Probability of Data given a pedigree. Elston-Stewart (1971) -Temporal Peeling Algorithm: Mother Father Condition on parental states Recombination and mutation are Markovian Lander-Green (1987) - Genotype Scanning Algorithm: Mother Father Condition on paternal/maternal inheritance Recombination and mutation are Markovian Comment: Obvious parallel to Wiuf-Hein99 reformulation of Hudson’s 1983 algorithm Benevolent Mutation and Recombination Process Genomes with r and m/r --> infinity r - recombination rate, m - mutation rate •Counting within a small interval would reveal the length of the path connecting the two segments. •Siblings are readily revealed, since they will have segments with 2m density of mutations •The distribution of path lengths are readily observable between two sequences •All embedded phylogenies are observable From Phylogenies to Pedigrees Mike’s counter example, linkage and individuals Different Pedigrees Same Phylogenies Pedigree 1 2 1 2 Sibling Sequences come from different parents. Pedigree 2 Individual 1 1 Gluing Phylogenies together 1 2 1 2 ? grandparents A recombinants’ parent are sister sequences. 1 2 1 2 1 2 1 2 Comparative Genomics of Alternative Splicing From Transcripts to the AS-Graph S S E E 1. How well known is the AS-graph as a function number of transcripts? 2. A family and distribution of transcripts, can they be explained an AS-graph with probabilities at donor sites or do we need probabilities for (donor,acceptor) pairs? Or possibly even more complicated situations. And is sampling transcripts good enough to distinguish these situations. Mini-project: reliability of AS-detection. Choose Idealized AS-Graph: 1. Genome 2. Choose donor and acceptor sites in random pairs. 3. For each possible splice pair assign probability for choosing it. This should define a probability for all transcripts. • Generate a set of transcripts. • Reconstruct AS-Graph. Key questions: 1. How many transcripts must be sampled to detect AS. 2. How well will the AS-Graph be recovered? Optimal DAG (directed acyclic graph) under restrictions Finding a set of annotations: 1. Find set of paths, maximizing sum of scores. Sub-optimal Paths: 2. The score of minimal path must be above threshold. Two paths must differ significantly: An enclosed area, the maximal height must be d higher than the boundary defining it. Height(i,j) = di,j + di,j 1. Does known AS genes have more CTO structure than non-AS genes? 2. Do the AS correspond to the CTO structure 3. Is the CTO structure evolutionary conserved? Phylogenetically related ASGs E E SS E E SS E E SS 1. Is ASG conserved? 2. What is conserved? 3. How is selection along position dependent on splicing status? Virus Annotation Classes of Gene Structures http://www.tulane.edu/~dmsander/WWW/335/Diarrhoea.html Diarrhoea Causing Arrangements Illustrating the 3 main classes of gene structures: Unidirectional, Convergent and Divergent. http://www.tulane.edu/~dmsander/WWW/335/Retroviruses.html http://www.tulane.edu/~dmsander/WWW/335/Papovaviruses.html Retroviridae Arrangements Papoviridae Arrangement The Problems of Viral Annotation •HMM gene structure generator (McCauley) •Gene Structure Evolution (de Groot) •Alignment (Caldeira, Lunter, Rocco) •Recombination (Lyngsø, Song) •Multiple constraints: RNA secondary structure, gene conservation, binding/transcriptional instructional sites. Our 8 State HMM which allows for Unidirectional overlapping gene structures HMM States • Non-coding • Coding RF1 • Coding RF2 • Coding RF3 • Coding RF1,2 • Coding RF1,3 • Coding RF2,3 • Coding RF1,2,3 Combining Levels of Selection. Assume multiplicativity: fA,B = fA*fB Protein-Protein Hein & Støvlbæk, 1995 Codon Nucleotide Independence Heuristic Jensen & Pedersen, 2001 Contagious Dependence Protein-RNA Singlet Doublets Contagious Dependence Table illustrating the performance benefit in Sensitivity we obtain utilizing a Phylogenetic HMM. We extend the HMM model to include evolutionary information from 13 aligned HIV2 sequences. HIV2 Genomes SS HMM Sensitivity PHMM Sensitivity Del Sensitivity SS HMM Specificity PHMM Specificity Del Specificity M30502 0.8913 0.9765 0.0852 0.9878 0.9753 -0.0125 J04542 0.8458 0.9173 0.0714 1.0000 0.9956 -.0044 D00835 0.8796 0.9432 0.0636 0.9920 0.9733 -0.0187 M15390 0.9310 0.9971 0.0661 1.0000 0.9869 -0.0131 J03654 0.8261 0.9971 0.1709 1.0000 0.9865 -0.0135 AY509259 0.8697 0.9256 0.0559 1.0000 0.9886 -0.0114 AY509260 0.8257 0.9101 0.0844 0.9928 0.9792 -0.0136 J04498 0.8961 0.9737 0.0776 1.0000 0.9911 -0.0089 AF082339 0.9074 0.9650 0.0577 0.9842 0.9773 -0.0069 U22047 0.9028 0.9874 0.0847 0.9865 0.9744 -0.0121 U27200 0.8769 0.9453 0.0684 0.9928 0.9748 -0.0180 LO7625 0.8340 0.9680 0.1340 1.0000 0.9607 -0.0393 L36874 0.8653 0.9957 0.1303 0.9980 0.9766 -0.0214 MEANS 0.8732 0.9617 0.0885 0.9949 0.9800 -0.0149 GenBank: Centralized resource for publicly available viral sequence data. Entrez Genomes currently contains 2120 Reference Sequences for 1510 viral genomes and 36 Reference Sequences for viroids. http://www.ncbi.nlm.nih.gov/Genbank/ http://www.ncbi.nlm.nih.gov/genomes/VIRUSES/viruses.html Properties of overlapping genes are conserved across microbial genomes. Genome Res. 2004 Nov;14(11):2268-72. Within microbial genomes, one third of annotated genes contain some degree of overlap, and one third of these are either Convergent or Divergent. Krakauer, D.C. Stability and evolution of overlapping genes. Evolution 54: 731-739 (2000) Genome Res. 2004 Nov;14(11):2268-72. General preponderance of overlapping gene structures is roughly a 90:9:1 ratio split across Unidirectional, Convergent and Divergent arrangements. Turing Patterns Mathematical models to understand biological patterns From Maini’s Home Page: http://www.maths.ox.ac.uk/~maini Turing Model u ( x, t ) 2 D u f (u, v,{ pu }) u t v( x, t ) Dv 2 v g (u, v, { pv }) t Different parameters lead to different patterns Stripes: p small Spots: p large [From: Leppanen et al. Dimensionality effects in Turing pattern formation, Int. J. Mod. Phys. B 17, 5541-5553 (2003)] ut Du 2u (u av - uv 2 - puv) v D 2 v (bv hu uv 2 puv) v t 3 suggestions 1. Networks and Turing Patterns 2. Stochastic Partial Differential Equations 3. Phylogenetically related Turing Patterns Evolutionary Models of Protein Structure Evolution ? ? ? ? Known a-globin Unknown 300 amino acid changes 800 nucleotide changes 1 structural change 1.4 Gyr Known Myoglobin 1. Given Structure what are the possible events that could happen? 2. What are their probabilities? Old fashioned substitution + indel process with bias. Bias: Folding(Sequence Structure) & Fitness of Structure 3. Summation over all paths. 2 suggestions A. Structure (Homology Modelling, Topology) Folding(Sequence Structure) As a first approximation similar structures should be compared and the problem could be solved by comparative modelling. Fast Homology Modelling Using Protein Topology as Hidden Variable Fitness of Structure – such functions are common place in guiding prediction programs. B. MCMC Questions to be asked Negative Note: Protein Structure Analysis is much harder than Sequence Analysis. Much of the first hand impression will remain: “Structures are either trivially similar or highly dissimilar” – the middle ground is empty. At Gyr scale other rearrangements occur. Positive Note: If it works Test of smooth/catastrophic structure evolution Separation of analogous/homologous similarities Protein Evolution in General How closely linked are homologous and structurally equivalent sites? http://www.biochem.ucl.ac.uk/bsm/cath/ http://scop.mrc-lmb.cam.ac.uk/scop/ Summary Pedigrees from Genomes Does infinite genomes determine pedigrees? How many pedigrees are there? Comparative Genomics of Alternative Splicing How well do you know the ASG? How do you measure selection on the ASG? Viral Annotation How well can you annotate viruses from observed evolution? Evolving Turing Patterns Turing Patterns and Networks Stochastic Turing Patterns Phylogenetically Related Turing Patterns Protein Structure Evolution Full Model of Structure Evolution Model of Protein Topology Evolution