Amsterdam 2004 - Theoretical Biology & Bioinformatics
Bioinformatics and Evolutionary Genomics
Gene Trees, Gene Duplications (I), and Orthology

Phylogenetic gene trees: how to make them
• Homology: are two pieces of sequence related? Trees: when did they diverge (how are they related)?
• Start from a multiple sequence alignment.
• All multiple sequence alignment programs make a global alignment, so feed them regions that you know are homologous → domains!
• MUSCLE / clustal / t_coffee
• Visual inspection of alignments (gaps, fragments vs complete sequences, weird things).

Put homologs in the alignment
• Even if they are not homologous, MUSCLE will align them (muscle/clustalw implicitly "assumes" that the sequences you feed it are homologous).
• And in a phylogeny program, non-homologous sequences will be clustered.

Visual inspection of alignments: ?!

An additive tree which is wrongly reconstructed by UPGMA
[Figure: the additive tree and the UPGMA reconstruction; branch lengths not recoverable from the extracted text]

Distance matrix:
        A    B    C    D
  A     x   12    9    9
  B    12    x    9    7
  C     9    9    x    6
  D     9    7    6    x

After UPGMA joins C and D (smallest distance, 6):
        A    B   CD
  A     x   12    9
  B    12    x    8
  CD    9    8    x

After UPGMA joins B and CD (smallest distance, 8):
        A  BCD
  A     x   10
  BCD  10    x

Neighbour-Joining (Saitou and Nei, 1987)
• Global measure: keeps total branch length minimal.
• At each step, join two nodes such that distances are minimal (criterion of minimal evolution).
• Leads to an unrooted tree.
• At each step all possible "neighbour joinings" are checked, and the one corresponding to the minimal total tree length (calculated by adding all branch lengths) is taken.

Neighbour-Joining: worked example (N = 4; r = net divergence, the row sum)
        A    B    C    D     r
  A     x   12    9    9    30
  B    12    x    9    7    28
  C     9    9    x    6    24
  D     9    7    6    x    22

M_ab = d_ab - (r_a + r_b)/(N - 2)
M_ab = 12 - (30 + 28)/(4 - 2) = -17

        A    B    C    D
  A     x  -17  -18  -17
  B          x  -17  -18
  C               x  -17
  D                    x

AC → U
d_au = d_ac/2 + (r_a - r_c)/(2(N - 2)) = 9/2 + (30 - 24)/(2*2) = 6
d_cu = d_ac - d_au = 9 - 6 = 3
d_bu = (d_ab + d_bc - d_ac)/2 = (12 + 9 - 9)/2 = 6
d_du = (d_ad + d_cd - d_ac)/2 = (9 + 6 - 9)/2 = 3
[Figure: intermediate tree after joining A and C at U]

        U    B    D     r
  U     x    6    3     9
  B     6    x    7    13
  D     3    7    x    10

        U    B    D
  U     x  -16  -16
  B          x  -16
  D               x

All three pairs tie at -16, so any pair may be joined, e.g.
UB → V
d_vu = d_ub/2 + (r_u - r_b)/(2(N - 2)) = 6/2 + (9 - 13)/(2*1) = 3 - 2 = 1
d_vb = d_ub - d_vu = 6 - 1 = 5
d_dv = (d_ud + d_bd - d_ub)/2 = (3 + 7 - 6)/2 = 2
[Figure: final tree with branch lengths A-U = 6, C-U = 3, U-V = 1, B-V = 5, D-V = 2]

Unequal rates between species are a very real phenomenon

Character based: parsimony and maximum likelihood
• Two-way classification in phylogeny: distance based vs character based.
• Character state method: searches "directly" (i.e. without defining distances) for a tree that fits best to the data (the alignment).

Maximum likelihood
• Search for the tree with the highest maximum likelihood value.
• One searches for the maximum likelihood (ML) value for the character state configurations among the sequences under study for each possible tree, and chooses the tree with the largest ML value as the preferred tree.
• You have to specify a model of sequence evolution.
• The likelihood for all sites is the product of the likelihoods for the individual sites, assuming all the nucleotide sites evolve independently.
• The maximum likelihood method computes the probabilities for all possible combinations of ancestral states!
• ML methods evaluate phylogenetic hypotheses in terms of the probability that a proposed model of the evolutionary process and the proposed unrooted tree (hypothesis) would give rise to the observed data (the alignment). The tree found to have the highest (log) ML value is considered to be the preferred tree.

Interpreting trees (recurring theme)

Interpreting the tree
• Taxonomic findings
• Paraphyly
• Monophyly

Interpreting the tree
• Outgroup: place the root between a distant homologous sequence and the rest of the group (b).
• Midpoint: place the root at the midpoint of the longest path (sum of branches between any two leaves). NB njplot.
• Gene duplication: place the root between paralogous gene copies (b).
• NB all are affected by rates!

Simple example (kinase)
Two genes per species: how to differentiate between one ancient duplication and two recent duplications?
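Before the kinase example is developed, the neighbour-joining arithmetic from earlier in the lecture can be checked numerically. The sketch below is a minimal illustration, not the lecture's own code: the function names, the internal-node labels (U, V, ...) and the tie-breaking rule (the first minimal pair in input order wins) are assumptions chosen so that the run follows the same joins as the slides. A small average-linkage UPGMA merge order is included for comparison.

```python
from itertools import combinations


def neighbour_joining(names, dist):
    """Return the edges (node, node, length) of an unrooted NJ tree.

    names: list of leaf labels.
    dist:  dict mapping frozenset({x, y}) -> distance.
    """
    d = dict(dist)
    nodes = list(names)
    edges = []
    labels = iter("UVWXYZ")  # hypothetical internal labels, enough for small examples
    while len(nodes) > 2:
        n = len(nodes)
        # net divergence r_i: sum of distances from i to all other nodes
        r = {i: sum(d[frozenset((i, k))] for k in nodes if k != i) for i in nodes}

        def m(pair):
            # rate-corrected distance M_ij = d_ij - (r_i + r_j)/(N - 2)
            i, j = pair
            return d[frozenset(pair)] - (r[i] + r[j]) / (n - 2)

        # min() returns the first minimal pair, which is our tie-breaking rule
        i, j = min(combinations(nodes, 2), key=m)
        u = next(labels)
        d_ij = d[frozenset((i, j))]
        d_iu = d_ij / 2 + (r[i] - r[j]) / (2 * (n - 2))
        edges += [(i, u, d_iu), (j, u, d_ij - d_iu)]
        for k in nodes:
            if k not in (i, j):
                d[frozenset((k, u))] = (
                    d[frozenset((i, k))] + d[frozenset((j, k))] - d_ij
                ) / 2
        nodes = [u] + [k for k in nodes if k not in (i, j)]
    a, b = nodes
    edges.append((a, b, d[frozenset((a, b))]))
    return edges


def upgma_merge_order(names, dist):
    """Return the UPGMA merges as (cluster, cluster, average distance)."""
    clusters = [(x,) for x in names]

    def avg(c1, c2):
        total = sum(dist[frozenset((x, y))] for x in c1 for y in c2)
        return total / (len(c1) * len(c2))

    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda p: avg(*p))
        merges.append((c1, c2, avg(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges


# Distance matrix from the slides
D = {
    frozenset(pair): value
    for pair, value in {
        ("A", "B"): 12, ("A", "C"): 9, ("A", "D"): 9,
        ("B", "C"): 9, ("B", "D"): 7, ("C", "D"): 6,
    }.items()
}

edges = neighbour_joining(["A", "B", "C", "D"], D)
merges = upgma_merge_order(["A", "B", "C", "D"], D)
print(edges)
print(merges)
```

With this matrix the sketch joins A with C and then U with B, giving the branch lengths A-U = 6, C-U = 3, U-V = 1, B-V = 5 and D-V = 2, while UPGMA groups C with D first, then B, then A.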
• Two genes on human chromosomes (Human A and Human B) and two genes on mouse chromosomes (Mouse A and Mouse B).
[Figure: duplications and speciations; panels 1, 2, 3 and "?"]

Interpreting the tree: duplications vs speciations, going pseudo 3D
[Figure label: Speciation]

Interpreting the tree: gene trees vs species trees

Interpreting the tree
Example: vertebrate duplications
• Tetraploidy?

Interpreting the tree: Horizontal Gene Transfer (HGT)
[Figure labels: Bacteria, Eukarya, Archaea]

Jargon for interpretation: orthology (and paralogy) as a specification of homology when discussing two species
[Figure labels: human1, mouse1, human2]
• Fitch 1970: two genes in two species are orthologous if they derive from one gene in their last common ancestor ("the corresponding gene").
• Genes can diverge by speciation or by duplication.
• "Gene duplication by cell division"
• Implied to have the same function.

Orthology ~ annotating internal nodes as duplications or speciations
• Because of the definition, how does that translate to a tree?
• With or without species phylogeny?

Terminology: inparalogs, outparalogs, co-orthologs
[Figure labels: Inparalogs, Co-orthologs, Outparalogs]

Importance of orthology for comparative genomics: more resolution
[Figure: gene tree with leaves Ec, Hi, Bs, Af, Af, Ec, Bs, Mg]
• Gene family present in Ec, Hi, Bs, Mg, Af
• Orthologs 1 present in Ec, Hi, Bs, Af
• Orthologs 2 present in Ec, Bs, Mg, Af
• Phenotype ~ gene correlation
• Function prediction if Hi is the only biochemically characterized enzyme
• Function prediction by co-occurrence
• Evolution of gene content: loss vs duplication

Heuristics for orthology definition
• Needed because:
  – Speed (MSA plus reliable tree building is slow)
  – Difficulty in deciding which things you should make a tree of in the first place (PFAM?)
  – Difficulty in operationalizing nuanced tree orthology into group orthology
• Historically: bidirectional best BLAST hits (BBH)

BBH
[Figure: tree with leaves Ec1, Hi, Bs1, Af, Af, Ec2, Bs2, Mg; pairs compared: Ec1-Bs1, Ec1-Bs2, Ec2-Bs1, Ec2-Bs2]
Extracting tree-like information from pairwise similarities (values shown: 50%, 35%, 33%, 48%)

BBH issues 1: unequal rates
[Figure: gene tree with duplication and speciation nodes marked, and 1:1 orthologs and outparalogs annotated. Leaves: prpC N. meningitidis, prpC E. coli, prpC P. aeruginosa, VCh1337 V. cholerae, mmgD B. subtilis, mmgD B. halodurans, citZ B. subtilis, citZ B. halodurans, VCh2092 V. cholerae, gltA P. multocida, gltA E. coli, gltA P. aeruginosa, gltA N. meningitidis]

BBH issues 2: ignores inparalogs
[Figure: tree with leaves Ec1, Hi, Bs1, Af, Af, Ec2, Bs2, Bs3]
Pairwise similarities: Ec1-Hi 70%; Ec2-Hi 38%; Ec2-Bs2 48%; Ec2-Bs3 51%; (Bs2-Bs3 70%)
• Prevalence? Depends on e.g. evolutionary distance, group vs pairwise orthology.
• At least 16% (prokaryotes)
• INPARANOID

BBH issues 3: differential gene loss
[Figure: tree with leaves Ec1, Hi, Bs1, Af, Af, Ec2, Bs2, Mg; Mg-Hi 35%]

Other large-scale orthology schemes: Inparanoid
Erik Sonnhammer

Orthologous groups
• The solution to the non-transitivity of the concept of orthology sensu stricto is "group orthology".
• Conceptually: all proteins that are directly descended from one protein in the last common ancestor are considered orthologous to each other.
• Operationally: combine all connected "best triangular hits" into Clusters of Orthologous Groups (COGs, Tatusov et al., 1997). WWW.NCBI.NLM.GOV
(Watch out for fusion/fission though!!!)

Large-scale orthology schemes: COG
• 1. Perform the all-against-all protein sequence comparison.
• 2. Detect and collapse obvious paralogs, that is, proteins from the same genome that are more similar to each other than to any proteins from other species.
• 3. Detect triangles of mutually consistent, genome-specific best hits (BeTs), taking into account the paralogous groups detected at step 2.
• 4. Merge triangles with a common side to form COGs.
• 5. A case-by-case analysis of each COG. This analysis serves to eliminate false positives and to identify groups that contain multidomain proteins by examining the pictorial representation of the BLAST search outputs. The sequences of detected multidomain proteins are split into single-domain segments and steps 1–4 are repeated with these sequences, which results in the assignment of individual domains to COGs in accordance with their distinct evolutionary affinities.
• 6. Examination of large COGs that include multiple members from all or several of the genomes using phylogenetic trees, cluster analysis and visual inspection of alignments; as a result, some of these groups are split into two or more smaller ones that are included in the final set of COGs.

Other large-scale orthology schemes: OrthoMCL

The too ambitious comparative genomics dilemma: duplication/speciation vs domains
[Figure: timeline from the very distant past to the present, against sequence divergence. Labels: domain composition, accretion; single gene; gene fusion; domains; domain cassettes; structural elements?; trivial orthologs ~ orthologs; homologs; distant homologs]
I.e. genome comparison between close species: no domain considerations, sub-sub-orthologs. Between distant homologs: loads of domain considerations.

Implication of coupling between duplication & domain accretion for evolution and function prediction
• For some genes life is easy: 1:1:1 orthologs, no fusion / domains, a couple of losses.
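The bidirectional best hit (BBH) heuristic described earlier, and the way it drops inparalogs, can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function names are hypothetical, similarities are treated as symmetric, and the percentages are the ones quoted on the "BBH issues 2" slide, with the assignment of genes to genomes (Ec, Hi, Bs) being a reading of the figure labels rather than something the slides state.

```python
def best_hits(scores, genome_of):
    """Best hit of every gene in every other genome.

    scores:    {(gene_a, gene_b): similarity}, each pair listed once.
    genome_of: {gene: genome label}.
    Returns {(query_gene, target_genome): (best_target_gene, similarity)}.
    """
    best = {}
    for (a, b), s in scores.items():
        for query, target in ((a, b), (b, a)):
            if genome_of[query] == genome_of[target]:
                continue  # within-genome hits say nothing about orthology here
            key = (query, genome_of[target])
            if key not in best or s > best[key][1]:
                best[key] = (target, s)
    return best


def bidirectional_best_hits(scores, genome_of):
    """Pairs of genes that are each other's best hit across two genomes."""
    best = best_hits(scores, genome_of)
    pairs = set()
    for (query, _target_genome), (target, _s) in best.items():
        back = best.get((target, genome_of[query]))
        if back is not None and back[0] == query:
            pairs.add(frozenset((query, target)))
    return pairs


# Similarities (%) quoted on the "BBH issues 2" slide
scores = {
    ("Ec1", "Hi"): 70,
    ("Ec2", "Hi"): 38,
    ("Ec2", "Bs2"): 48,
    ("Ec2", "Bs3"): 51,
    ("Bs2", "Bs3"): 70,
}
genome_of = {"Ec1": "Ec", "Ec2": "Ec", "Hi": "Hi", "Bs2": "Bs", "Bs3": "Bs"}

bbh = bidirectional_best_hits(scores, genome_of)
print(sorted(sorted(p) for p in bbh))
```

Here Ec2 pairs with Bs3 only, so Bs2 is left without an ortholog even though Bs2 and Bs3 are more similar to each other than either is to Ec2. That is the inparalog problem the slide points at.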
• A minority of families, but a large proportion of proteins, is a formidable challenge: domain permutations and duplications make life complicated.

Orthology & function prediction
• BLAST with a newly sequenced globin from frog: what kind of globin is it?
[Figure: globins, with the BLAST query marked]

Orthologous & function prediction vs homologous that are not orthologous & function
• Orthologs tend to have the exact same molecular function; mere HTANOs (homologs that are not orthologs) do not,
• and they operate in the same "pathway".
• Orthologs mostly have the same domain composition.

… but inparalogs: fate after duplication: neofunctionalization or subfunctionalization
• Even evolutionarily true orthologs can have "different functions".
• Both co-orthologs have taken over some aspect of the ancestral function and have lost other aspects.
• Acquisition of a new function, or loss of function: one of the co-orthologs does something different now.

Does retaining the ancestral "role" correlate with speed of sequence evolution? Yes, but a substantial minority is inconsistent.
[Figure: counts shown, 386 and 220]

rfbB / rffG
RfbB and RffG catalyze the same reaction, but are involved in two different biological processes.
• rfb gene cluster: biosynthesis of O-specific polysaccharides (inner membrane).
• rff gene cluster: complex biosynthesis of enterobacterial common antigen (outer membrane).

Why do we observe inconsistencies?
[Figure: histogram of frequency (# cases) against sequence identity between inparalogs (%), 0 to 100, with separate series for consistent and inconsistent cases]
• Not because of chance due to lack of divergence time.

Why do we observe inconsistencies?
• Similar sequence divergence of inparalogs relative to their single ortholog; molecular function similar?
• Any inconsistencies are then a chance outcome: both duplicates have diverged, but at (roughly) the same evolutionary speed (most amino acid substitutions have only been subject to purifying selection and not to adaptive selection).
• In certain orthology schemes gene order is given precedence over highest similarity.
• The gene at the conserved position is considered the "original" and the other duplicate the "copy".
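The comparison of each inparalog with the single ortholog can be made concrete with a simple percent-identity calculation. The sketch below is illustrative only: the function names are hypothetical and the three aligned sequences are invented toy data, not sequences from the lecture. Identity is computed over alignment columns where neither sequence has a gap, which is one common convention among several.

```python
def percent_identity(a, b):
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(a) != len(b):
        raise ValueError("sequences must be rows of the same alignment")
    columns = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not columns:
        return 0.0
    return 100.0 * sum(x == y for x, y in columns) / len(columns)


def divergence_asymmetry(inparalog1, inparalog2, ortholog):
    """Identity of each inparalog to the single ortholog, and their difference.

    A small difference suggests both duplicates diverged at a similar speed;
    a large one suggests one copy evolved faster than the other.
    """
    id1 = percent_identity(inparalog1, ortholog)
    id2 = percent_identity(inparalog2, ortholog)
    return id1, id2, abs(id1 - id2)


# Toy aligned sequences (invented for illustration)
ortholog = "MKTAYIAKQR"
inparalog1 = "MKTAYIAKQK"  # one substitution relative to the ortholog
inparalog2 = "MRSAYLGKHR"  # five substitutions relative to the ortholog

id1, id2, gap = divergence_asymmetry(inparalog1, inparalog2, ortholog)
print(id1, id2, gap)
```

In a scheme that prefers gene order over similarity, this asymmetry would only be a secondary signal: the copy at the conserved position would be called the "original" regardless of which copy is more similar.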