Orthology predictions for whole mammalian genomes
Leo Goodstadt
MRC Functional Genomics Unit
Oxford University

“Evolution of Orthologues”: Selection pressures in orthologues and paralogs
“Gene Duplications”: Reproduction, immunity or chemosensation
“Gene birth in the human lineage”: Ongoing duplications underlie polymorphism
“Synonymous substitution rates”: Mutation and selection vary by chromosome size
Orthology is the key

How it started
We are “consumers” of orthology / paralogy
Started off using Ensembl predictions
Ensembl 1:1 covered 50% of predicted mouse genes. Ewan’s manual survey said 80%

1) General observations for all mammalian genomes
Paralogues evolve fast (and are fun!)

2) Observations for whole clades of species
[Figure: lineage-specific dN/dS (y-axis, 0.00 to 0.14) by species (x-axis), grouped into Drosophila, Nematodes and Amniotes]

3) Inparalogues define lineage specific biology
Marsupial / Monodelphis biology revealed by lineage specific genes
• Chemosensation (OR, V1R and V2R)
• Reproduction (vomeronasal receptors, lipocalins, β-microseminoprotein (12:1))
• Immunity (IG chains, butyrophilins, leukocyte IG-like receptors, T-cell receptor chains and carcinoembryonic antigen-related cell adhesion molecules), pancreatic RNAses
• Detoxification (hypoxanthine phosphoribosyltransferase homologues; nitrogen poor diets)
• KRAB ZnFingers

4) Interesting stories in the aggregate

5) Treasure trove in the details
Ongoing mouse inparalogues analysis: Lots and lots of reproductive genes
clade: #2 (ortholog_id = 17117 in panda)
159 mus genes
47 genes new to assembly 36
10 genes completely new to assembly 36
Interpro matches for this clade:
!!! Expansion mainly on chr5 and 14, although single (pseudogene?) versions on chr13 and chr16. !!!
Mouse DLG5 is: chr14:22,966,420-22,978,653 (expressed in testis: AK147699)

gene            identifier          order  chrm  exons  stop  length  description
--------------  ------------------  -----  ----  -----  ----  ------  -----------
MUS_GENE_21705  ENSMUSP00000086007  6639   5     4            182     spermatogenesis associated glutamate (E)-rich protein 1, pseudogene 1
MUS_GENE_22420  ENSMUSP00000099126  6643   5     2            72      predicted gene, EG623898
MUS_GENE_19599  NCBIMUSP_83776567   6646   5     4            157     spermatogenesis associated glutamate (E)-rich protein 1, pseudogene 1 (Speer1-ps1) on chromosome 5
MUS_GENE_23688  ENSMUSP00000094421  6651   5     2            72      predicted gene, EG623898
MUS_GENE_19774                      6657   5                          spermatogenesis associated glutamate (E)-rich protein 3

6) Candidates for evolutionary and functional analyses
Secretoglobin Protein Family members: Androgen-binding proteins. Emes et al. (2004) Genome Res.

Available Genomes And Divergences
Hedges, SB Nature Reviews Genetics 3, 838-849 (2002)

How do we find function in the genome?
• Nothing in Biology Makes Sense Except in the Light of Evolution. Theodosius Dobzhansky (1900-1975).

How to find the function in the genome?
Similar Sequences (Genes / Genome regions)
Common Ancestry (homology)
Similar Structures / Folds
Similar Functions ?

How much of the genome is functional?
Compare with the mouse
[Figure: distributions of conservation scores for the Whole Genome and for Ancestral Repeats (ARs)]
• Ancestral Repetitive (AR) sequence is non-functional and has evenly distributed conservation scores (red) (symmetrical, bell shaped due to biological variation)
• Whole Genome sequence contains some functional sequence under selection and thus has a small excess of conserved sequence under purifying selection (asymmetrical)
Functional sequence = Whole Genome - Ancestral Repetitive = 5%
N.B. This is an estimate that doesn’t take into account sequence
• Turning over rapidly (not shared by mouse/human)
• Under positive (diversifying) selection

The human genome (euchromatic sequence)
Protein coding: 1.2%
UTR: 0.3%
Conserved non-coding (3.5% ?)
Neutral:
– Repeats (Transposable elements, …) ~45%
– Unknown (old repetitive junk?)
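
The subtraction "Functional sequence = Whole Genome - Ancestral Repetitive" from the conservation-score slide can be sketched in a few lines. This is only an illustration, not the method used for the 5% figure: the function name `functional_fraction`, the histogram inputs and the toy counts are all hypothetical, and the sketch assumes that bins at or below the AR mode contain neutral sequence only, so that the neutral share of the genome can be read off the left half of the distribution.

```python
def functional_fraction(genome_counts, ar_counts):
    """Estimate the share of the genome in excess of the neutral (AR) shape.

    Both arguments are histograms of conservation scores over the same bins,
    ordered from least to most conserved.
    """
    if len(genome_counts) != len(ar_counts):
        raise ValueError("histograms must share the same bins")
    genome_total = sum(genome_counts)
    ar_total = sum(ar_counts)
    genome = [c / genome_total for c in genome_counts]
    ar = [c / ar_total for c in ar_counts]
    # Bin where the neutral (AR) distribution peaks.
    mode = max(range(len(ar)), key=ar.__getitem__)
    # Assumption: at or below the AR mode the genome is purely neutral,
    # so the ratio of the two left-hand masses is the neutral share.
    neutral_share = sum(genome[: mode + 1]) / sum(ar[: mode + 1])
    return 1.0 - neutral_share


# Toy histograms (hypothetical counts): a symmetrical AR distribution and a
# genome-wide distribution with a small excess in the most conserved bins.
ar_hist = [100, 200, 400, 200, 100, 0]
genome_hist = [95, 190, 380, 190, 120, 25]
print(round(functional_fraction(genome_hist, ar_hist), 3))
```

As the N.B. on the slide points out, an estimate of this kind misses sequence that is turning over rapidly or is under positive selection.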
Conserved non-coding material
• Transcription factor binding sites
• Enhancers, insulators and other non-transcribed regulatory elements
• Alternative splicing signals
• Transfer RNAs, ribosomal RNAs
• Small RNAs (e.g. snoRNAs, microRNAs, siRNAs and piRNAs): regulatory / gene silencing / RNA degradation
• MacroRNAs (e.g. Xist): enzymatic? / chromosome inactivation

Functional parts of genes are highly conserved

How many protein coding genes?
• Walter Gilbert [1980s] 100k
• Antequera & Bird [1993] 70-80k
• John Quackenbush et al. (TIGR) [2000] 120k
• Ewing & Green [2000] 30k
• Tetraodon analysis [2001] 35k
• Human Genome Project (public) [2001] ~31k
• Human Genome Project (Celera) [2001] 24-40k
• Mouse Genome Project (public) [2002] 25k-30k
• Lee Rowen [2003] 25,947
• Human Genome Project (finishing) [2004] 20-25k
• Current predictions [2008] 19-20k

Traditional Genome Orthology
Reciprocal BLAST best hits between longest transcript of each gene (+ synteny)
Assumes:
• Protein similarity is proportional to evolutionary distance (selection is invariant!)
• Pairwise relationships adequately represent the evolutionary tree
• No gene losses or missing predictions
• Alternative splicing can be ignored!
• No gene translocations after tandem duplication

Orthology prediction methods
• Two genomes
– Reciprocal best blast hit
• Multiple genomes
– Clustering of reciprocal best hits
– Clustering of protein similarities
[Figure: Query; Blast hits]

Reciprocal Blast Best Hits
Advantages:
• Fast, Well understood
• Works well for distant lineages
• Can correlate with protein structure (domains)
Disadvantages:
• Only provides 1:1 orthologues in the best case
• Can be difficult to reconcile with the species tree

Reciprocal Blast Best Hits ?
[Figure: Genes on chromosome of species 1; Genes on chromosome of species 2]

How to add duplicated genes?
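
Before turning to duplicated genes, the reciprocal best hit rule from the preceding slides can be sketched as follows. This is a minimal illustration under stated assumptions: the score dictionaries stand in for parsed all-against-all BLAST output, the gene names and scores are made up, and the helper names `best_hits` and `reciprocal_best_hits` are hypothetical. Ties are broken by whichever hit is seen first, which a real pipeline would need to handle more carefully.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    `scores` is a dict {(query, subject): score}.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs that are each other's best hit."""
    forward = best_hits(a_vs_b)
    backward = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if backward.get(b) == a)


# Toy scores (hypothetical): M2a and M2b are duplicates in species B.
a_vs_b = {("H1", "M1"): 500, ("H1", "M2a"): 80,
          ("H2", "M2a"): 400, ("H2", "M2b"): 390}
b_vs_a = {("M1", "H1"): 500, ("M2a", "H2"): 400, ("M2b", "H2"): 390}
print(reciprocal_best_hits(a_vs_b, b_vs_a))
```

In the toy data the duplicate M2b has H2 as its best hit but is not H2's best hit, so it drops out. This is the "only provides 1:1 orthologues in the best case" limitation listed above.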
synteny
Ensembl compara in the past
• Local gene order tends to be conserved in mammalian lineages
• Look for inparalogs locally even if the protein distances don’t add up (sequence error, sampling error etc.)

Blast Best Hits in Local Regions ?

Problems with relying only on synteny
Local homologs are often not inparalogs:
• Local rearrangements
• Missing predictions (neighbouring orphans)
• Need sanity checking
Human and Mouse chromosomes:
• Extensive rearrangements only over larger regions
• Conservation of gene order in the short range

Olfactory Orthology from compara
[Figure: Mouse chromosome 2 against Rat chromosome 3; relationships classed as One to one, One to many, Many to one, Many to many]

Olfactory Orthology
[Figure: Mouse chromosome 2 against Rat chromosome 3; relationships classed as One to one, One to many, Many to one, Many to many]

Inparanoid
• Remm, M., Storm, C.E. and Sonnhammer, E.L.L. (2001) Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. J. Mol. Biol. 314, 1041–1052.
• Avoids multiple alignments and phylogenetic methods for speed and to avoid errors
• Heuristics are implicitly phylogenetic

How Inparanoid works
Start from the longest transcripts:
1. Pairwise alignment scores
2. Use cutoff; Reciprocal Best Hits are orthologues
3. Add lineage specific duplicates (inparalogs) with confidences
4. Resolve conflicts
5. Orthology
Identify “main” orthologues, then identify “inparalog” candidates.

Confidence values for inparalogs
[Figure: species A and B]
1. Most confident inparalog is when the inparalog is sequence identical to main orthologue.
2. Maximum value = score_identical – score_orthologs
3. Confidence = (score_inparalog – score_orthologs) / (score_identical – score_orthologs)

Resolving conflicts
1. Merge if orthologs already clustered in same group
2. Merge if two equally good best hits
3. Delete weaker group
4. Merge significantly overlapping
5. Divide overlapping

Why are there conflicts?
• Protein differences are a proxy for evolutionary time
• Protein similarity scores approximate protein differences (sequence, alignment, estimation errors)
• Pairwise scores can be used to (conceptually) recover phylogenetic (tree) data

Alternatives: phylogenetic methods
• Inparanoid is great because it models phylogeny implicitly
• Why not use phylogenetic methods directly?
• Multiple estimators of protein distance
[Figure: 4 pairwise scores used out of 30]

Phylogenetic methods
• Iterative distance methods are very fast, suitable for whole genome analyses (variants on neighbor joining)
• Statistically consistent with evolutionary models (can have explicit error model with evolutionary distances, e.g. BIONJ)
• Inparanoid type consistency checking can be carried out after phylogeny is predicted

Is protein similarity a good proxy for evolutionary distance?
Advantages
• Does not saturate over long evolutionary distances
• Easy to align / predict genes (unlike non-coding regions)
• Sometimes cDNA sequence is not available
Disadvantages
• Assumes constant evolutionary rate
• Assumes invariant selection

Use Silent Mutations as a genetic clock
• Redundant genetic code, e.g. GCA, GCC, GCG, GCT → Alanine
• Third base of a codon “wobbles” without changing the translated amino acid
• dS approximates neutral mutation rate (without selection) in coding regions

dS as proxy for evolutionary distance
• Easier to align than Ancestral Repeats
• Not neutral sequence!!
• Genomic > 2x variation in dS
• Assumes most gene families are local due to tandem duplication and share dS
• Assumes (partial) gene conversions are infrequent

dS Caveats
• Saturates at long evolutionary distances (but less so than many think)
• Beware of GC / codon frequency biases (use ML rather than heuristic methods)
• Multiple alignment / tree rather than pairwise for best results
• Slow to estimate accurately
• Missing values (where dS saturates)

[Figure: codeml dS accuracy at 400 codons]
[Figure: yn00 dS accuracy at 400 codons]

Use all transcripts

PhyOP: transcript trees from dS
1. Whole genome alignment identifies homologues
2. codeml for dS calculation
3. Ignore large dS
4. Hierarchical cluster
5. Fitch Margoliash modified to handle missing values to give giant transcript tree
6. Heuristics based on lowest dS to select 1 “representative” transcript per gene
7. Map Gene tree to species tree

Fitch Margoliash
Minimize Σ_{i<j} (d_ij − p_ij)² / d_ij²
Where
• d_ij is the pairwise distance estimate
• p_ij is the distance between i and j on the tree
Assumes that the error is a fixed proportion of the total distance (Fitch and Margoliash, 1967)
Easily adapted for missing values

PhyOP pipeline Part 1

3 ways in which transcript trees map to genes
• Simple clades: only 1 transcript per gene in orthologous relationship (most genes)
• Unambiguous clades: alternative transcripts are in the same orthologous relationships
• Ambiguous clades: alternative transcripts are in inconsistent relationships (small proportion)

Where are most transcripts?
Assumption: Most transcripts are not in any sort of orthologous relationships: their conjugates have not been predicted.
Reality: Most transcripts are in the same clade as their alternative transcripts. Because of shared exons, they are most similar to their alternatively transcribed siblings.

How to choose between alternative transcripts?
• Use conserved exon boundaries: excludes exogenous sequence
• Use distance to its ortholog (not tree distance because these will be equal): high dS means exogenous sequence and will be excluded
With multiple partially overlapping clades, this is more difficult

PhyOP pipeline Part 2

Example
Four alternative transcripts (1-4), 6 dog genes, 3 human genes
• Clade 1 transcripts: Doga1 Dogb1 Dogc1 Dogf1 Humanb1 Humanc1
• Clade 2 transcripts: Dogb2 Dogc2 Doge2 Dogf2 Humana2 Humanc2
• Clade 3 transcripts: Doga3 Dogb3 Dogd3 Doge3 Dogf3 Humana3 Humanb3 Humanc3

“Anointing” transcripts to keep: Example
Circularity / boot-strapping problem: The transcript in the other species which is used for “anointing” might itself be discarded
• Doga1 is closer to Humanb1 than Doga3 to any human transcript: Keep Doga1, discard Doga3
• Humanb3 is closer to Doge3 than Humanb1 to any dog transcript: Keep Humanb3, discard Humanb1
• Oops. Now no Human transcript is close to Doga1.

How to avoid circularity
• Previously: Use mean distance to all other transcripts in the other species. Close eyes. Hope problem goes away.
• Now:
1. Take all transcript pairs from all three clades, starting with the closest dS
2. “Anoint” both transcripts from the pair and throw away all other transcripts
3. Ignore all pairs which involve discarded transcripts
4. Recurse
5. Complicated by trying to keep merged genes

From transcripts to genes

dS for orthologues
dS distributions can be an indication of orthologue quality
Dog vs. Human Genomes

Conservation of Gene Order in Mouse / Rat ORs
[Figure: Rat OR gene order (y-axis, 0 to 1600) against Mouse OR gene order (x-axis, 0 to 1200)]

How to improve on using dS?
• dS better dates the history, but fails for distant homologs.
• dN works for distant homologs, but tends to be subjected to selective pressures.
Can we combine them?
• Full codon evolutionary model would account for this automatically
• Use bootstrapping: if values -> random, no longer informative

TreeBeST
Tree Building guided by Species Tree
http://treesoft.sourceforge.net/treebest.shtml
Heng Li
• Tree merge algorithm: merge several trees that are built from the same alignment with different models.
• Species-aware maximum likelihood: use species phylogeny to correct errors

Maximize use of underlying data
5 tree types:
1. Synonymous distance NJ
2. Non-Synonymous distance NJ
3. P distance NJ
4. WAG maximum likelihood
5. HKY maximum likelihood
Each predicted from same data
Use bootstrap values to identify optimal branches using context free grammar

Context Free Grammar in TreeBeST
Given a set of binary rooted trees with the same leaf set V, reconstruct a binary rooted tree such that:
• each branch of the resultant tree comes from one of the given trees
• the resultant tree minimizes a certain objective function
• additivity
• topological independence

Maximize use of underlying data
• Switch automatically, depending on bootstrap, between
– codon: dN, dS;
– nucleotide: HKY and
– protein: P-distance
• Fix high probability errors by minimizing distance to species topology
Slide from Heng Li

Trees reconciled optimally
Slide from Heng Li

Is TreeBeST more reliable?
Slide from Heng Li

Caveats
• Bootstrapping may not be the most effective way to test the support for a particular tree given the underlying data
• The underlying data are not the state of the art but cannot use codon + ML for speed
• Limited by multiple alignment
• Reconciliation with species tree can mask real gene losses/duplications

Alternative transcripts reveal merged genes
• Ensembl includes merged genes: 435 dog, 346 human

Finding merged genes

What is the best way to deal with alternative transcripts?
• Create virtual transcript
Virtual translation

What is the best way to deal with alternative transcripts?
If two transcripts do not overlap and have homology to each other, they may be tandemly duplicated gene models merged in error
Include both transcripts in pipeline

How to run orthology pipeline for whole genomes
• Take all proteins and cDNA
• Make sure they correspond exactly: no stop codons, no genomic mismatches
• All vs all blastall
• Protein-guided alignments of cDNA
• Create virtual translation peptide
• Run tree prediction, e.g. TreeBeST
• Reconcile with species tree to derive orthology

Predicting orthology gets easier with more genes/species
1. Phylogenetic methods improve in power with more data
2. Heuristic / pairwise methods decrease in power / become more ambiguous with more data

Why is orthology prediction so hard for mammals?
Because gene prediction is so hard

The human genome (euchromatic sequence)
Protein coding: 1.2%
UTR: 0.3%
Conserved non-coding (3.5% ?)
Neutral:
– Repeats (Transposable elements, …) ~45%
– Unknown (old repetitive junk?)

Signals in DNA are weak
• non-canonical splice sites
• promoters without TATA box
• introns/exons can have varying lengths
• ...
→ probabilistic models: Hidden Markov Models

Accuracy of ab-initio gene prediction
• Nucleotide level:
– 90% sensitivity / 90% selectivity
• Exon level:
– 70% sensitivity / 50% selectivity
• Gene level:
– 40% sensitivity / 30% selectivity
• False positives: difficult to refute
• False negatives: will be missed

Limitations of ab-initio models
• Limited to training set
• Limited to model (strange genes)
• Problems with long genes
• Small exons are difficult to find
• Terminal exons are difficult to find
– No splice signals, other signals variable
• e.g. Genscan

Comparative/homology methods
• Add extra data to locate genes
• Compare genome to known sequences (same or different organism)
– cDNAs
– ESTs
– Known protein sequences
– e.g. Genewise
• Compare genome to other genome
– e.g. TwinScan

Using cDNAs/ESTs
• cDNAs
– Provide 3'UTR and 5'UTR
– Provide full gene structure
– Expensive and thus rare
– Contamination with genomic DNA
• ESTs
– Cheap and thus plentiful
– Highly redundant
– Of variable quality
– Not complete
• Both: biased towards highly expressed genes

Using cDNAs
[Figure: gene structure (5'UTR, Exon, Intron, Exon, Intron, Exon, 3'UTR) aligned to the cDNA sequence]
• Alignment between DNA sequences
– Introns and reading frames

Using known protein sequences
[Figure: gene structure (5'UTR, Exon, Intron, Exon, Intron, Exon, 3'UTR); alignment between a “cDNA” and a genome (implicit cDNA sequence, predicted protein sequence); alignment between two protein sequences (known protein sequence)]

Using another genome sequence
• BLASTN results against Genome 2: add evidence to ab-initio model, e.g. TwinScan
• Align gene models between orthologous regions, e.g. DoubleScan
[Figure: gene structures (5'UTR, Exon, Intron, Exon, Intron, Exon, 3'UTR) in Genome 1 and Genome 2]

Sweet spot for prediction by homology
[Figure: Sensitivity and Specificity of Ab-Initio and Homology methods against similarity of known protein to target. Guigo et al. (2000)]

Rate analyses
Branches of gene trees scale symmetrically
• Variations in branch length
[Figure: Ideal world and Real world panels; cumulative frequency (0.0 to 1.0) against synonymous substitution rate / dS, median distance to root (0.0 to 2.0), for dana, dere, dmel, dsec, dsim, dyak]

What orthologous genes should look like
Sequence conservation between mouse and human genes
Mouse genome paper Nature 420, 520-562

What orthologous genes should look like
• Exons conserved between genomes
• UTRs partially conserved between genomes
CGSC (2004)

Gene validations using orthology
• Most genes have orthologues
• Almost all genes have mammalian homologs
• Exaptation of non-coding sequence is rare, especially for constitutively expressed exons
• Conservation of exon-intron structure (number and phase of exons)
• Conservation of length
• Conservation of domains
• Conservation of synteny

Look carefully at genes
• For example: small introns
[Figure: Introns; Pseudogene?]

Conservation of splice sites:
• Insertions / losses of introns are rare
• Phase never changes
• Aligned positions should nearly always match, allowing for alignment errors
• Valid mismatches may represent insertions (outside of protein domains)
• Find retrogenes

Conservation of splice sites:
• Tandem duplication of non-coding sequence may result in the appearance of splice site conservation
• Check if sequence similarity is absolute
• Check coding potential (tandem duplicates are often fast evolving genes under positive selection)

Retrogenes
• Loss of introns due to retrotransposition can be confirmed by loss of synteny (blastz)
• Not all retrogenes are non-functional
• Ancient ones are functional
• Recent retrogenes can be assumed to be dead

Gene validations using orthology
Make sure orthology properties look appropriate

Homo / Monodelphis 1:1 orthologues
dN/dS                             0.086
dS                                1.02
Amino acid sequence identity      81.0%
Pairwise alignment coverage       94.2%

                                  Homo sapiens   Monodelphis domestica
Number of exons                   9              9
Sequence length (codons)          471            445
Unspliced transcript length (bp)  27,241         25,365
G+C content at 4D sites           56.9%          48.7%

What can you do with orthologs?
Wait for part II
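
The orthology-based validation checks listed in the slides above (conservation of exon number and of length, and intron loss as a hint of retrotransposition) can be sketched as simple flags on a predicted orthologue pair. Everything here is an assumption for illustration: the `GeneModel` record, the `validation_flags` helper and the 1.25 length-ratio threshold are hypothetical, and the exon and codon figures are borrowed from the Homo / Monodelphis table purely as example inputs.

```python
from dataclasses import dataclass


@dataclass
class GeneModel:
    name: str
    exons: int
    codons: int


def validation_flags(a, b, max_length_ratio=1.25):
    """Return a list of reasons to look more carefully at an orthologue pair."""
    flags = []
    if a.exons != b.exons:
        flags.append("exon count differs")
    # An intronless copy paired with a multi-exon gene may be a retrogene.
    if min(a.exons, b.exons) == 1 and max(a.exons, b.exons) > 1:
        flags.append("possible retrogene")
    longer, shorter = max(a.codons, b.codons), min(a.codons, b.codons)
    # The threshold is an arbitrary illustrative cutoff, not a published value.
    if shorter == 0 or longer / shorter > max_length_ratio:
        flags.append("length differs")
    return flags


human = GeneModel("human_gene", exons=9, codons=471)
opossum = GeneModel("opossum_gene", exons=9, codons=445)
intronless = GeneModel("opossum_copy", exons=1, codons=300)

print(validation_flags(human, opossum))
print(validation_flags(human, intronless))
```

Flags of this kind only mark candidates for inspection. As the slides note, tandem duplicates and ancient retrogenes can be real, functional genes despite failing such checks.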
