Large-scale genome comparisons
Bioinformatics aspects (Introduction)

Fredj Tekaia
Institut Pasteur
EMBO Bioinformatic and Comparative Genome Analysis Course
Stazione Zoologica Anton Dohrn, Naples, Italy
May 7 - 19, 2012

Large-scale genome comparisons
Comparing a genome (in terms of whole sequence, whole set of predicted genes or whole set of predicted proteins) to itself (intra-species comparisons) or to another genome (inter-species comparisons).

Large-scale genome comparisons
- Duplication;
- Conservation;
- Specificity (species-specific genes, proteins);
- Paralogues, orthologues;
- Families (clusters) of paralogues, of orthologues;
- Genome organisation (duplicated, conserved genes);
- Search for shared motifs in proteins of the same cluster;
- Protein conservation profiles;
- Selection pressure analyses (synonymous, non-synonymous substitutions, ...), ...

Evolution: Speciation - Duplication
[Figure: gene tree along a time axis. An ancestral gene G undergoes duplication (G1, G2), then speciation (species A and B), then a further duplication; the resulting genes are labelled A-G1, A-G2, B-G1, B-G21 and B-G22, with orthologs, inparalogs and outparalogs marked.]
• Speciation
• Duplication
• Inparalogs
• Orthologs
• Outparalogs
• Loss of genes
Predict these events by comparing genomes?

Orthologs / Paralogs
• How to detect orthologous genes?
  - easy way: reciprocal best hit (RBH)
[Figure: genes 1a, 2.1a, 2.2a, 3a in Organism A and genes 1b, 2.1b, 2.2b, 3b in Organism B.]
• Large-scale comparative analysis of predicted proteomes revealed significant evolutionary processes: Expansion, Exchange and Deletion.

Evolutionary processes include
[Figure labels: Ancestor; Expansion*; Phylogeny*; genesis; duplication; HGT; species genome; Exchange*; selection*; HGT; loss; Deletion*.]

S. cerevisiae genome
Colours reveal duplications (Kellis et al. Nature, 2004).
[Figure: Duplication, Speciation, Deletion; actual content of the 2 copies; reconstruction of the ancestral organization. Kellis et al. Nature, 2004.]
[Figure: Original version / Actual version. Hurles M (2004) Gene Duplication: The Genomic Trade in Spare Parts. PLoS Biol 2(7): e206.]

Genome duplication
[Figure caption] a, Distribution of Ks values of duplicated genes in Tetraodon (left) and Takifugu (right) genomes. Duplicated genes broadly belong to two categories, depending on whether their Ks value is below or above 0.35 substitutions per site since the divergence between the two puffer fish (arrows). b, Global distribution of ancient duplicated genes (Ks > 0.35) in the Tetraodon genome. The 21 Tetraodon chromosomes are represented in a circle in numerical order and each line joins duplicated genes at their respective position on a given pair of chromosomes. Jaillon et al. Nature 431, 946-957. 2004.

Search for similarity
Methods:
• It is important to know how the algorithms that allow sequence comparisons work;
• There are many comparison methods;
• Among the most used:
  • BLAST
  • FASTA
  • Smith-Waterman algorithm (dynamic programming method)
  • HMM (Hidden Markov Model)

Sequence comparisons
  V I T K L G T C V G S
  V I S . . . T Q V G S

  V I T K L G T C V G S
  V . S K . G T Q V . S
• Identity
• Similarity
• Homology

Comparison of 2 sequences
• Aims at finding the optimal alignment: the one that best shows which regions are most similar and which are less similar.
• In describing sequence comparisons, three different terms are commonly used: Identity, Similarity and Homology.
Need for a score that evaluates:
- matches
- mismatches
- gaps
and a method that evaluates the numerous possible alignments.

Homology
• Sequence homology reflects common ancestry and underlies sequence conservation;
• Homology can be inferred, under suitable conditions, from sequence similarity;
• The main objective of sequence similarity searches is to infer homology between sequences;
• Homology is not a measure. It is an all-or-none relationship (i.e. homology exists or does not exist; expressions like "significant homology" or "weak homology" are meaningless!).
Sequence similarity is a measure of the matching characters in an alignment, whereas homology is a statement of common evolutionary origin.
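The need stated above, a score for matches, mismatches and gaps plus a method that evaluates the many possible alignments, can be illustrated with a minimal Smith-Waterman sketch. The function name and the scoring values (match = 2, mismatch = -1, linear gap = -2) are assumptions chosen for illustration only; real protein comparisons would use a substitution matrix and usually affine gap penalties.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between strings a and b.

    Illustrative scoring (assumed, not from the slides): identical
    characters score `match`, different characters `mismatch`, and each
    gap position costs `gap` (linear gap penalty).
    """
    prev = [0] * (len(b) + 1)   # previous row of the dynamic programming table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                  # start a new local alignment here
                prev[j - 1] + pair, # align a[i-1] with b[j-1]
                prev[j] + gap,      # gap in b
                curr[j - 1] + gap,  # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


# The two sequences shown on the "Sequence comparisons" slide.
score = smith_waterman_score("VITKLGTCVGS", "VISTQVGS")
print(score)
```

Because every cell is floored at zero, the method reports the best-scoring pair of subsequences (a local alignment) rather than forcing the two sequences to align end to end.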
Local alignment / Global alignment
[Figure: sequences A and B in a local alignment; sequences A and B in a global alignment.]

Compare one query sequence to a BLAST-formatted database

Amino acid scoring schemes (substitution matrices)
• All algorithms comparing protein sequences rely on some scheme to score the equivalence of each of the 210 possible pairs of amino acids. As a result, what a local alignment program produces depends strongly upon the scores it uses;
• Implicitly, a scheme may represent a particular theory of evolution;
• The choice of a matrix can strongly influence the outcome of an analysis;
• The scores in the matrix are integer values which assign a positive score to identical or similar character pairs, and a negative value to dissimilar character pairs.

$S_{ij} = \ln\big(q_{ij} / (p_i p_j)\big) / u$

where the $q_{ij}$ are target frequencies for aligned pairs of amino acids, $p_i$ and $p_j$ are background frequencies, and $u$ is a statistical parameter.

BLOSUM62 Clustered Scoring Matrix in 1/2 Bit Units
# Cluster Percentage: >= 62
# Lowest score = -4, Highest score = 11

    A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  Z  X  *
A   4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0 -2 -1  0 -4
R  -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3 -1  0 -1 -4
N  -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3  3  0 -1 -4
D  -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3  4  1 -1 -4
C   0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -3 -2 -4
Q  -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2  0  3 -1 -4
E  -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4
G   0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3 -1 -2 -1 -4
H  -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3  0  0 -1 -4
I  -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3 -3 -3 -1 -4
L  -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1 -4 -3 -1 -4
K  -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2  0  1 -1 -4
M  -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1 -3 -1 -1 -4
F  -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1 -3 -3 -1 -4
P  -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2 -2 -1 -2 -4
S   1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2  0  0  0 -4
T   0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0 -1 -1  0 -4
W  -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3 -4 -3 -2 -4
Y  -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1 -3 -2 -1 -4
V   0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4 -3 -2 -1 -4
B  -2 -1  3  4 -3  0  1 -1  0 -3 -4  0 -3 -3 -2  0 -1 -4 -3 -3  4  1 -1 -4
Z  -1  0  0  1 -3  3  4 -2  0 -3 -3  1 -1 -3 -1  0 -1 -3 -2 -2  1  4 -1 -4
X   0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1 -1 -1 -4
*  -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4  1

• BLOSUM matrices (Henikoff, S., and Henikoff, J. G. (1992))
BlosumX denotes a matrix obtained from alignments of sequence segments in which segments sharing at least X% identity are clustered together.
Examples:
- Blosum62 is obtained by clustering segments with at least 62% identity.
- Blosum80 is obtained by clustering segments with at least 80% identity.

Which substitution matrix to choose?
  Blosum80   PAM10    Less divergent
  Blosum62   PAM120   <------ searching ------>
  Blosum45   PAM250   More divergent

BLAST (Basic Local Alignment Search Tool)
Nucleotide BLAST
• Nucleotide query - nucleotide database [blastn]
Protein BLAST
• Protein query - protein database [blastp]
• PSI-BLAST: Position-Specific Iterative BLAST
Translated BLAST searches
• Nucleotide query - Protein db [blastx]
• Protein query - Translated db [tblastn]
• Nucleotide query - Translated db [tblastx]
Search for conserved domains
• Search the Conserved Domain Database [RPS-BLAST]
Pairwise BLAST
• BLAST 2 Sequences

BLAST algorithm:
(1) Query sequence: list of high-scoring words of length w.
    A query sequence of length L yields a maximum of L-w+1 words (typically w = 3 for proteins, w = 11 for nucleotides).
    List the words that score at least T using a substitution matrix (Blosum62, PAM250, ...).
(2) Compare the word list to the database and identify exact matches.
    [Figure: word list scanned against the DB sequences.]
    Extract matches of words from the word list.
(3) For each word match, extend the alignment in both directions to find alignments with scores > S.
    Maximal Segment Pairs (MSPs): HSPs

BLASTP 2.2.1 [Apr-13-2001]
............................
Query= YAL005c SSA1 heat shock protein of HSP70 family, cytosolic
       (642 letters)
Database: S. cerevisiae proteome version 22/05/2002
          5829 sequences; 2,798,770 total letters
................................................
                                                          Score     E
Sequences producing significant alignments:               (bits)  Value

YAL005c SSA1 heat shock protein of HSP70 family, cyt...     674   0.0
YLL024c SSA2 heat shock protein of HSP70 family, cyt...     663   0.0
YER103w SSA4 heat shock protein of HSP70 family, cyt...     589   e-169
YBL075c SSA3 heat shock protein of HSP70 family, cyt...     588   e-169
YJL034w KAR2 nuclear fusion protein                         480   e-136
YDL229w SSB1 heat shock protein of HSP70 family             428   e-120
YNL209w SSB2 heat shock protein of HSP70 family, cyt...     427   e-120
YJR045c SSC1 mitochondrial heat shock protein 70-rel...     336   5e-93
YEL030w heat shock protein of HSP70 family                  324   2e-89
YLR369w SSQ1 mitochondrial heat shock protein 70            296   4e-81
YBR169c SSE2 heat shock protein of the HSP70 family         173   7e-44
YPL106c SSE1 heat shock protein of HSP70 family             172   1e-43
YHR064c regulator protein involved in pleiotro...           143   6e-35
YKL073w LHS1 chaperone of the ER lumen                      100   4e-22
YLR135w subunit of SLX1P/Ybr228p-SLX4P complex...            33   0.13
...................

>YLL024c SSA2 P14.1.f13.1 heat shock protein of HSP70 family, cytosolic
 Length = 639
 Score = 663 bits (2508), Expect = 0.0
 Identities = 558/607 (91%), Positives = 570/607 (92%)

Query: 1   MSKAVGIDLGTTYSCVAHFANDRVDIIANDQGNRTTPSFVAFTDTERLIGDAAKNQAAMN 60
           MSKAVGIDLGTTYSCVAHF+NDRVDIIANDQGNRTTPSFV+FTDTERLIGDAAKNQAAMN
Sbjct: 1   MSKAVGIDLGTTYSCVAHFSNDRVDIIANDQGNRTTPSFVGFTDTERLIGDAAKNQAAMN 60
..........................................................................
Query: 601 IMSKLYQ 607
           IMSKLYQ
Sbjct: 601 IMSKLYQ 607

>YER103w SSA4 P14.1.f13.1 heat shock protein of HSP70 family, cytosolic
 Length = 642
 Score = 589 bits (2224), Expect = e-169
 Identities = 473/609 (77%), Positives = 539/609 (87%), Gaps = 3/609 (0%)

Query: 1   MSKAVGIDLGTTYSCVAHFANDRVDIIANDQGNRTTPSFVAFTDTERLIGDAAKNQAAMN 60
           MSKAVGIDLGTTYSCVAHFANDRV+IIANDQGNRTTPS+VAFTDTERLIGDAAKNQAAMN
Sbjct: 1   MSKAVGIDLGTTYSCVAHFANDRVEIIANDQGNRTTPSYVAFTDTERLIGDAAKNQAAMN 60
....................................................................
Query: 598 ANPIMSKLY 606
           ANPIMSK+Y
Sbjct: 601 ANPIMSKFY 609

>YBL075c SSA3 P14.1.f13.1 heat shock protein of HSP70 family, cytosolic
 Length = 649
 Score = 588 bits (2220), Expect = e-169
 Identities = 467/609 (76%), Positives = 539/609 (87%), Gaps = 3/609 (0%)

Query: 1   MSKAVGIDLGTTYSCVAHFANDRVDIIANDQGNRTTPSFVAFTDTERLIGDAAKNQAAMN 60
           MS+AVGIDLGTTYSCVAHF+NDRV+IIANDQGNRTTPS+VAFTDTERLIGDAAKNQAA+N
Sbjct: 1   MSRAVGIDLGTTYSCVAHFSNDRVEIIANDQGNRTTPSYVAFTDTERLIGDAAKNQAAIN 60
........................................
Query: 598 ANPIMSKLY 606
           ANPIM+K+Y
Sbjct: 601 ANPIMTKFY 609

>YJL034w KAR2 682 P14.1.f13.1 nuclear fusion protein
 Length = ...........................................

Large-scale proteome comparisons
Systematic comparisons
[Figure: workflow. A new proteome (NG) is compared to each reference proteome GS1, GS2, ..., GSn (Comparenewg2eachg, ng list), and each reference proteome is compared to NG (Compareeachg2newg, ng list), using blastp, blosum62 and the SEG filter. Each comparison yields a best-hit table and an all-hits table: bestnggs1 / allnggs1 ... bestnggsn / allnggsn for NG against each GSi, and bestgs1ng / allgs1ng ... bestgsnng / allgsnng for each GSi against NG. Table columns: NG, size, GSij, blastp, HS/IS/NS.]
- fast determination of significant matches;
- multiple matches;
- orthologs determination.

The expected number of HSPs with score at least S is given by:

$E = K m n e^{-\lambda S}$

where m and n are the query sequence and database lengths, and K and $\lambda$ are statistical parameters of the scoring system.
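As a worked illustration of the expectation formula above, the sketch below evaluates E for a raw score and converts the raw score into a bit score. The query and database lengths come from the BLAST report shown earlier (642 and 2,798,770 letters); the values of lambda and K are illustrative assumptions (of the order commonly quoted for gapped BLOSUM62 searches) and are not taken from the slides, so the output is not expected to reproduce the E-values in the report.

```python
import math


def expected_hsps(raw_score, m, n, lam, K):
    """Expected number of HSPs scoring at least raw_score: E = K*m*n*exp(-lam*S)."""
    return K * m * n * math.exp(-lam * raw_score)


def bit_score(raw_score, lam, K):
    """Normalised score in bits: S' = (lam*S - ln K) / ln 2, so that E = m*n*2**(-S')."""
    return (lam * raw_score - math.log(K)) / math.log(2)


# Lengths from the report above; lam and K are assumed, illustrative values.
m, n = 642, 2_798_770
lam, K = 0.267, 0.041

for raw in (50, 100, 200):
    print(raw, round(bit_score(raw, lam, K), 1), expected_hsps(raw, m, n, lam, K))
```

The bit score folds lambda and K into the score itself, which is why scores in bits can be compared across searches that use different scoring systems.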
Systematic Analysis of Completely Sequenced Organisms
• In silico species-specific comparisons;
• Degree of ancestral duplication and of ancestral conservation between pairs of species;
• Families of paralogs (Partition-mcl);
• Families of orthologs (Partition-mcl);
• Determination of the protein dictionary (orthologs);
• Determination of protein conservation profiles.

Working Examples
Comparing the S. cerevisiae (SC) genome with the C. elegans (CE) genome

SC vs SC

BLASTP 2.2.1 [Apr-13-2001]
............................
Query= YAL005c SSA1 heat shock protein of HSP70 family, cytosolic
       (642 letters)
Database: S. cerevisiae proteome version 22/05/2002
          5829 sequences; 2,798,770 total letters
................................................
                                                          Score     E
Sequences producing significant alignments:               (bits)  Value

YAL005c SSA1 heat shock protein of HSP70 family, cyt...     674   0.0
YLL024c SSA2 heat shock protein of HSP70 family, cyt...     663   0.0
YER103w SSA4 heat shock protein of HSP70 family, cyt...     589   e-169
YBL075c SSA3 heat shock protein of HSP70 family, cyt...     588   e-169
YJL034w KAR2 nuclear fusion protein                         480   e-136
YDL229w SSB1 heat shock protein of HSP70 family             428   e-120
YNL209w SSB2 heat shock protein of HSP70 family, cyt...     427   e-120
YJR045c SSC1 mitochondrial heat shock protein 70-rel...     336   5e-93
YEL030w heat shock protein of HSP70 family                  324   2e-89
YLR369w SSQ1 mitochondrial heat shock protein 70            296   4e-81
YBR169c SSE2 heat shock protein of the HSP70 family         173   7e-44
YPL106c SSE1 heat shock protein of HSP70 family             172   1e-43
YHR064c regulator protein involved in pleiotro...           143   6e-35
YKL073w LHS1 chaperone of the ER lumen                      100   4e-22
YLR135w subunit of SLX1P/Ybr228p-SLX4P complex...            33   0.13
...................

bestscsc (SC / SC)
  YAL002w  1176  -        NS
  YAL003w   206  -        NS
  YAL004w   215  -        NS
  YAL005c   642  YLL024c  HS  0.0
  YAL007c   215  YOR016c  HS  1e-44

allscsc (SC / SC)
  YAL002w  1176  -        NS
  YAL003w   206  -        NS
  YAL004w   215  -        NS
  YAL005c   642  YLL024c  HS  0.0
  YAL005c   642  YER103w  HS  0.0
  YAL005c   642  YBL075c  HS  0.0
  YAL005c   642  YJL034w  HS  e-147
  YAL005c   642  YDL229w  HS  e-130
  YAL005c   642  YNL209w  HS  e-130
  YAL005c   642  YJR045c  HS  e-100
  YAL005c   642  YEL030w  HS  2e-96
  YAL005c   642  YLR369w  HS  1e-87
  YAL005c   642  YBR169c  HS  2e-47
  YAL005c   642  YPL106c  HS  4e-47
  YAL005c   642  YHR064c  HS  7e-38
  YAL005c   642  YKL073w  HS  5e-24
  YAL007c   215  YOR016c  HS  1e-44
  YAL007c   215  YGL200c  IS  5e-05
  YAL007c   215  YHR110w  IS  0.017
  YAL007c   215  YDL018c  IS  0.021

- Paralogs
- multiple matches
- Partitions/clustering

Multiple matches of sc in sc
  ORF      matches in sc
  YAL005c  13
  YAL007c   1
  YDR214w   1
  YDR216w   2
  YDR399w   1
  YDR406w   9
  YDR409w   1
  YCR040w   1
  YKL218c   1
  YKL219w  14
  YKL220c   6
  YKL221w   2
  YKL222c   3
  YKL223w   5
  YKL224c  22
  YKR001c   2
  YKR003w   5
  YBR104w   6
  YBR105c   1
  YKR013w   2
  YKR014c  13
  ....................................
  Max: YDR477w 77

SC/CE and CE/SC

bestscce (SC / CE)
  YAL002w  1176  C42C1.4   HS  2e-15
  YAL003w   206  F54H12.6  HS  4e-22
  YAL004w   215  -         NS
  YAL005c   642  F26D10.3  HS  e-172
  YAL007c   215  F57B10.5  HS  9e-08
  YAL009w   259  F16D3.7   IS  0.013
  YAL019w  1131  M03C11.8  HS  7e-92
  YAL020c   333  F07C3.4   IS  7e-04
  YAL021c   837  ZC518.3   HS  5e-47

bestcesc (CE / SC)
  C42C1.4   1259  YAL002w  HS  8e-16
  F54H12.6   213  YAL003w  HS  4e-20
  F26D10.3   640  YER103w  HS  e-174
  F26D10.3   640  YER103w  HS  e-174
  F57B10.5   203  YAL007c  HS  7e-13
  F16D3.7    516  YHL003c  IS  9e-04
  M03C11.8  1038  YAL019w  HS  2e-87
  AC3.1      356  -        NS
  AC3.2      949  YLR189c  IS  0.038
  AC3.3      425  -        NS
  AC3.4      600  YNL326c  HS  1e-12

allscce (SC / CE)
  YAL002w  1176  C42C1.4   HS  2e-15
  YAL003w   206  F54H12.6  HS  4e-22
  YAL003w   206  Y41E3.10  HS  2e-17
  YAL004w   215  -         NS
  YAL005c   642  F26D10.3  HS  e-172
  YAL005c   642  F44E5.4   HS  e-153
  YAL005c   642  F44E5.5   HS  e-153
  YAL005c   642  C12C8.1   HS  e-152
  YAL005c   642  C15H9.6   HS  e-148
  YAL005c   642  F43E2.8   HS  e-144
  YAL005c   642  C37H5.8   HS  e-104
  YAL005c   642  F11F1.1   HS  1e-77
  YAL005c   642  F54C9.2   HS  4e-51
  YAL005c   642  K09C4.3   HS  4e-47
  YAL005c   642  T28F3.2   HS  2e-45
  YAL005c   642  C30C11.4  HS  7e-43
  YAL005c   642  T24H7.2   HS  2e-34
  YAL005c   642  T14G8.3   HS  8e-33

allcesc (CE / SC)
  C42C1.4   1259  YAL002w  HS  8e-16
  F54H12.6   213  YAL003w  HS  4e-20
  F26D10.3   640  YER103w  HS  e-174
  F26D10.3   640  YBL075c  HS  e-174
  F26D10.3   640  YLL024c  HS  e-172
  F26D10.3   640  YAL005c  HS  e-171
  F26D10.3   640  YJL034w  HS  e-141
  F26D10.3   640  YDL229w  HS  e-129
  F26D10.3   640  YNL209w  HS  e-129
  F26D10.3   640  YJR045c  HS  e-100
  F26D10.3   640  YEL030w  HS  2e-97
  F26D10.3   640  YLR369w  HS  1e-83
  F26D10.3   640  YPL106c  HS  2e-45
  F26D10.3   640  YBR169c  HS  5e-45
  F26D10.3   640  YHR064c  HS  8e-36
  F26D10.3   640  YKL073w  HS  3e-22

Reciprocal Best Hits (RBH)

segmatchSCCE
  Test     siz   Hit       siz   e-val   %id  %sim  gap  Ssiz   dT    eT   dH    eH
  YAL002w  1176  C42C1.4   1259  5e-14    16    44    7   674  438  1111  547  1196
  YAL005c   642  F26D10.3   640  1e-159   73    84    0   605    3   607    5   613
  YAL005c   642  F44E5.5    645  1e-142   63    79    0   605    3   607    5   611
  YAL005c   642  F44E5.4    645  1e-142   63    79    0   605    3   607    5   611
  YAL005c   642  C12C8.1    643  1e-141   62    79    0   605    3   607    5   611
  YAL005c   642  C15H9.6    661  1e-137   60    78    1   603    5   607   36   641
  YAL005c   642  F43E2.8    657  1e-134   58    76    1   606    1   606   29   637
  YAL005c   642  C37H5.8    657  1e-96    46    67    2   606    2   607   31   632
  YAL005c   642  F11F1.1b   607  1e-73    36    60    0   599    4   602    2   600
  YAL005c   642  F11F1.1a   614  8e-72    36    60    2   599    4   602    2   607
  YAL005c   642  F54C9.2    469  3e-47    38    66    2   379    2   380   52   433
  YAL005c   642  K09C4.3    310  2e-43    71    88    0   186    4   189    6   192
  YAL005c   642  K09C4.3    310  1e-04    54    70    -    61  327   387  189   249
  YAL005c   642  C30C11.4   776  1e-39    26    50    8   600    5   604    4   647
  YAL005c   642  T24H7.2    925  1e-31    24    50    3   506    4   509   26   548
  YAL005c   642  T14G8.3    926  3e-30    24    51    6   510    4   513   28   560
  (The gap value for the second K09C4.3 segment is missing in the source.)

Conclusion
Large-scale analyses of completely sequenced genomes allow a systematic view of genes and genome organization, and of their macro- as well as micro-evolution.
They are the starting step for the sophisticated evolutionary analyses that will be dealt with during this course.

Practical sessions (see text)
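The reciprocal best hit criterion used in the working example can be sketched as follows. This is a minimal illustration, not the course's own tooling: the function name and the dictionary layout are assumptions, and the best-hit pairs are a subset of the bestscce and bestcesc tables as read from the slides.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Return sorted (a, b) pairs where b is the best hit of a AND a is the best hit of b.

    Both arguments map a query protein to its best hit in the other
    proteome, or to None when there is no significant match (NS).
    """
    pairs = []
    for a, b in best_a_to_b.items():
        if b is not None and best_b_to_a.get(b) == a:
            pairs.append((a, b))
    return sorted(pairs)


# Subset of bestscce (SC / CE), as read from the slides.
best_sc_ce = {
    "YAL002w": "C42C1.4",
    "YAL003w": "F54H12.6",
    "YAL004w": None,        # NS: no significant match
    "YAL005c": "F26D10.3",
    "YAL007c": "F57B10.5",
    "YAL019w": "M03C11.8",
}

# Subset of bestcesc (CE / SC), as read from the slides.
best_ce_sc = {
    "C42C1.4": "YAL002w",
    "F54H12.6": "YAL003w",
    "F26D10.3": "YER103w",  # best SC hit is YER103w, not YAL005c
    "F57B10.5": "YAL007c",
    "M03C11.8": "YAL019w",
}

rbh = reciprocal_best_hits(best_sc_ce, best_ce_sc)
for sc_gene, ce_gene in rbh:
    print(sc_gene, ce_gene)
```

Under this strict criterion YAL005c and F26D10.3 would not be paired, because the best S. cerevisiae hit of F26D10.3 is YER103w; in a large paralogous family such as HSP70, near-tied scores can break reciprocity, which is one reason the "easy way" is only a first approximation of orthology.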