Computing and Biology

One of the main impacts of computing on biology was to make possible meaningful comparisons between large numbers of sequences. As the number of sequences increased, the methods became faster and more sophisticated.

Sequence comparison is based on the idea that all current biological sequences have evolved from one, or a small number of, self-replicating sequences. All genes are descended from other genes, although for the great majority of them it is no longer possible to trace the pattern of descent. For the more recent evolutionary branchings, however, we can trace the relationships between genes by studying their superficial similarity at the sequence level, or HOMOLOGY.

Homology and Homologs

Homology just means sequence similarity by virtue of a common evolutionary ancestor.

    >gi|24640218|ref|NP_572350.2| CG3126-PA, isoform A [Drosophila melanogaster]
    Length=1571

     Score = 427 bits (1098), Expect = 6e-118
     Identities = 223/415 (53%), Positives = 297/415 (71%), Gaps = 19/415 (4%)
     Frame = +2

    Query  1901  SLVDHNEIMAKLTLKQEGDDGPDVRGGSGDILLVHATETDRKDLVLYFEAFLTTYRTFIT  2080
                 ++++   I   L LK+  +DGP+V+GG  D L+VHA+   +     + EAF+TT+RTFI
    Sbjct  1151  NMLEEVNITRYLILKKREEDGPEVKGGYIDALIVHASRVQKVADNAFCEAFITTFRTFIQ  1210

    Query  2081  PEELIQKLQYRYERF-CHFQDTFKQRVSKNTFFVLVRVVDELCLVEMTDEILKLLMELVF  2257
                 P ++I+KL +RY  F C  QD  KQ+ +K TF +LVRVV++L   ++T ++L LL+E V+
    Sbjct  1211  PIDVIEKLTHRYTYFFCQVQDN-KQKAAKETFALLVRVVNDLTSTDLTSQLLSLLVEFVY  1269

    Query  2258  RLVCKGELSLARILRKNILEKV---ENKRMLHHANS--ALKPLAARGVAARPG-------  2401
                 +LVC G+L LA++LR   +EKV   +  ++            +   G+A   G
    Sbjct  1270  QLVCSGQLYLAKLLRNKFVEKVTLYKEPKVYGFVGELGGAGSVGGAGIAGSGGCSGTAGG  1329

    Query  2402  ----TLHDFHSLEIAEQLTLLDAELFYKIEIPEVLLWAKEQNEEKSPNLTQFTEHFNNMS  2569
                     +L D  SLEIAEQ+TLLDAELF KIEIPEVLL+AK+Q EEKSPNL +FTEHFN MS
    Sbjct  1330  GNQPSLLDLKSLEIAEQMTLLDAELFTKIEIPEVLLFAKDQCEEKSPNLNKFTEHFNKMS  1389

    Query  2570  YWVRSIIMLQEKAQDRERLLLKFIKIMKHLRKLNNFNSYLAILSALDSAPIRRLEWQKQT  2749
                 YW RS I+  + A++RE+ + KFIKIMKHLRK+NN+NSYLA+LSALDS PIRRLEWQK
    Sbjct  1390  YWARSKILRLQDAKEREKHVNKFIKIMKHLRKMNNYNSYLALLSALDSGPIRRLEWQKGI  1449

    Query  2750  SEGLAEYCTLIDSSSSFRAYRAALAEVEPPCIPYLGLILQDLTFVHLGNPDHID-GKVNF  2926
                 +E +  +C LIDSSSSFRAYR ALAE  PPCIPY+GLILQDLTFVH+GN D++  G +NF
    Sbjct  1450  TEEVRSFCALIDSSSSFRAYRQALAETNPPCIPYIGLILQDLTFVHVGNQDYLSKGVINF  1509

    Query  2927  SKRWQQFNILDSMRRFQQVHYEIRRNDEIISFFNDFSDHLAEEALWELSLKIKPR  3091
                 SKRWQQ+NI+D+M+RF++  Y  RRN+ II FF++F D + EE +W++S KIKPR
    Sbjct  1510  SKRWQQYNIIDNMKRFKKCAYPFRRNERIIRFFDNFKDFMGEEEMWQISEKIKPR  1564

These two sequences, my Xenopus query sequence and the matching Drosophila sequence, show strong (and variable) homology, but even if we knew the function of the Drosophila gene, that might not tell us much about the function of the Xenopus gene.

Genes and Evolution - I

Gene duplication through speciation
The two copies of Gene A will now evolve independently, but will continue to have the same function.
They are ORTHOLOGS.

Genes and Evolution - II

Gene duplication through internal genome duplication
The two copies of Gene A will now evolve independently, but will probably not continue to have exactly the same function.
They are PARALOGS.

Homologs, orthologs & paralogs

http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Orthology.html

Mutation and Evolution

Translated part of mRNA sequence (DNA on the left, its translation on the right)

Ancestral sequence
    ATGAAGGCTGCCTACGACTGCCGTGCCAGAATGCTGAGG    MKAAYDCRARMLR

In species A
    ATGAAGGCTGCCTATGACTGCCGTGCCAGAATGCTGAGG    MKAAYDCRARMLR
    ATGAATGCTGCCTATGACTGCCGTGCCAGAATGCTGAGG    MNAAYDCRARMLR
    ATGAATGCTGCCTATGACTGCCGTGCCAGAATGCTAAGG    MNAAYDCRARMLR
    ATGAATGCTGCCTATGACTGCCGTG   GAATGCTAAGG    MNAAYDCR GMLR
    ATGAATGCAGCCTATGACTGCCGTG   GAATGCTAAGG    MNAAYDCR GMLR
    ATGAATGCAGCCTATGATTGCCGTG   GAATGCTAAGG    MNAAYDCR GMLR
    ATGAATGCAGCCTATGATTGCCGAG   GAATGCTAAGG    MNAAYDCR GMLR

In species B
    ATGAAGGCTGCCTACGACTGCCGTGCCATAATGCTGAGG    MKAAYDCRAIMLR
    ATGAAGGCCGCCTACGACTGCCGTGCCATAATGCTGAGG    MKAAYDCRAIMLR
    ATGAAGGCCGCCTACGACTGTCGTGCCATAATGCTGAGG    MKAAYDCRAIMLR
    ATGAAGGCCGCCTACGACTGTCGTGCCATAATGCTGAGA    MKAAYDCRAIMLR
    ATGAAGGCCGCCTACGACTGTCGTGCCATAATCCTGAGA    MKAAYDCRAIILR
    ATGAAGGCCGCATACGACTGTCGTGCCATAATCCTGAGA    MKAAYDCRAIILR

The final sequence of species A aligned against the final sequence of species B:

    ATGAATGCAGCCTATGATTGCCGAG---GAATGCTAAGG    MNAAYDCR-GMLR
    ||||| || || || || || || |    ||| || ||     | ||||||  +||
    ATGAAGGCCGCATACGACTGTCGTGCCATAATCCTGAGA    MKAAYDCRAIILR

Searching for Similarity

DNA comparison

    ATGAATGCAGCCTATGATTGCCGAG---GAATGCTAAGG
    ||||| || || || || || || |    ||| || ||
    ATGAAGGCCGCATACGACTGTCGTGCCATAATCCTGAGA

amino acid comparison

    MNAAYDCR-GMLR
    | ||||||  +||
    MKAAYDCRAIILR

The DNA sequence can change while the amino acid sequence stays the same, so always look for similarities by comparing amino acid sequences.

We note that evolution causes sequences to change by substitution, insertion or deletion, but not usually by small-scale re-ordering. So we need a tool that finds the ‘alignment’ between two sequences which shows the greatest degree of similarity while introducing as few gaps as possible.

The Downside of Gaps

Take two random sequences, with no ‘real’ similarity:

    GACACTAGGTCGATGCGTGGTGGCGAGA
    ACGCATCCGGATGTGCACCGTGGAACTG

And allow cost-free gaps:

    GAC--ACT----AGGTCGATGC---GTGG---TGGCGAGA
     ||  | |    |  | | |||   ||||   ||
     ACGCA-TCCGGA--T-G-TGCACCGTGGAACTG

Although the alignment has no mismatches, it is obviously not biologically meaningful! The introduction of gaps into alignments should ideally reflect biological possibilities, but this is rather difficult. So the tendency is to make gaps ‘expensive’, and to introduce them only when they create more long-range matching than they introduce ‘un’-matching, e.g.

Without gaps:

     TTCCCAACTCTCCTCTTTCACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA
     | ||       |   || |||||||||||||||||||| ||||||||| ||| |||   |      |||   |||  |
    TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCCAGAA

With two gaps (marked =):

    TTCCCAACTCTCCTCTTT=CACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA
    |||||| ||||||||||| |||||||||||||||||||| ||||||||| |||||||||||||| |||||||||| ||||
    TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTC=CCCCAAAATCAAGCGCACCCCGTCCCAGAA

The Essential Task

Basically, what we are trying to do is to see whether we can work out the function of an unknown gene by comparing its sequence with those of genes in other species where we already know the function. We can do this because the sequence of most genes is conserved to some extent during the evolution of different species.

The problem is that while gene function is probably related to both overall three-dimensional structure and small regions of specific linear sequence, our only serious tool for discerning similarity between proteins is based firmly on long-range linear sequence similarity. And there is no obvious requirement on genes to conserve sequence in order to conserve function; it’s just easier that way…

But it seems clear that we can only expect this to be effective if we are looking at true ORTHOLOGS.

Finding Orthologs

So how do we find orthologs, and can we know when we have? The simplest method is Reciprocal Best BLAST, but it implicitly relies on having all the protein sequences of your own organism, and of the one in which you wish to find an ortholog.

[Diagram: frog protein -> database of human proteins -> best match human protein -> database of frog proteins -> x]
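The Reciprocal Best BLAST idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used in these slides: it assumes the two BLAST searches (frog against human, human against frog) have already been run and reduced to (query, subject, bit score) triples. The function names `best_hits` and `reciprocal_best_hits`, the protein identifiers and the scores are all hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

# One BLAST hit reduced to (query id, subject id, bit score).
Hit = Tuple[str, str, float]


def best_hits(hits: Iterable[Hit]) -> Dict[str, str]:
    """Map each query to its highest-scoring subject (first seen wins ties)."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b: Iterable[Hit],
                         b_vs_a: Iterable[Hit]) -> List[Tuple[str, str]]:
    """Return pairs (a, b) where b is a's best hit and a is b's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical toy data: identifiers and scores are invented for illustration.
frog_vs_human = [
    ("frog_1", "human_1", 427.0),
    ("frog_1", "human_2", 150.0),
    ("frog_2", "human_2", 300.0),
    ("frog_3", "human_2", 310.0),
]
human_vs_frog = [
    ("human_1", "frog_1", 420.0),
    ("human_2", "frog_3", 305.0),
    ("human_2", "frog_2", 298.0),
]

pairs = reciprocal_best_hits(frog_vs_human, human_vs_frog)
print(pairs)
```

In this toy data frog_2 finds human_2 as its best match, but human_2 finds frog_3 instead, so frog_2 is not reported as a candidate ortholog. A reciprocal best hit is still only evidence of orthology, and it depends on both protein sets being complete, as noted above.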