Computing and Biology
One of the main impacts of computing on biology was to make
possible meaningful comparisons between large numbers of
sequences.
And as the numbers of sequences increased the methods got
faster and more sophisticated.
Sequence comparison is based on the idea that all current
biological sequences have evolved from one, or a small number of,
self-replicating sequences.
All genes are descended from other genes – although for the great
majority of them it is no longer possible to trace the pattern of
descent.
However, for the more recent evolutionary branchings we can trace
the relationships between genes by studying their superficial
similarity at the sequence level, or HOMOLOGY.
Homology and Homologs
Homology just means sequence similarity by virtue of a common
evolutionary ancestor.
>gi|24640218|ref|NP_572350.2| CG3126-PA, isoform A [Drosophila melanogaster]
Length=1571
Score = 427 bits (1098), Expect = 6e-118
Identities = 223/415 (53%), Positives = 297/415 (71%), Gaps = 19/415 (4%)
Frame = +2

Query 1901 SLVDHNEIMAKLTLKQEGDDGPDVRGGSGDILLVHATETDRKDLVLYFEAFLTTYRTFIT 2080
           ++++   I   L LK+  +DGP+V+GG  D L+VHA+   +     + EAF+TT+RTFI
Sbjct 1151 NMLEEVNITRYLILKKREEDGPEVKGGYIDALIVHASRVQKVADNAFCEAFITTFRTFIQ 1210

Query 2081 PEELIQKLQYRYERF-CHFQDTFKQRVSKNTFFVLVRVVDELCLVEMTDEILKLLMELVF 2257
           P ++I+KL +RY  F C  QD  KQ+ +K TF +LVRVV++L   ++T ++L LL+E V+
Sbjct 1211 PIDVIEKLTHRYTYFFCQVQDN-KQKAAKETFALLVRVVNDLTSTDLTSQLLSLLVEFVY 1269

Query 2258 RLVCKGELSLARILRKNILEKV---ENKRMLHHANS--ALKPLAARGVAARPG------- 2401
           +LVC G+L LA++LR   +EKV   +  ++            +   G+A   G
Sbjct 1270 QLVCSGQLYLAKLLRNKFVEKVTLYKEPKVYGFVGELGGAGSVGGAGIAGSGGCSGTAGG 1329

Query 2402 ----TLHDFHSLEIAEQLTLLDAELFYKIEIPEVLLWAKEQNEEKSPNLTQFTEHFNNMS 2569
               +L D  SLEIAEQ+TLLDAELF KIEIPEVLL+AK+Q EEKSPNL +FTEHFN MS
Sbjct 1330 GNQPSLLDLKSLEIAEQMTLLDAELFTKIEIPEVLLFAKDQCEEKSPNLNKFTEHFNKMS 1389

Query 2570 YWVRSIIMLQEKAQDRERLLLKFIKIMKHLRKLNNFNSYLAILSALDSAPIRRLEWQKQT 2749
           YW RS I+  + A++RE+ + KFIKIMKHLRK+NN+NSYLA+LSALDS PIRRLEWQK
Sbjct 1390 YWARSKILRLQDAKEREKHVNKFIKIMKHLRKMNNYNSYLALLSALDSGPIRRLEWQKGI 1449

Query 2750 SEGLAEYCTLIDSSSSFRAYRAALAEVEPPCIPYLGLILQDLTFVHLGNPDHID-GKVNF 2926
           +E +  +C LIDSSSSFRAYR ALAE  PPCIPY+GLILQDLTFVH+GN D++  G +NF
Sbjct 1450 TEEVRSFCALIDSSSSFRAYRQALAETNPPCIPYIGLILQDLTFVHVGNQDYLSKGVINF 1509

Query 2927 SKRWQQFNILDSMRRFQQVHYEIRRNDEIISFFNDFSDHLAEEALWELSLKIKPR 3091
           SKRWQQ+NI+D+M+RF++  Y  RRN+ II FF++F D + EE +W++S KIKPR
Sbjct 1510 SKRWQQYNIIDNMKRFKKCAYPFRRNERIIRFFDNFKDFMGEEEMWQISEKIKPR 1564
These two sequences, my Xenopus query sequence and the matching
Drosophila sequence, show strong (and variable) homology, but
even if we knew the function of the Drosophila gene it may not
tell us much about the function of the Xenopus gene.
Genes and Evolution - I
Gene duplication through speciation
The two copies of Gene A will now evolve independently, but will
continue to have the same function.
They are ORTHOLOGS
Genes and Evolution - II
Gene duplication through internal genome duplication
The two copies of Gene A will now evolve independently, but will
probably not continue to have exactly the same function.
They are PARALOGS
Homologs, orthologs & paralogs
http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Orthology.html
Mutation and Evolution
Translated part of mRNA sequence
Ancestral sequence
ATGAAGGCTGCCTACGACTGCCGTGCCAGAATGCTGAGG   MKAAYDCRARMLR
In species A
ATGAAGGCTGCCTATGACTGCCGTGCCAGAATGCTGAGG   MKAAYDCRARMLR
ATGAATGCTGCCTATGACTGCCGTGCCAGAATGCTGAGG   MNAAYDCRARMLR
ATGAATGCTGCCTATGACTGCCGTGCCAGAATGCTAAGG   MNAAYDCRARMLR
ATGAATGCTGCCTATGACTGCCGTG   GAATGCTAAGG   MNAAYDCR GMLR
ATGAATGCAGCCTATGACTGCCGTG   GAATGCTAAGG   MNAAYDCR GMLR
ATGAATGCAGCCTATGATTGCCGTG   GAATGCTAAGG   MNAAYDCR GMLR
ATGAATGCAGCCTATGATTGCCGAG   GAATGCTAAGG   MNAAYDCR GMLR
In species B
ATGAAGGCTGCCTACGACTGCCGTGCCATAATGCTGAGG   MKAAYDCRAIMLR
ATGAAGGCCGCCTACGACTGCCGTGCCATAATGCTGAGG   MKAAYDCRAIMLR
ATGAAGGCCGCCTACGACTGTCGTGCCATAATGCTGAGG   MKAAYDCRAIMLR
ATGAAGGCCGCCTACGACTGTCGTGCCATAATGCTGAGA   MKAAYDCRAIMLR
ATGAAGGCCGCCTACGACTGTCGTGCCATAATCCTGAGA   MKAAYDCRAIILR
ATGAAGGCCGCATACGACTGTCGTGCCATAATCCTGAGA   MKAAYDCRAIILR

ATGAATGCAGCCTATGATTGCCGAG---GAATGCTAAGG   MNAAYDCR-GMLR
||||| || || || || || || |    ||| || ||    | ||||||  +||
ATGAAGGCCGCATACGACTGTCGTGCCATAATCCTGAGA   MKAAYDCRAIILR
Searching for Similarity
DNA comparison
ATGAATGCAGCCTATGATTGCCGAG---GAATGCTAAGG
||||| || || || || || || |    ||| || ||
ATGAAGGCCGCATACGACTGTCGTGCCATAATCCTGAGA
amino acid comparison
MNAAYDCR-GMLR
| ||||||  +||
MKAAYDCRAIILR
The DNA sequence can change while the amino acid sequence
stays the same, so always look for similarities by comparing amino
acid sequences.
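The point about silent DNA changes can be checked directly with a small translation routine. This is only a minimal sketch: it assumes the standard genetic code, and the names `translate` and `count_differences` are illustrative helpers, not part of any tool mentioned in these slides. The sequences are the ancestral sequence and the final species A and species B sequences from the Mutation and Evolution slide.

```python
# Standard genetic code, written in the conventional TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna):
    """Translate a DNA string codon by codon, ignoring gap characters."""
    dna = dna.upper().replace("-", "").replace(" ", "")
    usable = len(dna) - len(dna) % 3  # drop any incomplete trailing codon
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def count_differences(a, b):
    """Count positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


ancestral = "ATGAAGGCTGCCTACGACTGCCGTGCCAGAATGCTGAGG"
one_silent_change = "ATGAAGGCTGCCTATGACTGCCGTGCCAGAATGCTGAGG"  # TAC -> TAT
species_a = "ATGAATGCAGCCTATGATTGCCGAG---GAATGCTAAGG"
species_b = "ATGAAGGCCGCATACGACTGTCGTGCCATAATCCTGAGA"

print(translate(ancestral))
print(count_differences(ancestral, one_silent_change),
      translate(ancestral) == translate(one_silent_change))
print(translate(species_a), translate(species_b))
```

The second line of output shows a DNA difference that leaves the protein unchanged, which is why the protein comparison is the less noisy of the two.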
We note that evolution causes sequences to change by substitution,
insertion or deletion, but not usually by small-scale re-ordering.
So we need a tool which will find the ‘alignment’ between the two
sequences that shows the greatest degree of similarity while
introducing as few gaps as possible.
The Downside of Gaps
Take two random sequences, with no ‘real’ similarity:
GACACTAGGTCGATGCGTGGTGGCGAGA
ACGCATCCGGATGTGCACCGTGGAACTG
And allow cost free gaps:
GAC--ACT----AGGTCGATGC---GTGG---TGGCGAGA
 ||  | |    |  | | |||   ||||   ||
 ACGCA-TCCGGA--T-G-TGCACCGTGGAACTG
Although the alignment has no mismatches, it is clearly not biologically
meaningful!
The introduction of gaps into alignments should ideally reflect biological possibilities,
but this is rather difficult. So the tendency is to make gaps ‘expensive’, and to introduce
them only when the long-range matching they restore outweighs the mismatching they
cause, e.g.
 TTCCCAACTCTCCTCTTTCACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA
 | ||       |   || |||||||||||||||||||| ||||||||| ||| |||   |      |||   |||  |
TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTCCCCCAAAATCAAGCGCACCCCGTCCCAGAA

TTCCCAACTCTCCTCTTT=CACCATGAAGCTCAAGGACAGATTCCACTCGCCCCAAAATCAAGCTCACCCCGTCCAAGAA
|||||| ||||||||||| |||||||||||||||||||| ||||||||| |||||||||||||| |||||||||| ||||
TTCCCACCTCTCCTCTTTGCACCATGAAGCTCAAGGACAAATTCCACTC=CCCCAAAATCAAGCGCACCCCGTCCCAGAA
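The effect of gap cost can be sketched with a small global alignment routine (the Needleman-Wunsch dynamic programme) using a single linear gap penalty. The scoring values here (match +1, mismatch -1, gap -2) are assumptions chosen for illustration, not values from these slides, and real tools generally use substitution matrices and separate gap-opening and gap-extension costs. The two test sequences are the random pair from this slide.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment with a linear gap penalty.

    Returns (score, aligned_a, aligned_b), with '-' marking gaps.
    """
    def pair(x, y):
        return match if x == y else mismatch

    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap          # a[:i] aligned against gaps only
    for j in range(1, cols):
        score[0][j] = j * gap          # b[:j] aligned against gaps only
    for i in range(1, rows):
        for j in range(1, cols):
            score[i][j] = max(
                score[i - 1][j - 1] + pair(a[i - 1], b[j - 1]),  # align pair
                score[i - 1][j] + gap,                           # gap in b
                score[i][j - 1] + gap,                           # gap in a
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    top, bottom = [], []
    i, j = len(a), len(b)
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and score[i][j] == score[i - 1][j - 1] + pair(a[i - 1], b[j - 1])):
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(top)), "".join(reversed(bottom))


RANDOM_A = "GACACTAGGTCGATGCGTGGTGGCGAGA"
RANDOM_B = "ACGCATCCGGATGTGCACCGTGGAACTG"

for gap_cost in (0, -2):
    s, top, bottom = global_align(RANDOM_A, RANDOM_B, gap=gap_cost)
    print(f"gap cost {gap_cost}: score {s}, gaps {(top + bottom).count('-')}")
    print(top)
    print(bottom)
```

With cost-free gaps the routine is free to slide the sequences past each other until only matching letters are paired; once each gap costs more than a mismatch, it has to justify every gap with matches gained elsewhere.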
The Essential Task
Basically, what we are trying to do is to see whether we can
work out the function of an unknown gene by comparing its
sequence with those of genes in other species where we
already know the function.
We can do this because the sequence of most genes is conserved to some extent
during evolution of different species.
The problem is that while gene function is probably related both to the overall
three-dimensional structure of its protein and to small regions of specific linear
sequence, our only serious tool for discerning similarity between proteins is based
firmly on long range linear sequence similarity.
And there is no obvious requirement on genes to conserve sequence in order to
conserve function – it’s just easier that way…
But it seems clear that we can only expect this to be effective if
we are looking at true ORTHOLOGS.
Finding Orthologs
So how do we find orthologs, and can we know when we have?
The simplest method is Reciprocal Best BLAST, but it implicitly relies on having all
the protein sequences of your own organism, and of the one in which you wish to
find an ortholog.
frog protein -> database of human proteins -> best match human protein -> database of frog proteins -> x
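The reciprocal test can be sketched in a few lines once the search results are in hand. This is a minimal sketch under stated assumptions: the score tables below are made-up illustrative numbers, the protein names are hypothetical, and in practice the scores would come from two all-against-all BLAST searches (frog against human, and human against frog). The function names are illustrative, not part of BLAST.

```python
def best_hit(hits):
    """Return the subject with the highest score, or None if there are no hits."""
    return max(hits, key=hits.get) if hits else None


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    pairs = []
    for query, hits in a_vs_b.items():
        subject = best_hit(hits)
        if subject is not None and best_hit(b_vs_a.get(subject, {})) == query:
            pairs.append((query, subject))
    return pairs


# Hypothetical bit scores: query -> {subject: score}
frog_vs_human = {
    "frogA": {"humanA": 427.0, "humanB": 120.0},
    "frogB": {"humanA": 300.0},
}
human_vs_frog = {
    "humanA": {"frogA": 430.0, "frogB": 290.0},
    "humanB": {"frogB": 110.0},
}

print(reciprocal_best_hits(frog_vs_human, human_vs_frog))
```

Here frogB finds humanA as its best match, but humanA's own best match is frogA, so only the frogA and humanA pair passes the reciprocal test. A missing protein in either database can change which hit is "best", which is why the method relies on complete protein sets.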