Dotplot: A protein built from modules that resemble each other

The choice of scoring values (substitution matrix) is
important

• Scoring matrices appear in all analyses
  involving sequence comparison.
• The choice of matrix can strongly influence
  the outcome of the analysis.
• Scoring matrices implicitly represent a
  particular theory of evolution.
• Understanding the theory underlying a given
  scoring matrix can aid in making a proper
  choice.
Different principles for
substitution matrices




• Identity matrix
• Genetic code matrix: score based on the minimum number of base
  changes required to convert one amino acid into another.
• Physical/chemical characteristics: attempt to quantify some physical or
  chemical attribute of the residues and arbitrarily assign weights based
  on similarities of the residues.
• Log odds matrices: $S_{ij} = \log \left( q_{ij} / (p_i p_j) \right)$
  S is the log odds ratio of two probabilities: the probability that two
  residues, i and j, are aligned by evolutionary descent and the
  probability that they are aligned by chance.
  q_ij are the frequencies with which residues i and j are observed to align in
  sequences known to be related. They are derived from a "transition
  probability matrix."
  p_i and p_j are the frequencies of occurrence of residues i and j in the set of
  sequences.
  E.g. PAM250, BLOSUM62 and others.
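The log-odds definition above can be sketched in a few lines. This is a minimal illustration, not taken from the slides: the function name, the choice of base 2, and the example frequencies are all assumptions made for the sake of the example.

```python
import math


def log_odds_score(q_ij, p_i, p_j, base=2.0, scale=1.0):
    """Return scale * log_base(q_ij / (p_i * p_j)).

    q_ij: observed frequency of the aligned pair (i, j) in related sequences
    p_i, p_j: background frequencies of residues i and j
    base, scale: assumptions; published matrices use different conventions
    """
    if q_ij <= 0 or p_i <= 0 or p_j <= 0:
        raise ValueError("frequencies must be positive")
    return scale * math.log(q_ij / (p_i * p_j), base)


# Hypothetical frequencies, for illustration only.
# Pair observed twice as often as chance predicts -> positive score.
pair_seen_more_often = log_odds_score(q_ij=0.02, p_i=0.1, p_j=0.1)
# Pair observed half as often as chance predicts -> negative score.
pair_seen_less_often = log_odds_score(q_ij=0.005, p_i=0.1, p_j=0.1)
```

A positive score means the pair is aligned more often in related sequences than chance would predict; a negative score means it is aligned less often.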
PAM matrices: How were they constructed by
Margaret Dayhoff?
1. Align sequences that are at
   least 85% identical (to minimize
   ambiguity in alignments and
   minimize the number of
   coincident mutations).
2. Reconstruct phylogenetic trees
   and infer ancestral sequences.
   71 trees containing 1,572
   exchanges were used.
3. Count replacements "accepted"
   by natural selection, in all
   pairwise comparisons (each A_ij
   is the number of times amino
   acid j was replaced by amino
   acid i in all comparisons).
4. Compute amino acid mutability
   m_j, i.e. the propensity of a
   given amino acid, j, to be
   replaced.
PAM construction, continued
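5. Combine data from 3 & 4 to produce a Mutation Probability Matrix, M,
   for one PAM of evolutionary distance (1 PAM = one accepted point
   mutation per 100 residues), according to the following formulae:
   $M_{ij} = \lambda m_j A_{ij} / \sum_i A_{ij}$ for $i \neq j$, and
   $M_{jj} = 1 - \lambda m_j$,
   where $\lambda$ is a proportionality constant chosen so that on average
   1% of residues change.
6. Calculate the Log Odds Matrix for similarity scoring: divide each
   element of the Mutation Probability Matrix, M, by the frequency of
   occurrence of each residue:
   $R_{ij} = M_{ij} / f_i$
   R is the Relatedness Odds Matrix and f_i is the frequency of
   residue i.
   The Log Odds Matrix, S_ij, is calculated from the
   Relatedness Odds Matrix, R_ij, simply by taking the log (base 10) of
   each R_ij and multiplying by 10:
   $S_{ij} = 10 \log_{10} R_{ij}$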
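Steps 3 to 6 can be sketched for a toy three-letter alphabet. This is a rough illustration only: the replacement counts and frequencies are invented, the mutability is simplified to "total replacements divided by frequency" (Dayhoff weighted it by exposure in the trees), and the formula used for M is the standard Dayhoff form, M_ij = λ m_j A_ij / Σ_i A_ij with M_jj = 1 - λ m_j.

```python
import math


def mutabilities(A, f):
    """Simplified mutability (assumption): replacements of j divided by frequency of j."""
    n = len(f)
    return [sum(A[i][j] for i in range(n) if i != j) / f[j] for j in range(n)]


def pam1_matrix(A, m, f):
    """Mutation probability matrix for 1 PAM.

    A[i][j]: number of times residue j was replaced by residue i
    m[j]: mutability of residue j; f[j]: frequency of residue j
    M[i][j]: probability that j is replaced by i (columns sum to 1)
    """
    n = len(f)
    # Scale so that, on average, 1 residue in 100 changes.
    lam = 0.01 / sum(f[j] * m[j] for j in range(n))
    M = [[0.0] * n for _ in range(n)]
    for j in range(n):
        column_total = sum(A[i][j] for i in range(n) if i != j)
        for i in range(n):
            if i != j:
                M[i][j] = lam * m[j] * A[i][j] / column_total
        M[j][j] = 1.0 - lam * m[j]
    return M


def log_odds_matrix(M, f):
    """S_ij = 10 * log10(M_ij / f_i); requires every M_ij > 0."""
    n = len(f)
    return [[10.0 * math.log10(M[i][j] / f[i]) for j in range(n)] for i in range(n)]


# Hypothetical symmetric replacement counts and frequencies for residues 0, 1, 2.
A = [[0, 30, 10],
     [30, 0, 20],
     [10, 20, 0]]
f = [0.5, 0.3, 0.2]

m = mutabilities(A, f)
M = pam1_matrix(A, m, f)
S = log_odds_matrix(M, f)
```

With these toy numbers the diagonal of S is positive (staying unchanged is far more likely than chance at 1 PAM) and the off-diagonal entries are negative.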
PAM 250 substitution matrix
Limitations of the PAM model
Assumptions in the PAM model:
1. Replacement at any site depends only on the amino acid at that site and the
   probability given by the table (Markov model).
2. Sequences that are being compared have average amino acid composition.
Sources of error in the PAM model:
1. Many sequences depart from average composition.
2. Rare replacements were observed too infrequently to resolve relative
   probabilities accurately (for 36 pairs no replacements were observed!).
3. Errors in 1 PAM are magnified in the extrapolation to 250 PAM.
4. The Markov process is an imperfect representation of evolution: distantly
   related sequences usually have islands (blocks) of conserved residues. This
   implies that replacement is not equally probable over the entire sequence.
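The extrapolation mentioned in error source 3 amounts to multiplying the 1 PAM mutation probability matrix by itself: under the Markov assumption, the matrix for n PAM is the n-th power of the 1 PAM matrix. A minimal sketch with a made-up two-letter alphabet (the matrix values are hypothetical, chosen only so that the columns sum to 1):

```python
def mat_mul(X, Y):
    """Plain matrix product of two square matrices given as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def pam_n(M1, n):
    """Raise the 1-step matrix M1 to the n-th power by repeated multiplication."""
    result = M1
    for _ in range(n - 1):
        result = mat_mul(result, M1)
    return result


# Hypothetical 1-step matrix: M1[i][j] is the probability that j becomes i.
M1 = [[0.99, 0.02],
      [0.01, 0.98]]

M250 = pam_n(M1, 250)
```

Because every entry of the result is a sum of products of 250 factors taken from M1, a small relative error in a rarely observed entry of M1 carries through to many terms of M250, which is why sparse counts at 1 PAM matter so much at 250 PAM.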
BLOSUM (Blocks Substitution Matrix)
substitution matrices
1. The starting data are conserved blocks from the Blocks database.
   • aligned, ungapped sequences
   • widely varying similarity, but measures are taken to avoid
     biasing the sample with frequently occurring, highly related
     sequences.
2. Replacements are counted by straightforward counting of
   all pairs of aligned residues, f_ij
   • The observed frequency of each pair is:
     $q_{ij} = f_{ij} / (\text{total number of residue pairs})$
   • This includes cases of i = j (i.e. no replacement observed).
   • The expected frequency of each pair is essentially the product
     of the frequencies of each residue in the data set.
BLOSUM (Blocks Substitution Matrix)
substitution matrices
3. Similar sequences in a block above a threshold
   percent similarity are clustered, and members of the
   cluster count fractionally toward the final tally.
   – Reduces the number of identical pairs (AA, SS, TT, etc.,
     matches) in the final tallies.
   – Somewhat analogous to increasing the PAM distance.
   – If the clustering threshold is 80%, the final matrix is BLOSUM 80.
   – Clustering at 62% reduces the number of blocks contributing
     to the table by 25%; still, 1.25 × 10^6 pairs contributed!
   – The least frequent amino acid pair replacement was observed
     2369 times!
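The counting in steps 1 and 2 can be sketched as follows. This is an illustration under stated assumptions, not the Blocks pipeline: the function name and the tiny block are invented, the clustering of step 3 is left out entirely, and the final scaling (rounded 2 × log2 of observed over expected, i.e. half-bit units) is the convention commonly reported for BLOSUM rather than something stated on the slides.

```python
import math
from collections import Counter
from itertools import combinations


def blosum_like_scores(block):
    """Score residue pairs from one ungapped block (list of equal-length strings)."""
    # Count every pair of aligned residues, column by column (f_ij).
    pair_counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            pair_counts[tuple(sorted((a, b)))] += 1

    # Observed pair frequencies q_ij = f_ij / total number of pairs.
    total = sum(pair_counts.values())
    q = {pair: count / total for pair, count in pair_counts.items()}

    # Residue frequencies implied by the counted pairs.
    p = Counter()
    for (a, b), q_ab in q.items():
        if a == b:
            p[a] += q_ab
        else:
            p[a] += q_ab / 2
            p[b] += q_ab / 2

    # Expected frequency: p_a^2 for identical pairs, 2 * p_a * p_b otherwise.
    scores = {}
    for (a, b), q_ab in q.items():
        expected = p[a] * p[a] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = round(2 * math.log2(q_ab / expected))
    return scores


# Hypothetical block of four two-residue sequences.
scores = blosum_like_scores(["AS", "AS", "AT", "AT"])
```

Only pairs that actually occur in the block receive a score here; a real matrix needs enough data for every pair to be observed, which is the point of the 2369 figure above.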
BLOSUM 62
BLOSUM and PAM: a comparison
FASTA and BLAST: searching for
related sequences in the databases
Searching the databases with a rigorous Smith-Waterman
algorithm is resource-intensive (but possible). FASTA and
BLAST give faster searches and lower resource use by
taking shortcuts. Both apply a preliminary "screening"
of the sequences in the database, so that only
sequences that look interesting (appear to resemble the
query sequence) are processed further.
How FASTA works
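Figure: FASTA lookup with ktup = 1

s = H A R F Y A A Q I V L   (positions 1-11)
t = V D M A A Q I A         (positions 1-8)

Hash table (residue: positions in s)
A   2, 6, 7
F   4
H   1
I   9
L   11
Q   8
R   3
V   10
Y   5
others...

Offsets (position in s minus position in t) for each residue of t
V   +9
D   (none)
M   (none)
A   -2, +2, +3
A   -3, +1, +2
Q   +2
I   +2
A   -6, -2, -1

Offset vector (number of hits per offset, from -7 to +10)
-6: 1, -3: 1, -2: 2, -1: 1, +1: 1, +2: 4, +3: 1, +9: 1; all other offsets: 0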
From: G. J. Barton: Protein Sequence Alignment and Database Scanning, in Protein Structure Prediction - A Practical Approach, edited by M. J. E. Sternberg, IRL Press at Oxford University Press, 1996
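The hash table and offset vector from the figure can be reproduced with a short sketch. The function names are made up for illustration, and the sequence t = VDMAAQIA is read from the figure; positions are 1-based to match it.

```python
from collections import Counter, defaultdict


def build_hash_table(s, ktup=1):
    """Map every k-tuple of s to the list of 1-based positions where it starts."""
    table = defaultdict(list)
    for pos in range(len(s) - ktup + 1):
        table[s[pos:pos + ktup]].append(pos + 1)
    return table


def offset_vector(s, t, ktup=1):
    """Count k-tuple hits per offset (position in s minus position in t)."""
    table = build_hash_table(s, ktup)
    offsets = Counter()
    for t_pos in range(len(t) - ktup + 1):
        word = t[t_pos:t_pos + ktup]
        for s_pos in table.get(word, []):
            offsets[s_pos - (t_pos + 1)] += 1
    return offsets


s = "HARFYAAQIVL"
t = "VDMAAQIA"

table = build_hash_table(s)
offsets = offset_vector(s, t)
best_offset, hits = offsets.most_common(1)[0]
```

The offset with the most hits (+2 here, from the shared stretch AAQI) marks the diagonal on which the two sequences are most likely to align, and it is found in a single pass over t without comparing every residue pair.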
FASTA, continued
FASTA then joins two or more k-tuples
if they do not lie too far apart; together they
make up a region. This can be seen as a local
alignment without gaps.
The 5 best regions from the previous phase are then
rescored with PAM120 or PAM250. This is the
first measure of similarity between r and s and is called the initial
score in the result file. One is computed for every
sequence in the database.
The optimized score is then computed in the manner of Smith-Waterman,
but restricted to cells in a band around the initial alignment.
FASTA: choice of
k-tuple value
For DNA searches ktup is 4-6, for protein searches
1 or 2.
The choice of ktup affects the result:
• A low ktup increases sensitivity, i.e. the ability
  to find distant relatives
• A high ktup increases selectivity, i.e. the ability
  to reject false positives
Variants of FASTA
PROGRAM      FUNCTION
fasta3       scans a protein or DNA sequence library for
             similar sequences
fastx/y3     compares a DNA sequence to a protein
             sequence database, comparing the
             translated DNA sequence in forward and
             reverse frames
tfastx/y3    compares a protein to a translated DNA
             databank
fasts3       compares linked peptides to a protein
             databank
fastf3       compares mixed peptides to a protein
             databank
FASTA results
Parameters that indicate how
good our database hits are

• Init1: score of the highest-scoring initial region
• Initn: sum of the initial scores of joined regions minus a
  joining penalty for each gap
• opt: score of the optimal alignment of the region
• Z: measure of how unusual the original match is. If
  score = S, Z = (S - mean)/sd
• P: probability that the alignment is no better than
  random
• E(n): expected number of sequences giving the same
  Z-score or better if the database is probed with a
  random sequence. E = P × (database size n)
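The Z and E definitions above are simple enough to write down directly. In this sketch the function names and the numbers are hypothetical, and the use of the population standard deviation over a list of background scores is an assumption; FASTA's own statistics are more involved.

```python
import statistics


def z_score(score, background_scores):
    """Z = (S - mean) / sd, with mean and sd taken over the background scores."""
    mean = statistics.mean(background_scores)
    sd = statistics.pstdev(background_scores)
    return (score - mean) / sd


def expected_hits(p_value, database_size):
    """E = P * n: expected number of chance hits in a database of n sequences."""
    return p_value * database_size


# Hypothetical numbers, for illustration only.
z = z_score(40, [28, 32])            # mean 30, sd 2
e = expected_hits(1e-6, 100_000)     # about 0.1 chance hits expected
```

The same P value becomes less convincing as the database grows, which is why E rather than P is usually the number to look at.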
Assessing the results

• Z-score > 5: significant
• P < 10^-100: exact match
  10^-100 < P < 10^-50: nearly identical sequences
  10^-50 < P < 10^-10: closely related, certain
  homology
  10^-5 < P < 10^-1: usually distant relatives
  P > 10^-1: probably not a significant hit

• E < 0.02: probably homologous sequences
  0.02 < E < 1: homology cannot be ruled out
  E > 1: by chance?
How BLAST (Basic Local Alignment
Search Tool) works

• BLAST makes a list of all three-letter words
  (subsequences) in the query protein (for the sequence
  MEFGALLY... these are MEF, EFG, FGA, GAL, etc.).
• Using BLOSUM62, for each of these words the words
  that give a score above a certain threshold
  (the neighborhood word score threshold) are identified
  (about 50 new words for each original word).
• Each sequence in the database is then scanned for
  exact matches with each of the 50 words for each
  position in the query sequence.
• The hits are then extended until the score begins to
  fall. The result is a longer aligned
  sequence stretch called an HSP (high-scoring segment pair).
• HSPs with suitable placement are joined.
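The first four steps can be sketched as below. This is a simplified illustration rather than BLAST's implementation: a toy scoring function (+5 match, -1 mismatch) stands in for BLOSUM62, all function names and the drop-off value are assumptions, and the final joining of HSPs is not shown.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_score(a, b):
    """Stand-in for BLOSUM62 (assumption): +5 for a match, -1 for a mismatch."""
    return 5 if a == b else -1


def query_words(query, w=3):
    """All overlapping words of length w in the query."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


def neighborhood(word, threshold, score=toy_score):
    """All words of the same length scoring at least `threshold` against `word`."""
    return ["".join(c) for c in product(AMINO_ACIDS, repeat=len(word))
            if sum(score(a, b) for a, b in zip(word, c)) >= threshold]


def find_seeds(query, subject, threshold, w=3):
    """Return 0-based (query_pos, subject_pos) pairs for every neighborhood-word hit."""
    seeds = []
    for q_pos, word in enumerate(query_words(query, w)):
        for neighbor in neighborhood(word, threshold):
            start = subject.find(neighbor)
            while start != -1:
                seeds.append((q_pos, start))
                start = subject.find(neighbor, start + 1)
    return seeds


def _extend_one_way(query, subject, q, s, step, drop, score):
    """Walk from (q, s) in direction `step`; return (best gain, length at best)."""
    best = total = 0
    best_len = length = 0
    while 0 <= q < len(query) and 0 <= s < len(subject):
        total += score(query[q], subject[s])
        length += 1
        if total > best:
            best, best_len = total, length
        elif best - total > drop:
            break  # score has fallen too far below the best seen so far
        q += step
        s += step
    return best, best_len


def extend_seed(query, subject, q_pos, s_pos, w=3, drop=5, score=toy_score):
    """Ungapped extension of a seed in both directions; returns (score, query segment)."""
    seed = sum(score(query[q_pos + k], subject[s_pos + k]) for k in range(w))
    right, r_len = _extend_one_way(query, subject, q_pos + w, s_pos + w, 1, drop, score)
    left, l_len = _extend_one_way(query, subject, q_pos - 1, s_pos - 1, -1, drop, score)
    return seed + left + right, query[q_pos - l_len:q_pos + w + r_len]


query = "MEFGALLY"
subject = "KKMEFGALLYKK"
seeds = find_seeds(query, subject, threshold=15)
hsp_score, hsp_segment = extend_seed(query, subject, 2, 4)
```

Raising the threshold shrinks the neighborhood and speeds up the scan; lowering it admits more words and catches weaker similarities, at the cost of more seeds to extend.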
From: G. J. Barton: Protein Sequence Alignment and Database Scanning, in Protein Structure Prediction - A Practical Approach, edited by M. J. E. Sternberg, IRL Press at Oxford University Press, 1996
BLAST results
BLAST results, continued
Variants of BLAST

• blastp compares an amino acid query sequence against a
  protein sequence database
• blastn compares a nucleotide query sequence against a
  nucleotide sequence database
• blastx compares a nucleotide query sequence translated in all
  reading frames against a protein sequence database
• tblastn compares a protein query sequence against a
  nucleotide sequence database dynamically translated in all
  reading frames
• tblastx compares the six-frame translations of a nucleotide
  query sequence against the six-frame translations of a
  nucleotide sequence database. Please note that tblastx is
  extremely slow and CPU-intensive.
• PSI-BLAST (Position-Specific Iterated BLAST) uses an iterative
  search in which sequences found in one round of searching
  are used to build a score model for the next round of
  searching. Highly conserved positions receive high scores and
  weakly conserved positions receive scores near zero. The
  profile is used to perform a second (etc.) BLAST search, and
  the results of each "iteration" are used to refine the profile. This
  iterative searching strategy results in increased sensitivity.
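The "translated in all reading frames" step used by blastx, tblastn and tblastx can be sketched as a six-frame translation. This is an illustrative sketch assuming the standard genetic code; the function names and frame labels (+1 to +3, -1 to -3) are choices made here, not BLAST's internals.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Complement each base, then reverse the strand."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons; unknown codons become 'X', stops become '*'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Return the three forward and three reverse translations of a DNA string."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frames("ATGGCCTAA")
```

tblastx translates both the query and every database sequence this way, so the number of protein comparisons grows sixfold on each side, which helps explain why it is so slow.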
The human genome
Horizontal gene transfer?
Probable vertebrate-specific
acquisition of bacterial genes
But no...
But no, continued
Phylogenetic analysis
What went wrong?
"A different methodological reason for several of the
genes in the human genome report being considered
as bacteria-vertebrate HGTs, was that phylogenetics
was not the analytical approach, and that the
conclusions were instead derived largely from top
BLAST hit results. In several instances the top BLAST
hit was indeed a bacterial species, whereas further
down the list of significant BLAST hits one finds a
nonvertebrate eukaryote. When such sequences were
properly aligned, the resulting phylogenetic trees often
supported the monophyly of eukaryotes with the
nonvertebrate eukaryote at the base."
ClustalW alignment
Human     : ------------------------------------------------------------- :
Termotoga : --------------------------------------MMSGHNKWANIKHRKMAQDAKKS : 23
C.elegans : MFSPLRRLTTTGLQLQKLQKLQKLQQFQPARAVHLTVFQQKGHSKWQNIKAVKGKNDLIRS : 61

Human     : ------------------------------------------------------------- :
Termotoga : KIFTKLIREIIVAAREGGGNIETNPRLRAAVERARAENMPKENIERAIKRGTGELEGVDYQ : 84
C.elegans : KATNFLLRKVRGAVSRGGFDMKLNRELADLESEFRAQGLPLDTLKNFLQKMKDKPE----V : 118

Human     : ------------------MNKNGGVMAVGARHSFDKKG-VIVVEVEDR-----EKKAVNLE : 37
Termotoga : EVIYEGYAPGGVAVYIRALTDNKNRTAQELRHLFNKYG-GSLAESGSVSWIFERKGVIEIS : 144
C.elegans : EYSFDIIGPSGIFLIVTAETSNKKAFENDLRKYFNKLGGFRLAADGGVRSWFEEKGVVHVD : 179

Human     : R---ALEMAIEAGAEDVKETEDEEER-------NVFKFICDASSLHQVRKKLDSLGLCSVS : 88
Termotoga : R---DKVKDLEELMMIAIDAGAEDIKDAE----DPIQIITAPENLSEVKSKLEEAG-YEVE : 197
C.elegans : TKKGGKILNIEEMEEIGLEFDAEEVLLIEEDSTKKFELICDAKSLQTLENGLGKGGFSILQ : 240

Human     : CALEFIPNSKVQLAEPDLEQAAHLIQALSNHEDVIHVYDNIE--------------- : 130
Termotoga : AKVTFIPKNTVKVTGKDAEKVLEFLNALEDMDDVQEVYSNFEMDDKEMEEILSRLEG : 254
C.elegans : SEIEFRPVHPIDCPEAEEPKVQKLYEMLQEDEQVRQIFDNITPDE------------ : 285
The conclusion
"Most of our analyses and phylogenetic topologies are
highly consistent with the view that vertebrates and
bacteria share these loci through common ancestry,
involving a succession of non-vertebrate eukaryote
intermediates. A further point arising from our analysis
is that the evolutionary relationships among proteins
cannot be concluded solely from the ranking of
database hits in homology searches (for example,
BLAST reports). This is not a new conceptual point
(see refs 7, 12, 13), but one that seems to have been
overlooked in this instance. Phylogenetic analysis must
be a central component of any protein family or
genome annotation effort. Importantly, phylogenetic
reconstruction is critical to synthesizing, from the
growing wealth of sequence data, a more