Protein diversity and scoring matrices
Oct. 10, 2012
Quiz 1: today
Learning objectives: general concepts of the molecular basis of evolution; the development of scoring matrices.
Homework assignment: do problems 1-5 in Chapter 4.
Evolutionary Basis of Sequence Alignment
Why are there regions of identity when comparing protein sequences?
1) Conserved function: amino acid residues participate in the reaction.
2) Structural: for example, conserved cysteine residues that form a disulfide linkage.
3) Historical: residues that are conserved solely due to a common ancestor gene.
library.thinkquest.org/19012/treeolif.htm
Conserved regions within a protein.
Molecular evolution and cancer
DNA damage
Mutation
Positive selection
Negative selection (purifying selection)
Neutral mutations
Mutations in p53 gene in cancers
Duplicated domains
Retrovirus: viral DNA integrated into the host cell genome (5' LTR, 3' LTR)
Identity Matrix
     A  C  I  L
A    1  0  0  0
C    0  1  0  0
I    0  0  1  0
L    0  0  0  1
Simplest type of scoring matrix
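A minimal sketch of identity-matrix scoring, assuming two ungapped sequences of equal length. The function names `identity_score` and `score_ungapped` are illustrative and not from the lecture.

```python
def identity_score(a: str, b: str) -> int:
    """Identity matrix lookup: 1 if the residues are identical, otherwise 0."""
    return 1 if a == b else 0


def score_ungapped(seq1: str, seq2: str) -> int:
    """Sum identity scores over two equal-length sequences without gaps."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have the same length")
    return sum(identity_score(a, b) for a, b in zip(seq1, seq2))


# A, C and the final L match; isoleucine (I) against leucine (L) scores 0.
print(score_ungapped("ACIL", "ACLL"))
```

Under this matrix the I/L pair contributes nothing, even though the two residues are chemically similar.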
Similarity
It is easy to score if an amino acid is identical to another (the
score is 1 if identical and 0 if not). However, it is not easy to
give a score for amino acids that are somewhat similar.
[Figure: structures of isoleucine and leucine]
Should they get a 0 (non-identical), a 1 (identical), or something in between?
One is mouse trypsin and the other is crayfish trypsin.
They are homologous proteins. The sequences share 41% identity.
BLOSUM62 scoring matrix
BLOSUM Scoring Matrices
Which BLOSUM Matrix to use?
BLOSUM    Identity (up to)
80        80%
62        62% (usually default value)
35        35%
If you are comparing sequences that are very similar, use
BLOSUM 80. Sequences that are more divergent (dissimilar)
than 20% are given very low scores in this matrix.
Which Scoring Matrix to use?
PAM-1, BLOSUM-100: small evolutionary distance; high identity within short sequences
PAM-250, BLOSUM-20: large evolutionary distance; low identity within long sequences
P53 protein binds DNA
5'-PuPuPuC(A/T)(T/A)GPyPyPy-(0-13 nucleotides)-PuPuPuC(A/T)(T/A)GPyPyPy-3'
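The consensus above can be written as a regular expression, taking Pu as a purine (A or G) and Py as a pyrimidine (C or T). This is a sketch only: the helper name `find_p53_sites` and the test sequence are hypothetical, and only the given strand is searched.

```python
import re

# One half-site: PuPuPuC(A/T)(T/A)GPyPyPy
HALF_SITE = r"[AG]{3}C[AT][AT]G[CT]{3}"
# Two half-sites separated by a spacer of 0 to 13 nucleotides.
P53_SITE = re.compile(HALF_SITE + r"[ACGT]{0,13}" + HALF_SITE)


def find_p53_sites(dna: str):
    """Return (0-based start, matched text) for each non-overlapping match."""
    return [(m.start(), m.group()) for m in P53_SITE.finditer(dna.upper())]


# Hypothetical sequence: two half-sites with a 3-nucleotide spacer.
example = "TT" + "GGGCATGTCC" + "ACG" + "AGACTAGCTT" + "AA"
print(find_p53_sites(example))
```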
Human p53 and placozoa p53

human       1 MEEPQSDPSVEPPLSQETF----SDLWKLL-------------PENNVLS 33
                     .|.||.|||.:|    |..|:|:             .|...:.
placozoa    1 -------MSDEPTLSQLSFSQELSSSWQLMIDEITQGKFNTNEDEGTAIY 43

human      34 PLPSQAMDDLMLSPDDIEQWFTEDPGPDEAPRMP-EAAPPVAPAP----- 77
              ....|..||..|...:..|:.:......:..::| |.|....|:|
placozoa   44 SYSEQNPDDRYLMRPNEPQYISAGYPDGQVGQLPREFAVNQIPSPRTFSD 93

human      78 ---------------AAPTPAAPAPAPSWPLSSSVPSQKTYQGSYGFRLG 112
                             .|....:...:|......|:||...|.|::||.:.
placozoa   94 NVSSSADKAREAYYGQAVNGVSAETSPPLKRDPSLPSNAEYIGNFGFDIA 143

human     113 F-LHSGTAKSVTCTYSPALNKMFCQLAKTCPV------------------ 143
              . .:....|:...|||..|.|:|.::....|:
placozoa  144 IDQNDNPTKATNNTYSTMLKKLFIKMECLFPIHITIERMDYTFKIAYGSL 193

human     144 -------QLWVDSTPPPGTRVRAMAIYKQSQHMTEVVRRCPHHERCSDSD 186
                     ||.:...||..:.:||..:|.:.|.:.|.|||||:| ...|..
placozoa  194 ATRRNCNQLIIPGEPPANSYIRAYVMYTKPQDVYEPVRRCPNH-ALRDQG 242

human     187 GLAPPQHLIRVEGNLRVEYLDDRNTFRHSVVVPYEPPEVGSDCTTIHYNY 236
              ......|::|.|.. |.||.:| .:.||||.|||..|.||...:|:.|.:
placozoa  243 KYESSDHILRCESQ-RAEYYED-TSGRHSVRVPYTAPAVGELRSTLLYQF 290

human     237 MCNSSCMGGMNRRPILTIITLEDSSGNLLGRNSFEVRVCACPGRDRRTEE 286
              ||.|||.|.:|||||..:||||:.: |:|||...||||||||||| |:.|
placozoa  291 MCFSSCSGSINRRPIELVITLENGT-NVLGRKKVEVRVCACPGRD-RSNE 338

human     287 ENLRKKGEPHHELPP-------------------GSTKRALPNNTSSSPQ 317
              |....|.|..|:.||                   ..:||.:..:..|:.
placozoa  339 ERAAMKSEKEHKQPPNKKLKTSKTVSREVTGVISNESKRIMKRSVESTS- 387

human     318 PKKKPLDGEYFTLQIRGRERFEMFRELNEALE----LKDAQAG--KEPGG 361
                    :.:.||:.:|||:.:|:..:::|:||    |.|||..  |..|.
placozoa  388 ------NDDVFTITVRGRKNYEILAKMSESLEVLDKLSDAQINEIKSHGT 431

human     362 -----SRAHSSHLKSKKGQSTSRHKKLMFKTEGPDSD------------- 393
                   .|.::..|..::.::....:..:...|..|..
placozoa  432 LTAPLERTNTEELVRRQSRNLDTLQNAVTTKENSDGADLNLSISRWLSNI 481

human     394 -------------------------------------------------- 393
placozoa  482 NMEKYTQEFIKHGFKVCGHLANVSYSDMKKIIKNMEDCKKISAYLLESNF 531

human     394 --------------------------------------------- 393
placozoa  532 SSGNEEDIPCSQIGNSFRASQMSMNSTASQELDITRFTLRQTITL 576
Scoring Matrices
April 23, 2009
Learning objectives:
Last word on global alignment.
Understand how the Smith-Waterman algorithm can be applied to perform local alignment.
Have a general understanding of PAM and BLOSUM scoring matrices.
Homework 3 and 4 due today
Quiz 1 today
Writing topic due today
Homework 5 due Thursday, April 30.
Global Alignment
output file
Global: HBA_HUMAN vs HBB_HUMAN
Score: 290.50

HBA_HUMAN    1 .VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFP 44
                |:| :|: | | ||||  :  | | ||| |: : :| |: :|
HBB_HUMAN    1 VHLTPEEKSAVTALWGKV..NVDEVGGEALGRLLVVYPWTQRFFE 43

HBA_HUMAN   45 HF.DLS.....HGSAQVKGHGKKVADALTNAVAHVDDMPNALSAL 83
                | |||      |: :|| |||||  | :: :||:|::   :  |
HBB_HUMAN   44 SFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATL 88

HBA_HUMAN   84 SDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKF 128
               |:||  || ||| ||:|| : |:  || |   |||| | |:  |
HBB_HUMAN   89 SELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKV 133

HBA_HUMAN  129 LASVSTVLTSKYR 141
               :| |:  |  ||
HBB_HUMAN  134 VAGVANALAHKYH 146

%id = 45.32
%similarity = 63.31 (88/139 *100)
Overall %id = 43.15; Overall %similarity = 60.27 (88/146 *100)
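The percentages in this output can be checked with a short sketch. It assumes that `|` marks an identical pair and `:` a similar pair in the match line, that 139 is the number of aligned columns without gaps, and that 146 is the length of the longer sequence. The function names are illustrative.

```python
def count_matches(match_line: str):
    """Return (identical, similar) counts; similar includes identical pairs."""
    identical = match_line.count("|")
    similar = identical + match_line.count(":")
    return identical, similar


def percent(count: int, total: int) -> float:
    """Percentage rounded to two decimals, as in the output file."""
    return round(100.0 * count / total, 2)


# Match line of the final block (LASVSTVLTSKYR over VAGVANALAHKYH),
# written with one character per alignment column.
print(count_matches(":| |:  |  ||"))

# Figures reported for the whole alignment: 88 similar pairs.
print(percent(88, 139))  # %similarity over aligned columns
print(percent(88, 146))  # overall %similarity
```

The reported %id values (45.32 and 43.15) correspond to 63 identical pairs over the same two denominators.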
Smith-Waterman Algorithm
Advances in Applied Mathematics, 2:482-489 (1981)
Smith-Waterman algorithm: can be used for local alignment
-Memory intensive
-Common search programs such as BLAST use faster heuristics that approximate the SW algorithm
Smith-Waterman (cont. 1)
a. Initializes the edges of the matrix with zeros.
b. Searches for sequence matches.
c. Assigns a score to each pair of amino acids:
-uses similarity scores
-uses positive scores for related residues
-uses negative scores for substitutions and gaps
d. Scores are summed for placement into M(i,j). If any sum is below 0, a 0 is placed into M(i,j).
e. Backtrace begins at the maximum value found anywhere in the matrix.
f. Backtrace continues until it meets an M(i,j) value of 0.
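Steps a through f can be sketched in Python as follows. This is an illustration under stated assumptions, not the lecture's implementation: the scoring function `simple_score` (+5 for a match, -4 for a mismatch) is a stand-in for a real similarity matrix such as BLOSUM, and the gap penalty is linear.

```python
def simple_score(a: str, b: str) -> int:
    """Assumed stand-in for a similarity matrix: +5 match, -4 mismatch."""
    return 5 if a == b else -4


def smith_waterman(seq1: str, seq2: str, score=simple_score, gap: int = -8):
    """Local alignment. Returns (best score, aligned seq1, aligned seq2)."""
    rows, cols = len(seq1) + 1, len(seq2) + 1
    # Step a: the first row and first column stay at zero.
    M = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = M[i - 1][j - 1] + score(seq1[i - 1], seq2[j - 1])
            up = M[i - 1][j] + gap      # gap in seq2
            left = M[i][j - 1] + gap    # gap in seq1
            # Step d: sums below zero are replaced by zero.
            M[i][j] = max(0, diag, up, left)
            if M[i][j] > best:
                best, best_pos = M[i][j], (i, j)
    # Steps e and f: backtrace from the maximum until a zero is reached.
    i, j = best_pos
    out1, out2 = [], []
    while i > 0 and j > 0 and M[i][j] > 0:
        if M[i][j] == M[i - 1][j - 1] + score(seq1[i - 1], seq2[j - 1]):
            out1.append(seq1[i - 1]); out2.append(seq2[j - 1])
            i, j = i - 1, j - 1
        elif M[i][j] == M[i - 1][j] + gap:
            out1.append(seq1[i - 1]); out2.append("-")
            i -= 1
        else:
            out1.append("-"); out2.append(seq2[j - 1])
            j -= 1
    return best, "".join(reversed(out1)), "".join(reversed(out2))


print(smith_waterman("AWGHE", "AWHE"))
```

With the assumed scores the sum is 5 + 5 - 8 + 5 + 5 = 12; a real substitution matrix would give different values for the same alignment.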
Smith-Waterman (cont. 2)
          H   E   A   G   A   W   G   H   E   E
      0   0   0   0   0   0   0   0   0   0   0
P     0   0   0   0   0   0   0   0   0   0   0
A     0   0   0   5   0   5   0   0   0   0   0
W     0   0   0   0   3   0  20  12   4   0   0
H     0  10   2   0   0   1  12  18  22  14   6
E     0   2  16   8   0   0   4  10  18  28  20
A     0   0   8  21  13   5   0   4  10  20  27
E     0   0   6  13  19  12   4   0   4  16  26
Put zeros on the top row and left column. Assign initial scores based on a scoring matrix. Calculate new scores based on adjacent cell scores. If a sum is less than or equal to zero, begin new scoring with the next cell.
This example uses the BLOSUM45 Scoring Matrix with a gap penalty of -8.
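The fill rule can be written as a recurrence. The notation is an assumption made for illustration: s(a_i, b_j) is the scoring-matrix entry for the residue pair and g is the gap penalty. The worked cell is the W/W cell of the example, using the W/W score of 15 and the value 5 in the diagonally adjacent A/A cell.

```latex
M_{i,j} = \max\bigl\{\, 0,\;
    M_{i-1,j-1} + s(a_i, b_j),\;
    M_{i-1,j} + g,\;
    M_{i,j-1} + g \,\bigr\},
\qquad g = -8

% Worked cell (W aligned with W):
M_{W,W} = \max\{\, 0,\; 5 + 15,\; 0 - 8,\; 0 - 8 \,\} = 20
```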
Smith-Waterman (cont. 3)
AWGHE
|| ||
AW-HE
Score = 28
Begin the backtrace at the maximum value found anywhere in the matrix. Continue the backtrace until the score falls to zero.
Calculation of similarity score and percent similarity
          A    W    G    H    E
          A    W    -    H    E
Score     5   15   -8   10    6
Blosum45 scores: 5, 15, 10, 6. Gap penalty (novel): -8.
% similarity = (number of positive scores / number of AAs in region) x 100
% similarity = 4/5 x 100 = 80%
Similarity score = 28
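The same arithmetic as a short sketch, using the five column scores from the slide. The function name `summarize_region` is illustrative.

```python
def summarize_region(column_scores):
    """Return (similarity score, percent similarity) for an aligned region."""
    similarity_score = sum(column_scores)
    positive = sum(1 for s in column_scores if s > 0)
    percent_similarity = 100 * positive / len(column_scores)
    return similarity_score, percent_similarity


# A/A, W/W, gap, H/H, E/E
print(summarize_region([5, 15, -8, 10, 6]))  # (28, 80.0)
```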
Why search sequence databases?
1. I have just sequenced something. What is
known about the thing I sequenced?
2. I have a unique sequence. Does it have
similarity to another gene of known
function?
3. I found a new protein sequence in a lower
organism. Is it similar to a protein from
another species?
Perfect searches for similar
sequences in a database
First “hit” should be an exact match.
Next “hits” should contain all of the
genes that are related to your gene
(homologs).
Next “hits” should be similar but are
not homologs
How does one achieve the
“perfect search”?
Consider the following:
Scoring Matrices (PAM vs. BLOSUM)
Local alignment algorithm
Database
Search Parameters:
-Expect Value: change threshold for score reporting
-Translation: of DNA sequence into protein
-Filtering: remove repeat sequences