BioInformatics - Protein Structure Prediction
Rajalingam Aravinthan
Gad Abraham
Summer Studentship (2003/2004), under the supervision of
Professor Heiko Schroder, Dr Margaret Hamilton and Dr Ron Van Schyndel
with help from Abdullah Amin and Saravanan Dayalan.
Aim of this presentation:
1. About the project
2. The Program
3. To analyze what we have done so far and where we should
concentrate in our project.
The Project Specifications
This research project attempts to predict protein structures in
linear time. Our method is based on statistical analysis of known
structures. We compare a given sequence of amino acids of unknown
structure with sequences from a database of known structures.
(any examples ?)
In particular, we look for perfect matches of short sub-sequences
(maximal length 5) from a given protein sequence.
With this approach we cannot go beyond length 5, because the
expected number of matches in the existing database then falls
below 1. We would like to complement and compare our method
by looking at "less than perfect matches" of sub-sequences
considerably longer than 5.
To do this it is necessary to implement the Smith-Waterman
algorithm (a well-known algorithm used in the area of
homology modeling). This algorithm is then run against the
complete database of known protein structures.
From this project we expect an answer to the question of how
perfect matches of very short sub-sequences compare with "less
than perfect matches" of significantly longer sequences.
E.g.: the phi and psi angles around glycine in perfect matches
with ‘VGI’, versus less than perfect matches with ‘AVGID’. In the
latter case, if the database contains, say, ‘ALGID’, that
sub-sequence is also considered for analysis.
The PDB file
The pdb file for this project was created by Mr. Saravanan as part of
his PhD research. It contains all the proteins in a systematic
format. It was very useful to our project, because the angles and
other data are easy to extract from this text database.
‘pdb’ file
<Serial Number>  <Protein>  <Amino Acid>  <Phi>        <Psi>
1                1BA1       GLY           0.000000     -61.200000
2                1BA1       PRO           -63.900000   158.000000
3                1BA1       ALA           -86.100000   153.400000
4                1BA1       VAL           -118.500000  163.900000
...
377              1BA1       LEU           -78.600000   -9.100000
378              1BA1       SER           -82.900000   0.000000
379              1BA2       LYS           0.000000     144.100000
380              1BA2       ASP           -71.700000   163.600000
...
10757863         1R1A       ASP           21.000000    0.000000
[Figures: 1BA1 and 1B50]
The Program
1. Take any protein from the pdb file “phi-psi” and create a
protein file that will be analyzed by the main program
“Align.java”. Ex: say we are analyzing ‘1BA1’.
2. The protein ‘1BA1’ is a chain of 378 amino acids. The
program ‘Align.java’ can take any number of odd-length
sub-sequences from ‘1BA1’ and align each of them against the
whole pdb file.
3. Say we are considering a length of 5. Then there are
374 windows of length-5 sub-sequences of amino acids.
[Figure: sliding windows of sub-sequences in a chain of amino acids.
Window 1 is the first 5 amino acids out of the 378; Window 3 is also marked.]
S K G P A V G I D L G T T Y S G V Q H G K V ...
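The window count in step 3 can be checked with a short sketch. It is written in Python for brevity and is not the authors' Align.java; the function name `sliding_windows` and the stand-in chain are hypothetical.

```python
def sliding_windows(chain, length):
    """Return every contiguous sub-sequence of the given odd length."""
    if length % 2 == 0:
        raise ValueError("window length must be odd so a middle residue exists")
    return [chain[i:i + length] for i in range(len(chain) - length + 1)]


# Stand-in chain of 378 residues (the real 1BA1 sequence is not reproduced here).
chain = "A" * 378
print(len(sliding_windows(chain, 5)))  # 378 - 5 + 1 windows
```

A chain of n residues yields n - L + 1 windows of length L, which gives 374 for n = 378 and L = 5.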
4. For each sub-sequence (window above) we do the alignment
against all the proteins in the pdb file. There are around
17,388 proteins in the database. Alignment is done
separately for each one of those 17,388 proteins. (Note 1)
5. Alignment creates an alignment matrix. We use a score
matrix to fill in its values, e.g. ‘blosum62’ or
‘pam250’.
6. Our program is a modified version of the Smith-Waterman
algorithm.
1. When two amino acid codes match, we give the
score from the score matrix being used.
2. When there is a mismatch, we do not apply any gap cost; instead
we apply the score from the matrix for that pair of codes.
3. Each cell’s score is the sum of its score from the matrix and the
upper diagonal neighbor's score, instead of the maximum over the
three adjacent cells. Negative values are kept.
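The three rules above can be sketched as follows. This is a minimal Python illustration, not the authors' Align.java: the names `BLOSUM62_EXCERPT`, `pair_score` and `diagonal_matrix` are hypothetical, and the score table is only an excerpt of BLOSUM62 covering the residues used in the example.

```python
# Excerpt of BLOSUM62 for the residues A, V, G, I, D, L only (assumption:
# a real run would load the full blosum62 or pam250 table).
BLOSUM62_EXCERPT = {
    ("A", "A"): 4, ("V", "V"): 4, ("G", "G"): 6,
    ("I", "I"): 4, ("D", "D"): 6, ("L", "L"): 4,
    ("A", "V"): 0, ("A", "G"): 0, ("A", "I"): -1, ("A", "D"): -2, ("A", "L"): -1,
    ("V", "G"): -3, ("V", "I"): 3, ("V", "D"): -3, ("V", "L"): 1,
    ("G", "I"): -4, ("G", "D"): -1, ("G", "L"): -4,
    ("I", "D"): -3, ("I", "L"): 2, ("D", "L"): -4,
}


def pair_score(a, b):
    """Look up a symmetric substitution score; raises KeyError if absent."""
    if (a, b) in BLOSUM62_EXCERPT:
        return BLOSUM62_EXCERPT[(a, b)]
    return BLOSUM62_EXCERPT[(b, a)]


def diagonal_matrix(window, protein):
    """Cell = substitution score + upper-left neighbour; no gaps, no zero floor."""
    matrix = []
    for i, w in enumerate(window):
        row = []
        for j, p in enumerate(protein):
            upper_left = matrix[i - 1][j - 1] if i > 0 and j > 0 else 0
            row.append(upper_left + pair_score(w, p))
        matrix.append(row)
    return matrix


matrix = diagonal_matrix("AVGID", "GAVGLD")
print(matrix[-1])  # bottom row; the last cell scores AVGID against AVGLD
```

Because only the diagonal neighbour contributes, each bottom-row cell is simply the ungapped score of the window against the protein segment that ends in that column.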
An example alignment for sequence AVGID with protein ‘1BA1’:
…….
……..
……..
Where
‘10d-test’ is the file with sub-sequence ‘AVGID’
‘1BA1’ is the pdb file (it could be one or many proteins)
‘blosum62’ is the score matrix used
5 is the length of sub-sequence
0.2 is the penalty as a percentage
Say ‘AVGID’ has a maximum score of 24 (that is, when
matched with another ‘AVGID’, as the score matrix has its highest
scores on the diagonal). Then the penalty for this window is
20% of 24, or approx. 5.
Please see ‘Back Tracking’ for the importance of this penalty
value.
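The penalty arithmetic can be sketched as below. The function name `target_score` is hypothetical, and the rounding rule (round the penalty up to a whole score point) is an assumption inferred from the slide figures "20% of 24 or approx 5" and a target of 21 at a penalty of 0.1; the actual program may round differently.

```python
import math
from fractions import Fraction


def target_score(max_score, penalty):
    """Return max_score minus the penalty, with the penalty rounded up.

    penalty is a fraction of max_score, e.g. 0.2 for 20%.
    Fraction(str(...)) avoids binary floating-point surprises such as
    24 * 0.1 not being exactly 2.4.
    """
    penalty_points = math.ceil(max_score * Fraction(str(penalty)))
    return max_score - penalty_points


print(target_score(24, 0.2))  # AVGID with a 20% penalty
print(target_score(24, 0.1))  # AVGID with a 10% penalty
```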
[Figure: the Blosum 62 score matrix]
7. Back Tracking:
1. In this modified version of the SW algorithm we only backtrack
from the bottom row. That way the matched pdb sub-sequence
has the same length as the current window.
2. We ignore gaps altogether, so there is only diagonal trace
back.
3. Trace back is done when the score is >= the target score
(Max. score – penalty = 24 – 20% of 24 = 24 – 5 = 19). A sample
trace back is shown in green in the alignment matrix.
4. We get the pdb sub-sequences and find the phi and psi angles
for the middle amino acid. We keep count of the phi and psi angles
separately, in 1/10 degree or 10 degree intervals as
appropriate.
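Since trace back is purely diagonal and starts from the bottom row, a qualifying cell identifies its sub-sequence directly. The sketch below illustrates this under an assumed layout in which bottom-row column j (0-based) corresponds to protein residue j; the name `middle_residue_hits` and the sample row are hypothetical.

```python
def middle_residue_hits(bottom_row, window_len, target):
    """Return protein indices of the middle residue of each qualifying match.

    A bottom-row cell in column j closes the segment j-window_len+1 .. j,
    so its middle residue sits at j - window_len // 2. Columns to the left
    of window_len - 1 cannot hold a full-length diagonal and are skipped.
    """
    hits = []
    for j, score in enumerate(bottom_row):
        if j >= window_len - 1 and score >= target:
            hits.append(j - window_len // 2)
    return hits


# Made-up bottom row for a window of length 5 and a target score of 19.
print(middle_residue_hits([6, 2, -1, 0, 24, 22, 3], 5, 19))
```

The phi and psi values stored for the returned indices are the ones that would then be counted.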
8. Draw the final angle counts using jGraph.
Example in 10 degree intervals for AVGID with the whole pdb.
The points were taken as the middle value of the range, except for
zero, –180.00 and +180.00, which were unchanged.
Ex: 0-10 = bin 5
40-50 = bin 45
180.00 = bin 180.0
0.0 = bin 0.0
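The binning rule can be sketched as follows. The name `bin_centre` is hypothetical, and the treatment of angles that fall exactly on a bin boundary (assigned to the bin above) is an assumption; the slide does not state it.

```python
import math


def bin_centre(angle, width=10.0):
    """Map an angle to the midpoint of its bin, leaving 0 and +/-180 unchanged."""
    if angle == 0.0 or abs(angle) == 180.0:
        return angle
    return (math.floor(angle / width) + 0.5) * width


for angle in (3.2, 47.0, -61.2, 180.0, 0.0):
    print(angle, "->", bin_centre(angle))
```

For 1/10 degree intervals the same function can be called with `width=0.1`, although floating-point rounding then makes the midpoints inexact.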
9. The program outputs “phi” and “psi” files, which contain the
graph points for the actual and predicted angles. By predicted
angle we mean the angle with the highest frequency of
occurrence for a given sub-sequence.
We can use jGraph to plot these points.
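Taking the most frequent bin as the predicted angle can be sketched as below. The name `predicted_angle` and the sample values are hypothetical; how the real program breaks ties between equally frequent bins is not stated, so the sample avoids ties.

```python
from collections import Counter


def predicted_angle(binned_angles):
    """Return the bin value that occurs most often in the matches."""
    counts = Counter(binned_angles)
    return counts.most_common(1)[0][0]


# Made-up binned phi values for the middle residue of the matched sub-sequences.
print(predicted_angle([-65.0, -65.0, -55.0, 145.0]))
```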
10. This simple way of finding the predicted angle is not accurate.
How can we say with (some) certainty that this will be the angle
for every occurrence of the sub-sequence? We need a more
sophisticated way of finding the peaks.
What we have done so far
We ran tests on sequence lengths 3, 5, 7 and 9 with penalties
from 0.1 to 0.5 and produced graphs in 1/10 degree
intervals.
[Figure: Sequence 3: VGI; penalty = 0.0; 1/10 degree interval. Panels: phi, psi]
Why Penalty?
The ‘pdb’ has about 11 million amino acids, drawn from the 20 known
amino acids.
When we match length-3 sub-sequences, the probability of finding a
particular one at a given position is 1/(20^3) = 1/8000 (assuming all
amino acids are equally likely).
So we can expect 10757863/8000, or around 1344, occurrences of that
particular sub-sequence. That is a good number
for statistics.
But in this project we are considering lengths of 5 or more!
For an exact match of length 5 the probability is 1/(20^5) = 1/3.2 million,
so we can expect only about 3 occurrences throughout the entire
database. This is not enough for statistical analysis.
The penalty gives the freedom to match not only exact sequences but also
close enough sequences, so that their results can be included in the
analysis that determines the phi and psi angles.
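The expected-count estimate can be reproduced with a few lines. This sketch uses the slide's own simplification that all 20 amino acids are equally likely and independent; the function name `expected_exact_matches` is hypothetical.

```python
DB_RESIDUES = 10757863  # number of residue rows in the 'pdb' file


def expected_exact_matches(db_residues, length, alphabet=20):
    """Expected number of exact occurrences of one specific sub-sequence."""
    return db_residues / alphabet ** length


for length in (3, 5, 6):
    print(length, expected_exact_matches(DB_RESIDUES, length))
```

At length 6 the estimate drops below 1, which is why exact matching stops at length 5.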
Penalty Comparisons
[Figures: phi and psi distributions for Sequence 5: AVGID, at Penalty = 0.1 and at Penalty = 0.5]
For the sequence AVGID the max score is 24. When the penalty is
0.1 the target score is 21, so any sub-sequence from the pdb
scoring 21 or more is considered. But the narrow penalty margin
of 3 allows only certain elements to be swapped,
depending on the chosen matrix. It depends on two factors:
1) The element being swapped out and its score when matched against
itself. Ex: I-I = 4
2) The element coming in and its score when matched against the
element being swapped out. Ex: I-V = 3
For the sub-sequence to be considered, max score - (1) + (2) >= 21
must be true.
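The single-swap condition can be written out directly. The names `swap_allowed` and `min_incoming_score` are hypothetical; the scores in the example calls are the blosum62 values quoted in this presentation.

```python
def swap_allowed(max_score, self_score, incoming_score, target):
    """Check max score - (1) + (2) >= target for one substituted position."""
    return max_score - self_score + incoming_score >= target


def min_incoming_score(max_score, self_score, target):
    """Smallest score (2) the incoming element needs against the outgoing one."""
    return target - max_score + self_score


# AVGID: max score 24, target 21 at penalty 0.1.
print(swap_allowed(24, 4, 3, 21))     # I replaced by V: I-I = 4, I-V = 3
print(min_incoming_score(24, 4, 21))  # outgoing element with self-score 4
print(min_incoming_score(24, 5, 21))  # outgoing element with self-score 5
```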
AVGID  Searched sequence
:.:.:
AIGVD  Protein sequence

AVGID  Searched sequence
:::.:
AVGLD  Protein sequence
Both give Alignment Score = 22
This shows that when the penalty is low (0.1) the matrix allows
only certain elements to be swapped. Here, according to
blosum62, only the pairs marked with pink
squares (positive values) can be swapped, such as V-I, I-V and I-L.
More precisely, when the penalty is 0.1: if ‘A’ is swapped
out (A-A = 4 according to blosum62), then any element scoring
>= +1 with ‘A’ can take its place; but for
‘R’ (R-R = 5) the incoming element must score >= +2 with ‘R’ to
keep the target score >= 21.
So increasing the penalty increases not only the
chance of elements being swapped but also the range of elements
that can be swapped.
Reasons for penalty as a percentage
Assume we have two sub-sequences in a protein.
1. ‘AVGID’. The max. score is 24 according to blosum62.
2. ‘PCHWT’. The max. score is 40 according to blosum62.
If the penalty is a fixed number, say 3, then the target score is 21
for seq1 and 37 for seq2. A lower max. score means the sequence contains
elements with lower self-scores, so with a penalty of 3 an element can
easily be substituted by another. But for seq2, substituting
even the lowest-scoring element (T-T = 5) costs 5, so we have to
find an incoming element with a score of at least 2 with ‘T’!
According to blosum62 no such element exists, so no substitution is
possible.
That is why we take a percentage of the max. score. With a 0.1 penalty
for seq2, the penalty is 10% of 40 = 4.
So any swap that loses a score of 4 or less can be considered,
instead of 3 as before, and we have to find an element with a
score of at least 1 with ‘T’. Here ‘T’ can be
substituted with ‘S’ (T-S = 1), which keeps the score at 36,
equal to the target of 36.
So we give a margin for sub-sequences to be considered, and will
get at least some data from the pdb.
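The ‘PCHWT’ case can be checked against the threonine row of BLOSUM62. In this sketch the row is transcribed from the standard BLOSUM62 table rather than taken from the authors' files, and the names `T_ROW` and `substitutes_for_t` are hypothetical.

```python
# BLOSUM62 scores of threonine (T) against every other residue.
T_ROW = {"A": 0, "R": -1, "N": 0, "D": -1, "C": -1, "Q": -1, "E": -1,
         "G": -2, "H": -2, "I": -1, "L": -1, "K": -1, "M": -1, "F": -2,
         "P": -1, "S": 1, "W": -2, "Y": -2, "V": 0}
T_SELF = 5      # T-T
MAX_SCORE = 40  # PCHWT matched against itself


def substitutes_for_t(target):
    """Residues that may replace T while the window score stays >= target."""
    return sorted(res for res, score in T_ROW.items()
                  if MAX_SCORE - T_SELF + score >= target)


print(substitutes_for_t(MAX_SCORE - 3))  # fixed penalty of 3, target 37
print(substitutes_for_t(MAX_SCORE - 4))  # 10% of 40, target 36
```

With the fixed penalty no residue qualifies; with the percentage penalty only ‘S’ does.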
What happens when the penalty increases?
Sequence 5: AVGID; Penalty = 3.0; 10 degree interval
The total max score possible for AVGID = 24. Giving penalty 3.0
makes the target score = -48 (24*(1-3.0)).
As we can see, these are the phi and psi distributions for ‘G’
throughout the whole pdb database.
[Figures: Phi and Psi distributions, with artifacts marked]
Length of Sequence Comparisons
Comparison: for a given penalty, how the graphs differ when
we take a length of 3, 5, 7 or 9.
An important point to note is that we took a short sequence of
9 amino acids and took all these sub-sequences from it, keeping
the 5th element as the middle one.
1 1BA1 GLY 0.000000 -61.200000
2 1BA1 PRO -63.900000 158.000000
3 1BA1 ALA -86.100000 153.400000
4 1BA1 VAL -118.500000 163.900000
5 1BA1 GLY -113.200000 129.800000
6 1BA1 ILE -125.600000 126.900000
7 1BA1 ASP -99.600000 104.400000
8 1BA1 LEU -89.200000 86.900000
9 1BA1 GLY -78.700000 161.000000
[Figure: Sequence 3: VGI; Penalty = 0.0; 1/10 degree interval. Panels: phi, psi]
[Figure: Sequence 5: AVGID; Penalty = 0.4; 1/10 degree interval. Panels: phi, psi]
[Figure: Sequence 7: PAVGIDL; Penalty = 0.4; 1/10 degree interval. Panels: phi, psi]
[Figure: Sequence 9: GPAVGIDLG; Penalty = 0.4; 1/10 degree interval. Panels: phi, psi]
1/10th degree & 10 degree interval for Sequence 3
[Figure: Phi and Psi panels at 1/10 degree and 10 degree intervals]
Pam250 & Blosum62
Comparisons
Sequence 5: AVGID; Penalty = 0.3; 10 degree interval
[Figure: Phi angles and Psi angles panels for Blosum62 and Pam250]
Sequence 5, Blosum62
[Figure: Phi angles for ‘G’ (glycine) in sequence 5 with penalties 10% – 100%, shown with the phi distribution for ‘G’]
[Figure: Psi angles for ‘G’ (glycine) in sequence 5 with penalties 0.1 – 1.0, shown with the psi distribution for ‘G’]
Predicting ‘1BA1’
Comparing the Phi and Psi angles for the first 25 amino acid
sequences from ‘1BA1’ with the test outputs. The peak values
were selected manually from the outputs.
[Figures: ‘1BA1’ Phi for first 5 windows; ‘1BA1’ Psi for first 5 windows]
END