PAM Matrices

INTRODUCTION
PAM matrices are point accepted mutation matrices, also called percent accepted
mutation matrices (Dayhoff). They are built from sequences that are very similar to each
other. From such sequences we can derive substitution frequencies and use them to
estimate how often each substitution occurs during evolution. The matrices count only
mutations that have been “accepted” by natural selection. They were developed by
Margaret Dayhoff and co-workers and calculated in 1978, together with a precise and
rigorous approach for implementing a model of evolutionary change in their mutation
data matrix. PAM matrices are used mainly to score sequence alignments (usually of
amino acids). One PAM of evolution means that 1% of residues have changed, averaged
over all 20 amino acids. To get the relative frequency of each type of mutation, we count
the number of times it is observed in multiple sequence alignments. The matrices are
based on global alignments and on the assumption that evolution follows a Markov
model.
Note: a global alignment arranges nucleotide or amino acid sequences so that every
residue of each sequence is aligned. It is intended for sequences that are similar and
roughly equal in length.
Note: a mutation is a change in the nucleotide sequence of the genome of an organism,
virus, or extrachromosomal genetic element.

Only mutations are allowed

Sites evolve independently
[Figure: example of a codon table]
By aligning the sequences, we assert that the aligned bases in each column had a common
ancestor.
PAM DISTANCES
• In a PAM matrix, the evolutionary distance increases with the PAM number.
• Two sequences are at 1 PAM distance if one can be converted into the other with an
average of 1 accepted mutation per 100 amino acids.
• If we take many pairs of sequences at 1 PAM distance, about 1% of the aligned amino
acids will differ within each pair.
• From such pairs we can derive the frequencies expected for each amino acid pair.
• The probability of two independent events is the product of the two individual
probabilities.
PAM 1
• PAM 1:
‒ 1 substitution per 100 residues
‒ one PAM unit of time
‒ the probability of an amino acid changing into another is ~1% and the probability of
it not changing is ~99%
[Figure: example of the Dayhoff PAM 250 matrix]
MARKOVIAN EVOLUTION
• In a Markov process, the next state depends only on the current state, not on earlier
states.
• Evolution is Markovian: base (or amino acid) changes occur at a constant rate and
depend only on the identity of the current base (or amino acid).
• Example: Markovian evolution is an extrapolation:
‒ Start with all G’s. Wait 1 million years (MY). Where do they go?
- Using PAM1, we expect them to become about 0.0002 A, 0.0007 P, 0.9946 G, etc.
‒ Wait another million years.
- The new A’s mutate according to PAM1 for A’s, and so on for the other residues.
‒ Wait another million years, and so on.
- What is the final distribution of amino acids at positions that were once G’s?
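The extrapolation described above amounts to multiplying the one-step matrix by itself. The following is a minimal sketch, assuming a hypothetical 3-letter alphabet and made-up one-step probabilities (`TOY_PAM1` is not Dayhoff's data); column j holds the probabilities that residue j becomes each residue i, so every column sums to one.

```python
# Hypothetical toy alphabet and one-step matrix (NOT real PAM1 values).
# TOY_PAM1[i][j] = probability that residue j becomes residue i in one step.
ALPHABET = ["G", "A", "P"]
TOY_PAM1 = [
    [0.990, 0.006, 0.002],
    [0.007, 0.990, 0.003],
    [0.003, 0.004, 0.995],
]


def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(m, k):
    """Return m multiplied by itself k times (identity for k = 0)."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(k):
        result = mat_mul(result, m)
    return result


def evolve(start, steps):
    """Distribution of residues at sites that started as `start`."""
    powered = mat_power(TOY_PAM1, steps)
    col = ALPHABET.index(start)
    return {ALPHABET[row]: powered[row][col] for row in range(len(ALPHABET))}


for steps in (1, 2, 250):
    dist = evolve("G", steps)
    print(steps, {aa: round(p, 4) for aa, p in dist.items()})
```

With a real 20 × 20 PAM1 matrix the same repeated multiplication would give the probabilities behind PAM2, PAM50 or PAM250.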
SCORING MATRIX
A scoring matrix is used to compute alignment scores. The alignment score is the sum of
the matrix entries for each aligned amino acid pair. Each entry reflects how often the two
amino acids are observed matched in evolutionarily related sequences, compared with how
often they would be aligned by chance. PAM is a set of matrices used to score sequence
alignments. A sequence alignment is a way of arranging sequences of amino acids or
nucleic acids to identify regions of similarity. As a rule of thumb, if the alignment score
is greater than 0 the sequences are considered to be related, and if the score is negative
they are assumed not to be related. Scoring matrices can be built for any kind of sequence
(e.g. DNA or amino acid). Identical amino acid pairs and frequently observed
substitutions are assigned the most positive scores. Substitutions that are rarely observed
in related sequences, and so do not indicate relatedness at that location, are given
negative scores. The simplest matrices use scoring schemes based only on side chain
similarity.
The codon table shows how the 4 DNA bases combine in groups of three to form codons,
each of which codes for an amino acid.
The amino acids can be categorised as hydrophobic or hydrophilic. This property
determines the side chain similarity of the twenty amino acids.
Hydrophobic: Alanine, Glycine, Isoleucine, Leucine, Phenylalanine, Methionine, Proline,
Valine, Tryptophan
Hydrophilic: Cysteine, Aspartic acid, Glutamic acid, Histidine, Lysine, Asparagine,
Glutamine, Arginine, Serine, Threonine, Tyrosine
Example 1:
TAGHVRP
HVGGSQM
T - H = (Hydrophilic-Hydrophilic)
A - V = (Hydrophobic-Hydrophobic)
G - G = (Hydrophobic-Hydrophobic)
H - G = (Hydrophilic-Hydrophobic)
V - S = (Hydrophobic-Hydrophilic)
R - Q = (Hydrophilic-Hydrophilic)
P - M = (Hydrophobic-Hydrophobic)
Pairs with similar side chains (5) outnumber pairs with different side chains (2).
Result: the score will be positive.
Example 2:
ATGRFWV
NVTRFEY
A - N = (Hydrophobic-Hydrophilic)
T - V = (Hydrophilic-Hydrophobic)
G - T = (Hydrophobic-Hydrophilic)
R - R = (Hydrophilic-Hydrophilic)
F - F = (Hydrophobic-Hydrophobic)
W - E = (Hydrophobic-Hydrophilic)
V - Y = (Hydrophobic-Hydrophilic)
Pairs with different side chains (5) outnumber identical and similar pairs (2).
Result: the score will be negative.
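The side-chain comparison in Examples 1 and 2 can be sketched in a few lines. This is only an illustration, assuming the hydrophobic/hydrophilic grouping listed above written in one-letter codes; the function name `compare_side_chains` is made up for this sketch.

```python
# One-letter codes for the two groups listed above.
HYDROPHOBIC = set("AGILFMPVW")
HYDROPHILIC = set("CDEHKNQRSTY")


def side_chain_class(residue):
    """Return 'hydrophobic' or 'hydrophilic' for a one-letter amino acid code."""
    if residue in HYDROPHOBIC:
        return "hydrophobic"
    if residue in HYDROPHILIC:
        return "hydrophilic"
    raise ValueError(f"unknown residue: {residue}")


def compare_side_chains(seq1, seq2):
    """Count aligned pairs whose side chains fall in the same / different class."""
    similar = different = 0
    for a, b in zip(seq1, seq2):
        if side_chain_class(a) == side_chain_class(b):
            similar += 1
        else:
            different += 1
    return similar, different


print(compare_side_chains("TAGHVRP", "HVGGSQM"))  # Example 1
print(compare_side_chains("ATGRFWV", "NVTRFEY"))  # Example 2
```

More similar than different pairs corresponds to the positive result of Example 1, and the reverse to the negative result of Example 2.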
S = [s_ij] gives the score of aligning character i with character j for every pair i, j.
Example 3:
Using the PAM250 scoring matrix, calculate the score of the alignment below, where
character i of the first sequence is aligned with character j of the second.
STPP
CTCA
S-C = 0, T-T = 3, P-C = -3, P-A = 1
Score = 0 + 3 + (-3) + 1 = 1
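As a sketch of Example 3, the alignment score is just a sum of table lookups. The dictionary below holds only the four PAM250 entries used in the example, not the full matrix, and the names `PAM250_SUBSET` and `alignment_score` are illustrative.

```python
# Only the PAM250 entries needed for Example 3 (the full matrix has 20 x 20 entries).
PAM250_SUBSET = {
    ("S", "C"): 0,
    ("T", "T"): 3,
    ("P", "C"): -3,
    ("P", "A"): 1,
}


def pair_score(a, b, matrix):
    """Look up s_ij, treating the matrix as symmetric."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]


def alignment_score(seq1, seq2, matrix):
    """Sum the matrix entries over all aligned (ungapped) positions."""
    return sum(pair_score(a, b, matrix) for a, b in zip(seq1, seq2))


print(alignment_score("STPP", "CTCA", PAM250_SUBSET))  # 0 + 3 + (-3) + 1
```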
Gap
1. A gap is a consecutive, uninterrupted run of spaces in one sequence of an alignment.
It corresponds to an insertion or deletion of amino acids or nucleotides.
2. Gaps are represented as dashes in a protein or DNA sequence alignment.
3. The length of a gap is the number of inserted or deleted residues it spans.
4. Gaps are introduced so that the alignment can be extended into regions where one
sequence has lost or gained characters that are not found in the other.
Example 4:
attc--ga-tggacc
a--cgtgatt---cc
Seven matches, no mismatches, four gaps and eight spaces.
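The counts in Example 4 can be checked with a short sketch. It assumes the two rows are already aligned and of equal length, and it treats each uninterrupted run of dashes in one row as one gap; `alignment_stats` is a name chosen for this illustration.

```python
def alignment_stats(row1, row2):
    """Return (matches, mismatches, gaps, spaces) for two aligned rows."""
    matches = mismatches = gaps = spaces = 0
    prev1 = prev2 = False  # was the previous column a dash in row1 / row2?
    for a, b in zip(row1, row2):
        dash1, dash2 = a == "-", b == "-"
        if dash1 or dash2:
            spaces += int(dash1) + int(dash2)
            # A gap starts where a dash follows a non-dash in the same row.
            if dash1 and not prev1:
                gaps += 1
            if dash2 and not prev2:
                gaps += 1
        elif a == b:
            matches += 1
        else:
            mismatches += 1
        prev1, prev2 = dash1, dash2
    return matches, mismatches, gaps, spaces


print(alignment_stats("attc--ga-tggacc", "a--cgtgatt---cc"))
```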
Gap penalties
1. Gap penalty values are designed to reduce the alignment score when an alignment has
been interrupted by insertions and deletions in the sequences.
2. A penalty value is subtracted for each gap introduced into an alignment, because each
gap adds uncertainty to the alignment.
3. The gap penalty helps decide whether to accept a gap (an insertion or deletion) in an
alignment when a good residue-to-residue alignment is attainable at some other
neighbouring point in the sequence.
4. If the gap penalty is too low, a high alignment score is attainable even between
unrelated or random sequences.
5. The penalty should be small enough to allow a previously accumulated alignment to
continue across an insertion or deletion in one of the sequences, but not so large that it
wipes out the previous alignment score.
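To illustrate how the penalty changes the score, here is a minimal sketch. The scoring values (match +1, mismatch -1, a penalty for opening each gap and a penalty per space) are assumptions chosen for this example; the text does not prescribe particular values.

```python
def score_with_gaps(row1, row2, match=1, mismatch=-1, gap_open=-3, gap_extend=-1):
    """Score two aligned rows; all scoring values are illustrative assumptions."""
    score = 0
    prev = (False, False)  # dash in previous column of (row1, row2)?
    for a, b in zip(row1, row2):
        dash = (a == "-", b == "-")
        if dash[0] or dash[1]:
            for k in (0, 1):
                if dash[k]:
                    score += gap_extend      # charged for every space
                    if not prev[k]:
                        score += gap_open    # charged once per gap
        elif a == b:
            score += match
        else:
            score += mismatch
        prev = dash
    return score


rows = ("attc--ga-tggacc", "a--cgtgatt---cc")
print(score_with_gaps(*rows))                            # with gap penalties
print(score_with_gaps(*rows, gap_open=0, gap_extend=0))  # gaps cost nothing
```

With the penalties switched off, the heavily gapped alignment keeps the full score of its seven matches, which is the situation point 4 warns about.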
RELATIVE MUTABILITY
Dayhoff et al. defined the “relative mutability” of each amino acid as the
probability that the amino acid will change over a small evolutionary time period. The
total number of changes is counted (on all branches of all protein trees considered), as is
the total number of occurrences of each amino acid, and the ratio of the two is taken.
Relative mutability is calculated for each amino acid individually because amino
acids are not equally mutable: some residues are observed to mutate more frequently
than others per occurrence. The relative mutability of amino acid j is therefore defined as
the number of times amino acid j mutated divided by the number of occurrences of amino
acid j. The relative mutabilities and the point accepted mutation matrix are then used to
calculate the mutation probability matrix.
Relative mutability = [changes] / [occurrences]
Example:
Sequence 1: ala his val ala
Sequence 2: ala arg ser val
For ala, relative mutability = [1] / [3] = 0.33
For val, relative mutability = [2] / [2] = 1.0
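The ratio in the example above can be sketched as follows. The counting convention is an assumption that matches the worked example: occurrences are counted over both sequences, and a mismatched column counts as one change for each of the two residues involved.

```python
from collections import Counter


def relative_mutability(seq1, seq2):
    """Return {residue: changes / occurrences} for one pairwise alignment."""
    occurrences = Counter(seq1) + Counter(seq2)
    changes = Counter()
    for a, b in zip(seq1, seq2):
        if a != b:
            changes[a] += 1
            changes[b] += 1
    return {aa: changes[aa] / occurrences[aa] for aa in occurrences}


seq1 = ["ala", "his", "val", "ala"]
seq2 = ["ala", "arg", "ser", "val"]
for residue, value in relative_mutability(seq1, seq2).items():
    print(residue, round(value, 2))
```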
SUBSTITUTION FREQUENCY
A substitution matrix contains values proportional to the probability that amino acid
i mutates into amino acid j, for all pairs of amino acids.
Substitution matrices are constructed by assembling a large and diverse sample of
verified pairwise alignments (or multiple sequence alignments) of amino acids.
Substitution matrices should reflect the true probabilities of mutations occurring
through a period of evolution.
Example: $F_{G,A}$
A substitution may occur as A→G or G→A, and both directions are counted together.
Therefore, $F_{G,A} = 3$.
MUTATION PROBABILITY
The mutation probability matrix can be calculated once the relative mutabilities are
known. The relative mutabilities of the amino acids and the point accepted mutation
matrix (for example the count $F_{G,A}$) are combined as follows:
$M_{ij} = \dfrac{m_j F_{ij}}{\sum_i F_{ij}}$
$M_{ij}$ is the probability that an original amino acid j (in columns) will be replaced by
amino acid i (in rows) over a defined evolutionary interval.
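A minimal sketch of the formula above, using a hypothetical 3-letter alphabet with made-up counts and mutabilities (none of these numbers come from Dayhoff's data). The diagonal entry 1 - m_j is an assumption added here so that each column sums to one.

```python
# Hypothetical substitution counts F (symmetric) and relative mutabilities m.
F = {("A", "G"): 3, ("A", "S"): 1, ("G", "S"): 2}
m = {"A": 0.010, "G": 0.005, "S": 0.012}


def count(i, j):
    """Symmetric lookup of F_ij (0 if the pair was never observed)."""
    return F.get((i, j), F.get((j, i), 0))


def mutation_probability_matrix(letters):
    """M[i][j] = m_j * F_ij / sum_i F_ij for i != j; M[j][j] = 1 - m_j (assumed)."""
    M = {i: {} for i in letters}
    for j in letters:
        column_total = sum(count(i, j) for i in letters if i != j)
        for i in letters:
            if i == j:
                M[i][j] = 1.0 - m[j]
            else:
                M[i][j] = m[j] * count(i, j) / column_total
    return M


M = mutation_probability_matrix(["A", "G", "S"])
print(M["G"]["A"])  # probability that an original A is replaced by G
```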
The entries $R_{ij}$ are the $M_{ij}$ values divided by the frequency of occurrence, $f_i$,
of residue i. The score is the logarithm (base 10) of this ratio.
• $f_G$ = 10 G / 63 residues = 0.1587
• $\log R_{G,A} = \log(2.1 / 0.1587) = \log(13.2325) = 1.1216 \approx 1$
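The arithmetic above can be reproduced with a few lines. This sketch takes the numbers from the example (10 G out of 63 residues, and 2.1 as the value divided by the frequency) and assumes a base-10 logarithm, which is what the quoted result implies.

```python
import math

f_G = 10 / 63           # frequency of occurrence of G
M_GA = 2.1              # value from the worked example
R_GA = M_GA / f_G       # odds ratio
score = math.log10(R_GA)

print(round(f_G, 4), round(R_GA, 2), round(score, 4), round(score))
```

The small difference from 13.2325 in the text comes from rounding the frequency to 0.1587 before dividing.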
QUESTION (RELATIVE MUTABILITY)
1. You are given the sequences below.
How many G→X substitutions are there across all pairs of sequences?
AMINO ACID SUBSTITUTION MATRIX
During evolution, DNA mutations occur and gradually change the genotype and
phenotype of an organism. A substitution matrix records the number of times a residue in
a sequence is replaced by another residue over a period of time, for amino acid or DNA
sequences. The divergence time and the residues that have been substituted between two
sequences are reflected in the substitution matrix, which is used to determine the
percentage similarity between them.
The amino acid substitution matrix is based on two important concepts introduced
by Dayhoff and co-workers. The first is the ‘log-odds’ strategy for scoring an alignment.
Alignments presumed to be correct are first collected. The number of residue
substitutions in these alignments is then used to estimate the ratio of observed to
expected probabilities, where the expected probabilities are based on a model for chance
alignments. To score an alignment, the odds ratios of the residue pairs are multiplied or,
equivalently, their logarithms are added. A negative log-odds score signifies that the
residue exchange is more likely to occur in a chance alignment than in a correct
alignment, and vice versa for a positive score.
The second concept is to base the substitution numbers on estimated mutation
rates. Every mutation is assumed to be independent of previous mutations. For this
model to be valid, mutation rates should be calculated from alignments of closely related
sequences; otherwise intermediate mutations could be missed. A consequence of using
closely related sequences to estimate mutation rates is that these rates must be
extrapolated to model greater evolutionary distances.
PAM (Point Accepted Mutation) MATRIX
In the 1970s, Margaret Dayhoff developed the first amino acid substitution matrix,
known as the PAM matrix. A PAM is a unit of evolutionary change for protein
sequences. Each row and column of the matrix represents one of the 20 standard amino
acids. All the data used for PAM came from alignments of closely related proteins, in
which more than 85% of the amino acids are identical. The PAM matrix is based on
global sequence alignments and is calculated from the differences observed between
closely related proteins. The result is presented as a log-odds table, in which each entry
is the ratio of how often a residue pair occurs in alignments of related sequences to how
often it would occur in alignments of unrelated sequences. This ratio is then converted to
a logarithm (base 10 in the mutation probability example above).
When computing a PAM matrix, it is difficult to collect statistics about amino acid
substitutions in distantly diverged sequences. For example, finding which positions
correspond is hard when the PAM distance between the sequences is large, because many
insertions and deletions have taken place. It is therefore easier to compute the PAM
matrix from closely related protein sequences. The most likely changes in an amino acid
sequence during evolution can be predicted if the ancestral relationships among a group
of proteins are assessed. Margaret Dayhoff pioneered this type of analysis and used it to
produce a scoring matrix (the PAM matrix).
A PAM matrix is used to compare two sequences that are a specific number of PAM
units apart. Examples are PAM 1, PAM 2 and PAM 50. PAM 1 is the matrix calculated
from comparisons of sequences with no more than 1% divergence in amino acid
sequence. PAM 1 is used as the basis for calculating the other PAM matrices: they are
extrapolated from PAM1 by assuming that repeated mutations follow the same pattern as
those in the PAM1 matrix and that multiple substitutions can occur at the same site.
Using this logic, Dayhoff derived matrices as high as PAM250.
PAM MATRIX ASSUMPTION
Some assumptions have to be made before a PAM matrix is generated. One is that
only mutations are allowed to occur. Another is that the probability of amino acid X
replacing Y is the same as the probability of amino acid Y replacing X.
We also assume that all positions in a protein are equally mutable, and we use
closely related proteins to reduce the number of intermediate mutations that go
unobserved.
In addition, following the Markov model, we assume that the probability of a
replacement at any site depends only on the amino acid currently at that site.
Finally, we assume that all sequences have the average amino acid composition.
GENERATING PAM MATRIX
In Dayhoff's approach, the PAM 1 matrix is constructed from 1572 observed
mutations in 71 families of closely related proteins. Statistics were collected from
aligned sequences that have diverged by one PAM unit. The PAM 1 matrix is constructed
as follows:
1. $M_{ij}$ = estimated probability of amino acid j mutating into amino acid i in one
PAM unit of evolutionary change.
2. Result: a 20 × 20 real matrix is generated. The values in each matrix column add
up to one.
3. Every possible identity and substitution is assigned a score based on the observed
frequencies of such occurrences in alignments of related proteins.
4. The diagonal of the matrix contains only positive scores.
5. Residue pairs with an odds ratio
‒ of 1 or more: amino acid pairs that are found as alternatives at or above the
frequency predicted by chance.
‒ below 1: pairs found less often than predicted by chance, suggesting that these
residues are not functionally equivalent.
[Figure: example of Dayhoff’s PAM matrix]
PROBLEM OF PAM MATRIX
The main problem of the PAM matrix is the false assumption that all sites are
equally mutable. In reality, different sites have different probabilities of mutation.
In addition, the matrix is biased because few protein sequences had been collected at
that time, and those were mainly small globular proteins.
Calculating the difference from a PAM matrix
1. A PAM 80 matrix represents sequences with an average of 50% identical amino
acids.
Identical fraction = 50% = 50/100 = 0.5
Different fraction = 1 - 0.5 = 0.5 = 50%
Hence, sequences at PAM 80 differ in 50% of their amino acids.
2. A PAM 250 matrix represents sequences with an average of 20% identical amino
acids.
Identical fraction = 20% = 20/100 = 0.2
Different fraction = 1 - 0.2 = 0.8 = 80%
Hence, sequences at PAM 250 differ in 80% of their amino acids.
COMPUTATIONAL STEPS
$f(j) = n(j) / N$
where
f(j) is the frequency of the jth amino acid,
n(j) is the total number of occurrences of the jth amino acid,
N is the total number of all amino acids.
Suppose the alignment is
Sequence 1: AB
Sequence 2: AA
Frequency: f(A) = 3/4 = 0.75, f(B) = 1/4 = 0.25
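As a small sketch of this step, the frequencies for the two-sequence example can be computed by dividing the count of each residue by the total number of residues. The function name `amino_acid_frequencies` is chosen for this illustration.

```python
from collections import Counter


def amino_acid_frequencies(sequences):
    """Return {residue: n(j) / N} counted over all given sequences."""
    counts = Counter("".join(sequences))   # n(j) for every residue j
    total = sum(counts.values())           # N
    return {residue: n / total for residue, n in counts.items()}


print(amino_acid_frequencies(["AB", "AA"]))
```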
Estimated probabilities:
We construct the mutation matrix M so that the entry M(i,j) represents the probability of
the jth amino acid being replaced by the ith amino acid.
Expected number of mutations:
This is the new expected number of mutations after scaling by the factor α:
The k-PAM probability matrix is $M^k$. The k-PAM scores are computed in the same
way, with the values of $M^k$ plugged in instead of those of M.
REFERENCE
• Durbin R. et al., Biological Sequence Analysis, Chapter 2.
• Setubal J., Meidanis J., Introduction to Computational Molecular Biology, Chapter 3.
DEMO SOFTWARE
• http://www.ebi.ac.uk/Tools/psa/emboss_needle/
EFFECTS OF PAM MATRICES
PAM matrices are used to find the relative frequency with which amino acids
replace each other during the course of evolution, as worked out by Margaret Dayhoff.
With a substitution matrix, an individual score is assigned to each aligned sequence
position, and values are defined for all possible pairs of residues. Through PAM
substitution we can trace the evolutionary origins of proteins using substitution
frequencies derived from sets of closely related protein sequences. Each matrix
corresponds to a particular quantity of accepted mutations, that is, mutations that have
been retained in the sequence.
PAM matrices have several advantages. First, they are based on global
alignments, which include both the most conserved and the most mutable regions, so all
mutations in the sequence are taken into account. This can be extremely helpful in
determining the processes responsible for these mutations, and it provides criteria for
the selection and fixation of a mutation in the population. PAM tables are constructed
from sequence data and provide information about the changes in amino acid residues
after a given number of mutations. Lastly, they provide an empirical determination of
conservative replacements.
The limitations of PAM matrices are that they assume all types of mutation are
distributed uniformly across the protein, and that they use data from closely related
proteins to infer relationships between more distantly related proteins.
COMPARISON BETWEEN PAM MATRICES AND BLOSUM
PAM and BLOSUM matrices are compared because both hold the same kind of scoring
information but are built by different methods. As a result, matrices with the same
number are not equivalent: PAM100 differs from BLOSUM 100 and is instead comparable
to BLOSUM 90.
METHOD
‒ PAM matrices: percent accepted mutation
‒ BLOSUM: blocks substitution matrices
ALIGNMENT
‒ PAM matrices: global alignment, taking all the mutations in the sequence
‒ BLOSUM: local alignment, taking only conserved blocks without gaps rather than the
whole sequence
OBJECTIVE
‒ PAM matrices: refers to a specific evolutionary distance
‒ BLOSUM: refers to percent identity; always a blend of distances, as found in databases
such as PROSITE
ALIGNING DISTANTLY
‒ PAM matrices: PAM 250 (long sequence)
‒ BLOSUM: BLOSUM 50
ALIGNING CLOSELY
‒ PAM matrices: PAM 120 (short sequence)
‒ BLOSUM: BLOSUM 62
COMPARISON OF SEQUENCE
‒ PAM matrices: PAM 120, not more than 120 accepted mutations per 100 residues
‒ BLOSUM: BLOSUM 62, approx. 62% identity
RANGES
‒ PAM matrices: from identical to completely random
‒ BLOSUM: narrower ranges than PAM matrices
EXTRAPOLATION
‒ PAM matrices: yes, extrapolated from PAM1 using an assumed Markov chain
‒ BLOSUM: no, based on observed alignments from comparisons of related proteins
Extrapolation is a process of estimating, beyond the original observation interval, the
value of the variable on the basis of its relationship with another variable.