BLOSUM and PAM Matrix

DNA sequence alignment measurement
Lecture 13
Introduction
• Measurement of alignment “strength”
• Nucleic acid and amino acid substitutions
• Measurement of alignment gaps
Measurement of aligned sequences
• When aligning sequences (DNA/AA) it is assumed
that:
– they have a common ancestor;
– the differences between the sequences are the result of
mutations;
– important areas like coding sequences (CDS) will be
conserved. There is a bias “against” mutations in these
areas;
– furthermore, there is a bias in the types of mutations:
substitutions are more likely than insertions/deletions.
• The dot plot gives a visual representation of sequence
alignment regions. But how do we measure the
strength of these alignments?
Measurement of aligned sequences
• One way is to count the mismatches: the “difference”
between the sequences.
– Hamming distance:
• The distance is the number of mismatched positions between strings of equal length.
– agtc
– cgta Distance is 2 (give another example)
• If the sequences (strings) are not of equal length then use:
– The Levenshtein distance: the minimum number of edit
operations (alter/insert/delete) required to turn one string
into another:
• ag-tcc
• cgctca What is the Levenshtein distance?
• The latter technique has the advantage of allowing the
inclusion of gaps.
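The two distances above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation; the function names `hamming` and `levenshtein` are chosen here for the example, and every edit operation is assumed to cost 1.

```python
def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a: str, b: str) -> int:
    """Minimum number of alter/insert/delete operations turning a into b."""
    # prev[j] is the distance between the prefix of a processed so far and b[:j].
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j - 1] + (x != y),  # alter x into y (free if they match)
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
            ))
        prev = curr
    return prev[-1]


print(hamming("agtc", "cgta"))         # 2, as in the slide example
print(levenshtein("agtcc", "cgctca"))  # run this to check your own answer
```

The Levenshtein function keeps only two rows of the dynamic-programming table, which is enough when only the distance (and not the alignment itself) is needed.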
Measurement of matching
• But what about the biological plausibility of
these approaches to measuring “differences”
between sequences (strings)?
• DNA sequence mismatches are different from
generic string mismatches:
– because the probabilities of substitutions, insertions and
deletions are not the same;
– certain types of mutations, like inversions,
translocations and duplications, complicate the
assessment of similarity; e.g. how would you treat
tandem repeats or inverted repeats?
Nucleic Acid mutations
• In sequence alignment we are trying to determine whether
the differences (similarities) occurred due to:
– chance (random mutations);
– a common origin (degree of conservation).
• One approach would be to count the percentage of
matches, but there is also a need to include the bias
associated with possible substitutions.
• However, similarity does not necessarily imply a
common ancestor, or vice versa. Zvelebil and Baum
(2008, p. 74) suggest this can occur in convergent
evolution/divergent evolution.
• So the findings of alignment tests need to be
contextualised (bats and birds both have wings…).
Alignment Scoring methods
• In general, an alignment is given a score at each
aligned position; the alignment with the
largest total score is optimal and is chosen.
However, suboptimal alignments may also need to be
considered.
• The most basic approach is to
measure the percentage of similarity.
• Given that not all “changes” occur with equal
chance there is a need to develop:
– A nucleotide substitution matrix
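The most basic measure, percentage similarity, can be sketched as follows. This is a minimal illustration under stated assumptions: the two rows are already aligned (equal length, "-" marking a gap), a gap column counts as a mismatch, and the function name `percent_identity` is chosen here for the example.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of aligned columns in which both rows hold the same residue."""
    if len(a) != len(b) or not a:
        raise ValueError("rows must be non-empty and of equal length")
    # A column containing a gap never counts as a match.
    matches = sum(x == y and x != "-" for x, y in zip(a, b))
    return 100.0 * matches / len(a)


print(percent_identity("ag-tcc", "cgctca"))  # 50.0 (3 matching columns out of 6)
```

Every match counts the same here, which is exactly the limitation that motivates a substitution matrix.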
Nucleotide scoring Matrix
• It is known that certain mutations are more
likely to occur than others: e.g. transitions (a<->g, c<->t) are
more common than transversions (e.g. a<->c).
• However, since the probability of such differences is
insignificant in relation to the chance of a mutation
itself, the differences are mostly ignored. The
following shows a typical scoring matrix for
nucleotides.
[Figure: typical nucleotide scoring matrix. Adapted from Baxevanis p. 303]
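As a small aid for telling the two mutation types apart, the sketch below uses the standard definition: a transition swaps two purines (a, g) or two pyrimidines (c, t), while a transversion swaps a purine with a pyrimidine. The function name `substitution_type` is chosen here for illustration.

```python
PURINES = {"a", "g"}
PYRIMIDINES = {"c", "t"}


def substitution_type(x: str, y: str) -> str:
    """Classify a nucleotide pair as identity, transition or transversion."""
    x, y = x.lower(), y.lower()
    for base in (x, y):
        if base not in PURINES | PYRIMIDINES:
            raise ValueError(f"not a DNA nucleotide: {base!r}")
    if x == y:
        return "identity"
    # Both purines or both pyrimidines: the chemical class is preserved.
    same_class = (x in PURINES) == (y in PURINES)
    return "transition" if same_class else "transversion"


print(substitution_type("a", "g"))  # transition
print(substitution_type("a", "c"))  # transversion
```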
Nucleic acid scoring Matrix
• The values are based on the probability of a
type of substitution occurring (expected
value); this includes a nucleotide substituting
with itself.
• These expected values are calculated as
the ratio:
– number of “observed changes” / number of
changes “due to chance”
• These values are obtained by examining large
numbers of DNA sequences.
Nucleic acid scoring Matrix
• Then calculate 10 × log10(“expected value”).
• This ensures that the expected values of adjacent
nucleotides can now be added, as opposed to being
multiplied, in determining the alignment score.
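The conversion to a log score, and the reason it makes scores additive, can be sketched as below. The counts are hypothetical numbers picked only for the demonstration, and the function name `log_odds_score` is an assumption of this example.

```python
import math


def log_odds_score(observed: float, by_chance: float) -> float:
    """10 * log10 of the ratio observed changes / changes due to chance."""
    expected_value = observed / by_chance
    return 10 * math.log10(expected_value)


# Hypothetical (observed, by chance) counts for two adjacent aligned positions.
counts = [(30, 10), (5, 10)]  # expected values 3.0 and 0.5
scores = [log_odds_score(obs, chance) for obs, chance in counts]

# Adding the log scores gives the same result as multiplying the ratios first.
combined = 10 * math.log10(3.0 * 0.5)
print(sum(scores), combined)
```

Adding scores position by position is what alignment algorithms rely on when they build up a total score along an alignment.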
Nucleic acid scoring Matrix
• An expected value of 1 indicates the
substitution has the same chance of
occurrence as if it were occurring randomly.
• A value greater than 1 indicates a bias in
favour of the substitution.
• A value less than 1 indicates a bias against
the substitution.
• A matrix score of 5 will give what expected value?
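Going back from a matrix score to an expected value just inverts the 10 × log10 step. The sketch below assumes the "value of 5" in the question refers to a score in the matrix; the function names are chosen for this example.

```python
def expected_value_from_score(score: float) -> float:
    """Invert score = 10 * log10(expected value)."""
    return 10 ** (score / 10)


def interpret(expected_value: float) -> str:
    """Read an expected value the way the bullets above describe."""
    if expected_value > 1:
        return "bias in favour of the substitution"
    if expected_value < 1:
        return "bias against the substitution"
    return "same chance as a random occurrence"


value = expected_value_from_score(5)
print(round(value, 2), interpret(value))  # 3.16 bias in favour of the substitution
```

A score of 0 maps back to an expected value of 1, so positive scores favour a substitution and negative scores count against it.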
Measuring Protein similarity
• Deriving a matrix for proteins is more complex
because:
• There are 20 amino acids, so there is a much larger set of
possible substitutions.
• The amino acids have properties that affect the
structure and so the protein functionality.
• Therefore substitutions can be conserved or semi-conserved.
• Observations show that:
• conserved substitutions, e.g. hydrophobic <-> hydrophobic, are more common;
• semi-conserved substitutions, e.g. hydrophilic <-> hydrophobic, are less common.
Dot plot Matrix: imperfect match
• Some alignments require gaps to increase the
matching score; the gaps are used to represent
insertion/deletion mutations.
• The diagram shows that most of the 2 sequences
are aligned. Breaks in the plotted line indicate areas
of non-alignment or mismatches: gaps or
substitutions.
Adapted from: dotplot example
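A dot plot is simple to build: put one sequence on each axis and mark every cell where the two residues match. The sketch below is a bare-bones text version with no windowing or noise filtering, which real dot plot tools usually add; the function names are chosen for this example.

```python
def dot_plot(a: str, b: str) -> list[list[bool]]:
    """Cell [i][j] is True when residue a[i] matches residue b[j]."""
    return [[x == y for y in b] for x in a]


def render(a: str, b: str) -> str:
    """Draw the plot as text: '*' for a match, '.' for a mismatch."""
    rows = ["  " + " ".join(b)]  # header row with the second sequence
    for x, row in zip(a, dot_plot(a, b)):
        rows.append(x + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(rows)


print(render("agtcc", "cgctca"))
```

Runs of '*' along a diagonal correspond to aligned stretches; a diagonal that jumps sideways or downwards suggests an insertion or deletion.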
Measurement of alignment gaps
• Gaps represent insertions and deletions.
• Baxevanis (2005) suggests that no more than “one
gap in 20 pairs is a good rule of thumb”.
• Gaps in alignments are penalised; given a
negative scoring value.
• The penalty associated with using gaps is
dependent on:
– Opening the gap (introducing an insertion or deletion)
– Extending the gap (as opposed to opening a new gap)
– The length of the gap (the number of
deletions/insertions).
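One common way to combine these three factors is an affine gap penalty: a larger cost for opening a gap plus a smaller cost for each further position it is extended. The sketch below assumes that scheme; the default values -10 and -1 are hypothetical and are not taken from the lecture or its references.

```python
def gap_penalty(length: int, gap_open: float = -10.0, gap_extend: float = -1.0) -> float:
    """Affine penalty: pay gap_open once, then gap_extend per extra position."""
    if length < 0:
        raise ValueError("gap length cannot be negative")
    if length == 0:
        return 0.0  # no gap, no penalty
    return gap_open + gap_extend * (length - 1)


print(gap_penalty(1))      # -10.0: a single-position gap
print(gap_penalty(3))      # -12.0: one gap of length 3
print(3 * gap_penalty(1))  # -30.0: three separate gaps of length 1
```

Under this scheme one long gap is penalised less than several short ones, which fits the idea that a single insertion or deletion event can span several positions.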
Gap penalties
• There is no overall agreement on what values should
be assigned to gap penalties (Zvelebil and Baum 2008).
• The purpose of inserting a gap is to increase the strength
of the alignment.
• So choosing a high penalty will eliminate alignments with
gaps, while if the penalty is too low then alignments with
more and larger gaps will be chosen.
• The value should also depend on how closely
“related” the aligned sequences must be:
– So sequences requiring a very strict match would use a high gap
penalty.
– Alignment between distantly related species would use a
low gap penalty.
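The effect of the penalty size can be seen by scoring two candidate alignments of the same pair of sequences, one with gaps and one without. The sketch below makes several simplifying assumptions: match = +1, mismatch = -1, and a flat penalty per gap column. The sequences and all numeric values are hypothetical.

```python
def alignment_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Sum per-column scores for two aligned rows ('-' marks a gap)."""
    if len(a) != len(b):
        raise ValueError("aligned rows must have equal length")
    score = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            score += gap
        elif x == y:
            score += match
        else:
            score += mismatch
    return score


ungapped = ("acgtacgt", "acgacgtt")   # 4 matches, 4 mismatches
gapped = ("acgtacgt-", "acg-acgtt")   # 7 matches, 2 gap columns

for gap in (-1, -5):  # a low and a high gap penalty
    print(gap, alignment_score(*ungapped, gap=gap), alignment_score(*gapped, gap=gap))
```

With the low penalty the gapped alignment scores higher (5 against 0); with the high penalty the ungapped one wins (0 against -3), matching the trade-off described above.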
Potential Exam Questions
• What is the purpose of measuring the
strength of an alignment? (3 marks)
• Explain two differences between analysing a
string (sequence) and a DNA string. (4 marks)
• Describe how you would measure the
similarity between two DNA sequences. (10 marks)
• Discuss the use of gap penalties in a sequence
alignment score. (13 marks)
References
• Baxevanis, A.D. (2005) Bioinformatics: a
practical guide to the analysis of genes and
proteins, chapter 11; Wiley
• Lesk, A. (2008) Introduction to Bioinformatics,
3rd edition; Oxford University Press
• Zvelebil and Baum (2008) Understanding
Bioinformatics