Lecture3_HomologyAndAlignment2014_10sept

Sources
• Page & Holmes
• Vladimir Likic presentation:
http://science.marshall.edu/murraye/Clearer%20Matrix%20slide%20show.pdf
• Wikipedia
• Lecture at:
http://cs.njit.edu/usman/courses/bnfo601_fall08/AffineGap.pdf
Homoplasy – structural or DNA resemblance
due to parallelism or convergent evolution
rather than to common ancestry
Which are homoplasious?
Problem: which base positions share common descent?
agtggtcttgctacattgctagctaaatcgatcatgatcgatgattcagg
tagctaaatcgatcatgatcgatgattcaggcgatgtcatgactgatcag
tacattgctagctaaatcgatcatgatcgatgattcaggcgatgtcatga
gatcatgatcgatgattcaggcgatgtcatgactgatcagggatgatgat
Alignment – residue to residue correspondence between 2 or
more sequences such that the order of residues in each
sequence is preserved.
agtggtcttgctacattgctagctaaatcgatcatgatcgatgattcagg
                   tagctaaatcgatcatgatcgatgattcaggcgatgtcatgactgatcag
           tacattgctagctaaatcgatcatgatcgatgattcaggcgatgtcatga
                             gatcatgatcgatgattcaggcgatgtcatgactgatcagggatgatgat
Indels make alignment trickier
agtggtcttgctacattgctagctaaatcgatcatgatcgatgattcagg
                   tagctaaatcgatcatgatcgatgattcaggcgatgtcatgactgatcag
           tacattgctagctaaa----tcatgatcgatgattcaggcgatgtcatga
                             gatcatgatcgatgattcaggcgat------actgatcagggatgatgat
Alignment problems (examples)
1) different sequences of the same allele from the same locus
within the same individual
2) sequences of different alleles from the same locus within the
same individual
3) same locus from different individuals
Assembly (from Ensembl) – When the genome of a species
is to be sequenced, the chromosomes from many cells are
broken at random positions into small fragments, which are
sequenced, and reassembled into long sequences (contigs).
Contigs may be assembled into longer sequences called
scaffolds and sometimes, if the depth of sequencing is high
enough, there may be enough information to assemble most
of the scaffolds into chromosomes. The resulting collection of
sequences after assembly is called a genome assembly.
Alignment Methods
• Dot plot – qualitative
• Sequence alignment – quantitative;
constructing the best alignment using a
scoring scheme
Types of Alignment
• Global – best alignment over the entire length
• Local – best alignment in small region; used when
comparing sequences of different lengths
• Multiple – beyond pairwise
cagcacttggattctgg
&
cagcgtgg
Local
cagca-cttggattctgg
---cagcgtgg-------
Global (best depending on gap penalties)
cagcacttggattctgg
cagc----g--t---gg
Gaps
• residue to nothing match that can be inserted in either
sequence
• are not part of the DNA sequence, only a construct for
alignment
• Gap to gap match is meaningless and not allowed
Dot plots – heuristic; make matrix, place dots; find diagonals
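The dot-plot idea can be sketched in a few lines of Python. This is only an illustration, assuming a window of one residue and no noise filtering; the function names `dot_plot`, `longest_diagonal` and `render` are made up for this example.

```python
def dot_plot(seq_a, seq_b):
    """Return a matrix with True wherever seq_a[i] == seq_b[j]."""
    return [[a == b for b in seq_b] for a in seq_a]


def longest_diagonal(matrix):
    """Length of the longest unbroken run of dots along any diagonal."""
    best = 0
    cols = len(matrix[0]) if matrix else 0
    # run[j + 1] holds the run length ending at column j of the previous row,
    # so run[j] is the diagonal predecessor of cell (i, j).
    run = [0] * (cols + 1)
    for row in matrix:
        new_run = [0] * (cols + 1)
        for j, dot in enumerate(row):
            if dot:
                new_run[j + 1] = run[j] + 1
                best = max(best, new_run[j + 1])
        run = new_run
    return best


def render(matrix):
    """Draw the matrix as text: '*' for a dot, '.' for an empty cell."""
    return "\n".join("".join("*" if cell else "." for cell in row) for row in matrix)


if __name__ == "__main__":
    plot = dot_plot("cagcacttggattctgg", "cagcgtgg")
    print(render(plot))
    print("longest diagonal:", longest_diagonal(plot))
```

Long diagonals mark stretches where the two sequences match residue for residue; real dot-plot tools usually compare windows of several residues to suppress the background of isolated dots.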
Alignment with scoring schemes
• Score alignments to select the best possible one under a given scoring scheme
Scoring scheme
• A set of rules that assigns a score to a particular alignment
between two sequences
• Goal is to maximize score
• Score is sum of residue substitution scores and gap penalties
+1 for match
-1 for mismatch
No gap penalty
atggcgt
atg-agt
+1+1+1-1+1+1 = 4
atggcgt
a-tgagt
+1-1+1-1+1+1 = 2
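The two scores can be checked mechanically. A minimal Python sketch, assuming the scheme stated here (+1 match, -1 mismatch, gap columns contribute 0); the function name `score_alignment` is illustrative only.

```python
def score_alignment(row_a, row_b, match=1, mismatch=-1, gap=0):
    """Sum column scores for two aligned rows of equal length."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            # A gap-to-gap column is meaningless and not allowed.
            raise ValueError("gap-to-gap columns are not allowed")
        if a == "-" or b == "-":
            total += gap
        elif a == b:
            total += match
        else:
            total += mismatch
    return total


if __name__ == "__main__":
    print(score_alignment("atggcgt", "atg-agt"))  # 4
    print(score_alignment("atggcgt", "a-tgagt"))  # 2
```

With no gap penalty the first alignment wins only because it has one fewer mismatch; changing `gap` to a negative value lowers both scores equally here, since each alignment contains exactly one gap column.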
Substitution matrix:
     c   t   a   g
c    1  -1  -1  -1
t   -1   1  -1  -1
a   -1  -1   1  -1
g   -1  -1  -1   1
What if we want to penalize transitions less than transversions?
Substitution matrix:
     c   t   a   g
c    2   1  -1  -1
t    1   2  -1  -1
a   -1  -1   2   1
g   -1  -1   1   2
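The matrix follows one rule: identical bases score 2, transitions (purine to purine, pyrimidine to pyrimidine) score 1, and transversions score -1. A minimal Python sketch of that rule; the names `substitution_score` and `build_matrix` are illustrative, not from the lecture.

```python
PURINES = {"a", "g"}
PYRIMIDINES = {"c", "t"}


def substitution_score(x, y, match=2, transition=1, transversion=-1):
    """Score one pair of bases, penalizing transitions less than transversions."""
    x, y = x.lower(), y.lower()
    if x == y:
        return match
    # A transition stays within one chemical class (a<->g or c<->t).
    same_class = {x, y} <= PURINES or {x, y} <= PYRIMIDINES
    return transition if same_class else transversion


def build_matrix(alphabet="ctag"):
    """Return the full matrix as a dict keyed by (row base, column base)."""
    return {(x, y): substitution_score(x, y) for x in alphabet for y in alphabet}


if __name__ == "__main__":
    matrix = build_matrix()
    print("    " + "   ".join("ctag"))
    for x in "ctag":
        print(x + " " + " ".join(f"{matrix[(x, y)]:3d}" for y in "ctag"))
```

Deriving the matrix from a rule rather than typing it out keeps it symmetric by construction.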
Protein substitution matrices
• More complex than DNA scoring matrices.
• Proteins are composed of twenty amino acids, and
physical-chemical properties of individual amino acids
vary considerably.
• Can be based on any property of amino acids: size,
polarity, charge, hydrophobicity.
• Evolutionary substitution matrices – empirically derived
by assessment of frequencies of changes at particular
levels of divergence
Evolutionary substitution matrices
• PAM ("point accepted mutation") family
PAM250, PAM120, etc.
• BLOSUM ("Blocks substitution matrix") family
BLOSUM62, BLOSUM50, etc.
• The BLOSUM matrices were developed more
recently and are generally considered better.
BLOSUM62
BLOSUM80 is used for less divergent sequences
BLOSUM45 is used for more divergent sequences
Etc.
Gaps
• Because gaps often result in radical protein changes
(frame shifts, premature stop), the penalty for a gap is
usually several times greater than the penalty for a
mutation.
• Once created, gaps of more than one residue might be
less expensive than a completely new gap - in other
words gap opening penalties and gap extension
penalties are often defined separately
Affine gap penalty function W(i)
W(i) = g + h*i
(for i >= 1, where i = gap length)
• g: gap opening penalty
• h: gap extension penalty
• The ratio between g and h determines the relative weight
for opening versus extension
– Small g, large h: gap length more important
– Large g, small h: gap length less important
W(i) = g + h*i
g = -3
h = -1
Substitution matrix:
     c   t   a   g
c    2   1  -1  -1
t    1   2  -1  -1
a   -1  -1   2   1
g   -1  -1   1   2
ATGTAGTGTATAGTACATGCA
ATGTAG-------TACATGCA
14 matches, 1 gap of length 7: 28 – 3 – 1(7) = 18
ATGTAGTGTATAGTACATGCA
ATGTA--G--TA---CATGCA
14 matches, 3 gaps of total length 7: 28 – 3(3) – 1(7) = 12
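The affine scores can be recomputed mechanically. A minimal Python sketch, assuming W(i) = g + h*i with g = -3 and h = -1 and the transition-aware matrix (match 2, transition 1, transversion -1); the function names are illustrative only. Counting columns directly, both alignments have 14 matched columns (28 points at +2 each) before gap penalties are applied.

```python
import re


def affine_gap_penalty(length, g=-3, h=-1):
    """W(i) = g + h*i for a gap of length i >= 1."""
    return g + h * length


def substitution(x, y):
    """Match 2, transition 1, transversion -1 (case-insensitive)."""
    x, y = x.lower(), y.lower()
    if x == y:
        return 2
    return 1 if {x, y} in ({"a", "g"}, {"c", "t"}) else -1


def score_affine(row_a, row_b, g=-3, h=-1):
    """Score two aligned rows: substitution scores plus one W(i) per gap run."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row_a, row_b):
        if a != "-" and b != "-":
            total += substitution(a, b)
    # Each maximal run of '-' in either row is one gap of that length.
    for row in (row_a, row_b):
        for run in re.findall(r"-+", row):
            total += affine_gap_penalty(len(run), g, h)
    return total


if __name__ == "__main__":
    top = "ATGTAGTGTATAGTACATGCA"
    print(score_affine(top, "ATGTAG" + "-" * 7 + "TACATGCA"))
    print(score_affine(top, "ATGTA--G--TA---CATGCA"))
```

Both alignments pay the same extension cost (seven gap positions); the second pays the opening penalty three times instead of once, which is exactly the 6-point difference between them.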
How do we find the best alignment?
Brute-force approach:
Generate the list of all possible alignments between
two sequences, score them, select the alignment
with the best score
The number of possible global alignments between
two sequences of length N is approximately
$2^{2N} / \sqrt{\pi N}$
For two sequences of 250 residues this is ~$10^{149}$
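The order of magnitude is easy to check. The expression 2^(2N) / sqrt(πN) is the standard Stirling approximation of the central binomial coefficient C(2N, N), so a short Python sketch can compare the two for N = 250; the function names are illustrative only.

```python
import math


def exact_count(n):
    """Central binomial coefficient C(2n, n)."""
    return math.comb(2 * n, n)


def log10_approx(n):
    """log10 of 2^(2n) / sqrt(pi * n), computed in log space to avoid overflow."""
    return 2 * n * math.log10(2) - 0.5 * math.log10(math.pi * n)


if __name__ == "__main__":
    n = 250
    print("log10 of exact count:  ", math.log10(exact_count(n)))
    print("log10 of approximation:", log10_approx(n))
```

Both values come out just above 149, which is why enumerating every alignment is hopeless even for sequences of modest length.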
Needleman-Wunsch (global) and Smith-Waterman (local) are both
algorithms that find the best alignment by breaking
the problem down into subproblems using dynamic
programming
…however, the result is only the best with respect to the scoring matrix
and the gap opening and extension penalties
These methods are computationally expensive
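A compact Python sketch of both dynamic-programming recurrences. To keep it short it assumes a simple match/mismatch score and a linear gap penalty (every gap position costs `gap`) rather than a substitution matrix with affine gaps; the parameter defaults are arbitrary choices for illustration.

```python
def _pair(x, y, match, mismatch):
    return match if x == y else mismatch


def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment: return (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + _pair(a[i - 1], b[j - 1], match, mismatch),
                score[i - 1][j] + gap,   # gap in b
                score[i][j - 1] + gap,   # gap in a
            )
    # Trace back from the bottom-right corner to recover one best alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == (
            score[i - 1][j - 1] + _pair(a[i - 1], b[j - 1], match, mismatch)
        ):
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(row_a)), "".join(reversed(row_b))


def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-1):
    """Local alignment: return the best score of any pair of substrings."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # The extra 0 lets an alignment restart anywhere.
            cur[j] = max(
                0,
                prev[j - 1] + _pair(a[i - 1], b[j - 1], match, mismatch),
                prev[j] + gap,
                cur[j - 1] + gap,
            )
            best = max(best, cur[j])
        prev = cur
    return best


if __name__ == "__main__":
    print(needleman_wunsch("atggcgt", "atgagt"))
    print(smith_waterman_score("cagcacttggattctgg", "cagcgtgg"))
```

Both functions fill a table of (len(a) + 1) × (len(b) + 1) cells, so time grows with the product of the sequence lengths. That is far better than brute force but still costly when a query is compared against a whole database.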
BLAST – Basic Local Alignment Search Tool
- Tries to find the highest scoring ungapped local
alignment between a query and a database
- Uses a word length (w) and scans the database for words
that score at least a threshold (T) when aligned with
words in the query
- The local alignment is then extended in both directions
until the score drops a set amount below the best score
reached so far.
- Many types of BLAST can be found at
http://blast.ncbi.nlm.nih.gov/Blast.cgi
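The seed-and-extend idea can be illustrated with a toy Python sketch. It is not BLAST's actual implementation: it assumes exact word matches as seeds (instead of scoring neighbourhood words against a threshold T), a +1/-1 score, and a drop-off parameter `x_drop` that is a name chosen for this example.

```python
from collections import defaultdict


def index_words(subject, w):
    """Map every word of length w in the subject to its start positions."""
    index = defaultdict(list)
    for pos in range(len(subject) - w + 1):
        index[subject[pos:pos + w]].append(pos)
    return index


def extend(query, subject, q, s, step, match, mismatch, x_drop):
    """Walk from (q, s) in direction step; return (best gain, residues kept)."""
    score = best = 0
    length = best_len = 0
    while 0 <= q < len(query) and 0 <= s < len(subject):
        score += match if query[q] == subject[s] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > x_drop:
            break  # fallen too far below the best score seen so far
        q += step
        s += step
    return best, best_len


def seed_and_extend(query, subject, w=4, match=1, mismatch=-1, x_drop=2):
    """Return (score, query_start, subject_start, length) of the best ungapped hit."""
    index = index_words(subject, w)
    best_hit = (0, 0, 0, 0)
    for q in range(len(query) - w + 1):
        for s in index.get(query[q:q + w], []):
            right, r_len = extend(query, subject, q + w, s + w, 1, match, mismatch, x_drop)
            left, l_len = extend(query, subject, q - 1, s - 1, -1, match, mismatch, x_drop)
            hit = (w * match + left + right, q - l_len, s - l_len, l_len + w + r_len)
            best_hit = max(best_hit, hit)
    return best_hit


if __name__ == "__main__":
    print(seed_and_extend("ttacgtacgg", "ccacgtacaa"))
```

Only positions that share a word are ever extended, so most of the dynamic-programming table is never examined; the price is that an alignment containing no exact word match of length w is missed.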