Basics of sequence analysis
Ch.6 and Ch.7
• Sequence acquisition
• Sequence data
• Reconstructing sequence
• Sequence alignment
• Alignment algorithms
• Database searching
• Uses of alignments
http://upload.wikimedia.org/wikipedia/commons/c/cb/Sequencing.jpg
ABI era
Source: wikipedia
Scaling up by brute force
$3,000,000,000 genome
Source: G. Church
Source: G. Church
A Genome Analyzer flowcell (left) and imaging region or
‘tile’ (right), with a magnified section showing a cluster.
Source: Bioinformatics. 2009 Sep 1;25(17):2194-9. Epub 2009 Jun 23.
http://www.eurofinsdna.com
Source: Bioinformatics. 2009 Sep 1;25(17):2194-9. Epub 2009 Jun 23.
Confidence call
Name
Sequence
Alignment
Read
CCCATCGCCCAGTTCCAGATCCCTTGCCTGATTAAAAATAC
Reference genome
Read
Hypothesis #1
Genome ref
Read
Hypothesis #2
Genome ref
Quality, Q, is a function of the probability, P,
that the sequencer called a wrong base
Q = −10 log10(P)
Q=10: 1 in 10 chance that base was miscalled
Q=20: 1 in 100 chance that base was miscalled
Q=30: 1 in 1000 chance that base was miscalled
Q is estimated by the sequencing software.
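The relationship between Q and P can be checked numerically. The following is a minimal sketch; the function names are illustrative and not part of any sequencing software.

```python
import math

def phred_from_error_prob(p):
    """Q = -10 * log10(P), where P is the probability of a miscalled base."""
    return -10.0 * math.log10(p)

def error_prob_from_phred(q):
    """Inverse relation: P = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)

if __name__ == "__main__":
    # Print the miscall probability for the three qualities listed on the slide.
    for q in (10, 20, 30):
        print(f"Q={q}: P={error_prob_from_phred(q)}")
```

Each step of 10 in Q corresponds to a tenfold drop in the miscall probability.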
Read
Q=30
Hypothesis #1
Genome ref
Read
Q=10
Hypothesis #2
Genome ref
A penalty scheme to account for different types of dissimilarities
Let’s stipulate that small gaps (indels)
occur in bacterial genomes at 1 in 10K positions
Penalty_gap = −10 log10(P_gap) = −10 log10(0.0001) = 40
Let’s stipulate that SNPs occur in bacterial genomes at 1 in 1K positions,
but, depending on Q, a sequencing error may be more likely
Penalty_min = min{−10 log10(P_miscall), −10 log10(P_SNP)} = min{Q, −10 log10(0.001)} = min{Q, 30}
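The penalty scheme above can be sketched in a few lines. The rates (1 in 10K for gaps, 1 in 1K for SNPs) are the ones stipulated on the slide; the function and constant names are hypothetical.

```python
import math

P_GAP = 1e-4  # stipulated indel rate: 1 in 10K positions
P_SNP = 1e-3  # stipulated SNP rate: 1 in 1K positions

def phred(p):
    """Phred-style penalty: -10 * log10(p)."""
    return -10.0 * math.log10(p)

def gap_penalty():
    """Penalty for a small gap, derived from the stipulated indel rate."""
    return phred(P_GAP)

def mismatch_penalty(q):
    """A mismatch is explained by the likelier of a miscall (penalty Q)
    or a true SNP (penalty 30), so the smaller penalty applies."""
    return min(q, phred(P_SNP))

if __name__ == "__main__":
    print("gap:", gap_penalty())
    for q in (10, 30, 40):
        print(f"mismatch at Q={q}:", mismatch_penalty(q))
```

A mismatch at a low-quality base is penalised by Q, while a mismatch at a high-quality base is capped at the SNP penalty.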
Penaltygap = 40
PenaltySNP =30
Penalty = 40
Q=40
Penalty = 70
Q=10
Penalty = 50
How do we find our sequence in the first place?
Local alignment
Given a string P (“pattern”) of length m and a string T
(“text”) of length n, find substrings a and b of P and T, respectively,
having maximal global alignment score
Event            Example (top row / bottom row)   Penalty
Match            TTG / TTG                        +15
Mismatch         TCG / TTG                        -30
Gap in text      TCG / T-G                        -40
Gap in pattern   T-G / TTG                        -40
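As an illustration, the table can be applied column by column to an alignment that is already written out with gap characters. This sketch assumes that the scores apply per aligned column, that the top row is the pattern and the bottom row is the text, and that "-" marks a gap; the function name is hypothetical.

```python
MATCH, MISMATCH, GAP = 15, -30, -40  # values taken from the table above

def alignment_score(pattern_row, text_row):
    """Sum per-column scores for two equal-length, already-aligned rows."""
    if len(pattern_row) != len(text_row):
        raise ValueError("aligned rows must have equal length")
    score = 0
    for p, t in zip(pattern_row, text_row):
        if p == "-" or t == "-":
            score += GAP        # gap in pattern or gap in text
        elif p == t:
            score += MATCH
        else:
            score += MISMATCH
    return score

if __name__ == "__main__":
    # The four example alignments from the table.
    for top, bottom in (("TTG", "TTG"), ("TCG", "TTG"), ("TCG", "T-G"), ("T-G", "TTG")):
        print(top, bottom, alignment_score(top, bottom))
```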
Smith-Waterman
Dynamic Programming
Indexing
Dot Plots
Window:2, Stringency:1
Dot matrix analysis
of DNA sequence encoding λ cI repressor (vertical) and P22 c2
repressor
Window - 11;
Stringency - 7
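The window and stringency parameters can be made concrete with a small sketch. It assumes the usual dot-matrix rule: a dot is placed at (i, j) when at least `stringency` of the `window` positions starting at i in one sequence and j in the other are identical. The function name is hypothetical.

```python
def dot_plot(seq_a, seq_b, window=2, stringency=1):
    """Return the set of (i, j) window start positions that receive a dot."""
    dots = set()
    for i in range(len(seq_a) - window + 1):
        for j in range(len(seq_b) - window + 1):
            # Count identical positions inside the two windows.
            matches = sum(seq_a[i + k] == seq_b[j + k] for k in range(window))
            if matches >= stringency:
                dots.add((i, j))
    return dots

if __name__ == "__main__":
    a, b = "ACGTACGT", "ACGTTCGT"
    dots = dot_plot(a, b, window=2, stringency=2)
    # Crude text rendering: rows follow seq_a, columns follow seq_b.
    for i in range(len(a) - 1):
        print("".join("*" if (i, j) in dots else "." for j in range(len(b) - 1)))
```

Raising the window and stringency (for example 11 and 7, as in the repressor comparison) suppresses isolated chance matches and leaves the diagonals that reflect real similarity.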
Analysis of the regions of low complexity
Calculation of an alignment score
Pairwise Alignment Examples (II)
Dispersed alignment without gaps may
have higher score than more
visually appealing alignment with gaps
An alignment scoring system is required
to evaluate how good an alignment is
• positive and negative values assigned
• gap creation and extension penalties
• positive score for identities
• some partial positive score for
conservative substitutions
• global versus local alignment
• use of a substitution matrix
“Window location” by FASTA and BLAST
Two kinds of sequence alignment:
global and local
The global alignment algorithm of Needleman and Wunsch
(1970).
The local alignment algorithm of Smith and Waterman (1981).
BLAST, a heuristic version of Smith-Waterman.
Should the result of the alignment include all residues (amino acids or nucleotides) or just those
that "match"? If all of them, a global alignment is desired
In a global alignment, the presence of mismatched elements is neutral: it doesn't
affect the overall match score
Should the result of the alignment include all residues (amino acids or nucleotides) or just those
that "match"? If only those that match, a local alignment is desired
Local alignments are accomplished by including negative scores for
"mismatched" positions, so scores get worse as we move away from the region
of match
Instead of starting traceback with highest value in first row or column, start
with highest value in entire matrix, stop when score hits zero
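The local-alignment traceback rule described here (start at the highest cell anywhere in the matrix, stop when the score reaches zero) can be sketched as follows. This is a minimal, unoptimised Smith-Waterman with a linear gap penalty; the default scores reuse the +15 / -30 / -40 example values from earlier in the slides, and the function name is hypothetical.

```python
def smith_waterman(pattern, text, match=15, mismatch=-30, gap=-40):
    """Return (best score, aligned pattern substring, aligned text substring)."""
    m, n = len(pattern), len(text)
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if pattern[i - 1] == text[j - 1] else mismatch
            # Clamping at zero is what makes the alignment local.
            H[i][j] = max(0, H[i - 1][j - 1] + s, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Traceback from the highest cell in the whole matrix until a zero is hit.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if pattern[i - 1] == text[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            top.append(pattern[i - 1]); bottom.append(text[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(pattern[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(text[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))

if __name__ == "__main__":
    print(smith_waterman("TTG", "ACTTGCA"))
```

A global (Needleman-Wunsch) version would drop the zero clamp, initialise the first row and column with accumulated gap penalties, and trace back from the final cell instead.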
What is Database Search?
• Find a particular (usually) short sequence in a database
of sequences (or one huge sequence).
• Problem is identical to local sequence alignment, but on
a much larger scale.
• We must also have some idea of the significance of a
database hit.
– Databases always return some kind of hit, how much attention
should be paid to the result?
• A similar problem is the global alignment of two large
sequences
• General idea: good alignments contain high scoring
regions.
Imperfect Alignment
• What is an imperfect alignment?
• Why imperfect alignment?
• The result may not be optimal.
• Finding the optimal alignment is usually too
costly in terms of time and memory.
Database Search Methods
• Hash table based methods
– FASTA family
• FASTP, FASTA, TFASTA, FASTAX, FASTAY
– BLAST family
• BLASTP, BLASTN, TBLAST, BLASTX, BLAT, BLASTZ,
MegaBLAST, PsiBLAST, PhiBLAST
– Others
• FLASH, PatternHunter, SSAHA, SENSEI, WABA, GLASS
• Suffix tree based methods
– Mummer, AVID, Reputer, MGA, QUASAR
History of sequence searching
• 1970: NW (Needleman-Wunsch)
• 1981: SW (Smith-Waterman)
• 1985: FASTA
• 1990: BLAST
Hash Table
• K-gram = subsequence of length K
• A^k entries
– A is alphabet size
• Linear time construction
• Constant lookup time
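A k-gram hash table of this kind can be sketched with a Python dictionary that maps each k-gram to the positions where it starts. The sequence is the example query used later in the BLAST slides; the function name is hypothetical.

```python
from collections import defaultdict

def build_kgram_index(seq, k):
    """Map every k-gram of seq to the list of its start positions.

    One pass over the sequence, so construction is linear for fixed k;
    each dictionary lookup takes expected constant time.
    """
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index

index = build_kgram_index("MCGPFILGTYC", 3)

if __name__ == "__main__":
    for kgram, positions in index.items():
        print(kgram, positions)
```

Only k-grams that actually occur are stored, so the table holds at most n − k + 1 of the A^k possible entries.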
FASTP
Lipman & Pearson, 1985
FASTP
• Three phase algorithm
1. Find short good matches using k-grams (K = 1 or 2)
2. Find start and end positions for good matches
3. Use DP to align good matches
FASTP: Phase 1 (2)
• Similar to dot plot
• Offsets range from 1-m to
n-1
• Each offset is scored as
– # matches - # mismatches
• Diagonals (offsets) with
large score show local
similarities
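The diagonal scoring in Phase 1 can be illustrated with a brute-force sketch. It assumes that an offset is d = j − i for query position i and target position j (so d runs from 1 − m to n − 1) and scores every overlapping position as +1 for a match and −1 for a mismatch. FASTP itself finds matches through the k-gram lookup table rather than by comparing every pair; the function name is hypothetical.

```python
def diagonal_scores(query, target):
    """Score each diagonal (offset) as #matches - #mismatches."""
    m, n = len(query), len(target)
    scores = {}
    for d in range(1 - m, n):          # offsets 1-m .. n-1
        s = 0
        for i in range(m):
            j = i + d
            if 0 <= j < n:             # only positions where the sequences overlap
                s += 1 if query[i] == target[j] else -1
        scores[d] = s
    return scores

if __name__ == "__main__":
    scores = diagonal_scores("AC", "GAC")
    best = max(scores, key=scores.get)
    print(scores, "best offset:", best)
```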
FASTP: Phase 2
• 5 best diagonal runs
are found
• Rescore these 5
regions using
PAM250.
– Initial score
• Indels are not
considered yet
FASTP: Phase 3
• Sort the aligned regions in descending
score
• Optimize these alignments using
Needleman-Wunsch
• Report the results
FASTA – Improvement Over
FASTP
Pearson 1995
FASTA (1)
• Phase 2: Choose 10 best diagonal runs instead of 5
FASTA (2)
• Phase 2.5
– Eliminate diagonals that score less than some given threshold.
– Combine matches to find longer matches. It incurs join penalty
similar to gap penalty
FASTA Variations
• TFASTAX and TFASTAY: query protein
against a DNA library in all reading frames
• FASTAX, FASTAY: DNA query in all
reading frames against protein database
BLAST
Altschul, Gish, Miller, Myers,
Lipman, 1990
BLAST (or BLASTP)
• BLAST – Basic Local Alignment Search
Tool
• An approximation of Smith-Waterman
• Designed for database searches
– Short query sequence against long database
sequence or a database of many sequences
• Sacrifices search sensitivity for speed
BLAST Algorithm (1)
• Eliminate low complexity regions from the
query sequence.
– Replace them with X (protein) or N (DNA)
• Hash table on query sequence.
– K = 3 for proteins
MCGPFILGTYC
CGP
MCG
BLAST Algorithm (2)
• For each k-gram find all k-grams that align with
score at least cutoff T using BLOSUM62
– 20^k candidates
– ~50 on average per k-gram
– ~50n for the entire query
• Build hash table
Example: query PQGMCGPFILGTYC, first words PQG and QGM, T = 13
Candidates scored against the word PQG:
PQG 18
PEG 15
PRG 14
PSG 13
PQA 12
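The neighborhood scoring in this example can be reproduced with a small sketch. To stay short it embeds only the handful of BLOSUM62 entries needed for the word PQG and its listed candidates, not the full matrix; the dictionary and function names are hypothetical.

```python
# Excerpt of BLOSUM62: only the residue pairs needed for this example.
BLOSUM62_EXCERPT = {
    ("P", "P"): 7, ("Q", "Q"): 5, ("G", "G"): 6,
    ("Q", "E"): 2, ("Q", "R"): 1, ("Q", "S"): 0, ("G", "A"): 0,
}

def pair_score(a, b):
    """Look up a symmetric substitution score (KeyError if outside the excerpt)."""
    if (a, b) in BLOSUM62_EXCERPT:
        return BLOSUM62_EXCERPT[(a, b)]
    return BLOSUM62_EXCERPT[(b, a)]

def word_score(word, candidate):
    """Ungapped score of two equal-length words."""
    return sum(pair_score(a, b) for a, b in zip(word, candidate))

def neighbors(word, candidates, threshold):
    """Keep the candidates that score at least `threshold` against `word`."""
    scored = [(c, word_score(word, c)) for c in candidates]
    return [(c, s) for c, s in scored if s >= threshold]

if __name__ == "__main__":
    print(neighbors("PQG", ["PQG", "PEG", "PRG", "PSG", "PQA"], threshold=13))
```

With T = 13, PQA (score 12) is dropped and the other four words are entered into the hash table for the word PQG.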
BLAST Algorithm (3)
• Sequentially scan the database and locate
each k-gram in the hash table
• Each match is a seed for an ungapped
alignment.
BLAST Algorithm (4)
• HSP (High Scoring Pair) = A match between a
query word and the database
• Find a “hit”: Two non-overlapping HSP’s on a
diagonal within distance A
• Extend the hit until the score falls below a
threshold value, X
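The extension step can be sketched as an ungapped, one-directional extension. This sketch assumes one common reading of the stopping rule: extension ends once the running score has fallen more than X below the best score seen so far. The +1 / −1 scores are placeholders (BLASTP would score with BLOSUM62), and the function name is hypothetical.

```python
def extend_right(query, target, q_start, t_start, x_drop, match=1, mismatch=-1):
    """Extend an ungapped alignment to the right from (q_start, t_start).

    Returns (best score, length of the extension that achieved it).
    """
    score = best = 0
    best_len = 0
    i = 0
    while q_start + i < len(query) and t_start + i < len(target):
        score += match if query[q_start + i] == target[t_start + i] else mismatch
        i += 1
        if score > best:
            best, best_len = score, i       # new best end point
        elif best - score > x_drop:
            break                           # fell too far below the best: stop
    return best, best_len

if __name__ == "__main__":
    print(extend_right("ACGTAAAA", "ACGTCCCC", 0, 0, x_drop=2))
```

A full implementation would extend in both directions from the seed and report the highest-scoring segment.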
BLAST Algorithm (5)
• Keep only the extended matches that have
a score at least S.
• Determine the statistical significance of the
result
BLASTN
• BLAST for nucleic acids
• K = 11
• Exact match instead of neighborhood
search.
BLAST Variations
Program   Query          Target         Type
BLASTP    Protein        Protein        Gapped
BLASTN    Nucleic acid   Nucleic acid   Gapped
BLASTX    Nucleic acid   Protein        Gapped
TBLASTN   Protein        Nucleic acid   Gapped
TBLASTX   Nucleic acid   Nucleic acid   Gapped
Even More Variations
– PsiBLAST (iterative)
– BLAT, BLASTZ, MegaBLAST
– FLASH, PatternHunter, SSAHA, SENSEI,
WABA, GLASS
– Main differences are
• Seed choice (k, gapped seeds)
• Additional data structures
Suffix Tree
• Tree structure that contains all suffixes of the input sequence
• TGAGTGCGA
• GAGTGCGA
• AGTGCGA
• GTGCGA
• TGCGA
• GCGA
• CGA
• GA
• A
Suffix Tree Example
Suffix Tree Analysis
• O(n) space and construction time
– 10n to 70n space usage reported
• O(m) search time for m-letter sequence
• Good for
– Small data
– Exact matches
Suffix Array
• 5 bytes per letter
• O(m log n) search
time
• Better space usage
• Slower search
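The O(m log n) search can be illustrated with a small sketch over the example sequence from the suffix tree slide. The construction here simply sorts the suffixes, which is easy to read but far slower than the specialised algorithms real tools use; the function names are hypothetical.

```python
def build_suffix_array(text):
    """Start positions of all suffixes, in lexicographic order of the suffixes."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, sa, pattern):
    """Binary-search the suffix array: O(log n) steps, each comparing up to m letters."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:                                  # first suffix with prefix >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:                                  # first suffix with prefix > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])

text = "TGAGTGCGA"
sa = build_suffix_array(text)

if __name__ == "__main__":
    print(sa)
    print(find_occurrences(text, sa, "GA"))
```

All suffixes that begin with the pattern sit next to each other in the array, so two binary searches are enough to find every occurrence.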
Mummer
Other Sequence Comparison Tools
• Reputer, MGA, AVID
• QUASAR (suffix array)
Uses of sequence alignment
• Search databases
• Assess similarity, relatedness
• Identify structural variations (point, gross)
• Determine specificity of primers
• Evaluate complexity of a sequence
• Assemble sequence de novo