Database Searches for similar sequences

Bioinformatics
[Figure labels: Protein, Families, Annotation]
Genome
Annotation
• Annotation in bioinformatics: Function,
intron-exon-boundaries, regulatory
sequences, repeats, gene names and protein
products, etc.
• This information is obtained by searching for
similar sequences in databases.
[Figure: a molecule (TLR3) compared by expression, domains, and similar proteins]
Orthology/paralogy/homology
[Figure: gene tree of TLR, TLR1 and TLR2]
Orthologous genes are homologous (corresponding) genes
in different species
Paralogous genes are homologous genes within the same
species (genome)
Database Searches for
similar sequences
How can we search for
domains in a genome?
MQIFVKTLTGKTITLEVESSDTIDNVKAKIQDKEGIPPDQQRLI
LRLRGGMQIFVKTLTGKTITLEVESSDTIDNVKAKIQDKEGIPP
STLHLVLRLRGGMQIFVKTLTGKTITLEVESSDTIDNVKAKIQD
YNIQKESTLHLVLRLRGGMQIFVKTLTGKTITLEVESSDTIDNV
GRTLADYNIQKESTLHLVLRLRGGMQIFVKTLTGKTITLEVESS
How can we search for similar
sequences in a database?
[Figure: alignment → HMM → search]
What if we don't have a multiple
alignment to start with?
MQIFVKTLTGKTITLEVESSDTIDNVKAKIQDKEGIPPDQQRLI
LRLRGGMQIFVKTLTGKTITLEVESSDTIDNVKAKIQDKEGIPP
STLHLVLRLRGGMQIFVKTLTGKTITLEVESSDTIDNVKAKIQD
YNIQKESTLHLVLRLRGGMQIFVKTLTGKTITLEVESSDTIDNV
GRTLADYNIQKESTLHLVLRLRGGMQIFVKTLTGKTITLEVESS
Global versus local alignments
Global alignment: align full length of both sequences.
(The “Needleman-Wunsch” algorithm).
Global alignment
Local alignment: find best partial alignment of two
sequences (the “Smith-Waterman” algorithm).
[Figure: local alignment between part of Seq 1 and part of Seq 2]
We can search by aligning the query to
every single protein sequence in the genomes!
Local alignment overview
Local alignment: example
• The recursive formula is changed by adding
a fourth possibility: zero. This means local
alignment scores are never negative.

$$
H(i,j) = \max
\begin{cases}
H(i-1,j-1) + S(x_i, y_j) & \text{(diagonal)} \\
H(i,j-1) - \text{gap-penalty} & \text{(horizontal)} \\
H(i-1,j) - \text{gap-penalty} & \text{(vertical)} \\
0
\end{cases}
$$
• Trace-back is started at the highest value
rather than in lower right corner
• Trace-back is stopped as soon as a zero is
encountered
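The recursion and trace-back rules above can be sketched in a few lines of Python. This is a minimal illustration, not an optimized implementation: the function name `smith_waterman` and the scoring values (match +1, mismatch -1, linear gap penalty 2) are assumptions chosen for the example, and a real protein search would take S(xi,yj) from a substitution matrix.

```python
def smith_waterman(x, y, match=1, mismatch=-1, gap=2):
    """Return (best_score, aligned_x, aligned_y) for one best local alignment.

    Scoring values are illustrative assumptions, not BLAST defaults.
    """
    n, m = len(x), len(y)
    # H has an extra row and column of zeros for the empty prefixes.
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            H[i][j] = max(
                H[i - 1][j - 1] + s,  # diagonal
                H[i][j - 1] - gap,    # horizontal
                H[i - 1][j] - gap,    # vertical
                0,                    # the fourth possibility
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace-back starts at the highest value and stops at the first zero.
    i, j = best_pos
    ax, ay = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if x[i - 1] == y[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            ax.append(x[i - 1]); ay.append(y[j - 1])
            i -= 1; j -= 1
        elif H[i][j] == H[i][j - 1] - gap:
            ax.append("-"); ay.append(y[j - 1])
            j -= 1
        else:
            ax.append(x[i - 1]); ay.append("-")
            i -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


if __name__ == "__main__":
    print(smith_waterman("AAAGATTACA", "CCCGATTACATT"))
```

Because every cell is compared against zero, two unrelated sequences simply score 0 and produce an empty alignment.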
Local alignment: example
But this is too slow!
• consider the task of searching SWISS-PROT
with a query sequence:
– say our query sequence is 362 amino acids long
– SWISS-PROT release 55 (18-Mar-08) contains
129,199,355 amino acids
• finding local alignments for this query via dynamic
programming would entail approx. 362 × 129,199,355 ≈ 5 × 10^10 matrix
operations
• many servers handle thousands of such queries a
day (NCBI > 50,000)
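The estimate above is just the size of the dynamic programming matrix: one cell per (query residue, database residue) pair. A small sketch of the arithmetic, using the numbers from the slide (the helper name `dp_cells` is made up for this example):

```python
def dp_cells(query_length, database_residues):
    """Number of matrix cells a full dynamic programming search must fill."""
    return query_length * database_residues


if __name__ == "__main__":
    cells = dp_cells(362, 129_199_355)
    print(f"{cells:,} cells, about {cells:.1e}")
```

Multiplying this by tens of thousands of queries per day shows why an exact search is impractical for a public server.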
Sequence database searching
• Local alignments: too slow for repeated database searches
• Fast methods: BLAST, PSI-BLAST
Heuristic Alignment
• Heuristic alignment: an alignment that is
pretty good, but you cannot prove that it is
the best.
• BLAST: uses tricks that people have come
up with to make alignment and database
searching fast, without losing too much
quality.
BLAST
• Basic Local Alignment Search Tool
• Aim: find as many good matches as possible
with reasonable speed
• Let's go through step by step to learn
how BLAST works!
Blast, I: Indexing
• The program makes an index by dividing
every sequence in the database into words of a
defined size (W).
– For DNA sequences the default is W=11
– For protein sequences the default is W=3
• When we run BLAST it creates words from
our query as well
Database sequence (N = 46 bases)
GTGTCAGCTAACGGCCGTTACGATGCTAAAGCTATACGATTAGCG
Words (W = 11)
GTGTCAGCTAA
TGTCAGCTAAC
GTCAGCTAACG
TCAGCTAACGG
CAGCTAACGGC
etc.
Query sequence (N=27 bases)
TCAGATCACGGCCCTTCGGACCTGAGG
Words (W = 11)
TCAGATCACGG
CAGATCACGGC
AGATCACGGCC
etc.
Blast II, Initial Searching
• Each word in the query sequence is compared to the database
index and residue pairs are scored
– For DNA sequences a match is +1, a mismatch is -3
– For protein sequences, scores for matches and mismatches are
based on a substitution matrix
• The score for each word pair is the sum of the scores for each
pair of residues
• Matching words scoring above a threshold (T) are retained
for further analysis
– DNA: T = 0, Protein: T = 11
Database sequence (N = 46 bases)
GTGTCAGCTAACGGCCGTTACGATGCTAAAGCTATACGATTAGCG
Query sequence (N=27 bases)
TCAGATCACGGCCCTTCGGACCTGAGG
Alignments (first query word against database words, W = 11)
TCAGATCACGG
|   | |
TGTCAGCTAAC (3 * 1) + (8 * -3) = -21
TCAGATCACGG
       |  |
GTCAGCTAACG (2 * 1) + (9 * -3) = -25
TCAGATCACGG
|||| | ||||
TCAGCTAACGG (9 * 1) + (2 * -3) = 3
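The indexing and initial search steps can be sketched as follows. This is a toy illustration under stated assumptions: the function names (`words`, `build_index`, `word_score`, `seeds`) are invented for the example, the database is a plain dict, and `seeds` compares every word pair by brute force for clarity, whereas a real implementation looks words up in the index instead. The scores use the DNA values from the slide (match +1, mismatch -3, T = 0).

```python
from collections import defaultdict


def words(seq, w=11):
    """All overlapping words of length w in seq."""
    return [seq[i:i + w] for i in range(len(seq) - w + 1)]


def build_index(database, w=11):
    """Map each word to the (sequence name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in database.items():
        for pos, word in enumerate(words(seq, w)):
            index[word].append((name, pos))
    return index


def word_score(a, b, match=1, mismatch=-3):
    """Sum of per-residue scores for two words of equal length."""
    return sum(match if p == q else mismatch for p, q in zip(a, b))


def seeds(query, database, w=11, threshold=0):
    """Word pairs scoring above the threshold T (brute force, for clarity)."""
    hits = []
    for qpos, qword in enumerate(words(query, w)):
        for name, seq in database.items():
            for dpos, dword in enumerate(words(seq, w)):
                score = word_score(qword, dword)
                if score > threshold:
                    hits.append((qpos, name, dpos, score))
    return hits


if __name__ == "__main__":
    db = {"db1": "GTGTCAGCTAACGGCCGTTACGATGCTAAAGCTATACGATTAGCG"}
    query = "TCAGATCACGGCCCTTCGGACCTGAGG"
    for hit in seeds(query, db):
        print(hit)
```

With the slide's sequences, the first query word pairs with the database word at offset 3 with a score of 3, which is the seed carried forward to the extension step.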
Blast III, Extending Hits
• extend hits in both directions (with or without
allowing gaps)
• Residues will be added until the incremental score
drops below a threshold (S)
Alignments
TCAGATCACGG
|||| | ||||
TCAGCTAACGG (9 * 1) + (2 * -3) = 3
Extension
TCAGATCACGGC
|||| | |||||
TCAGCTAACGGC (10 * 1) + (2 * -3) = 4
TCAGATCACGGCC
|||| | ||||||
TCAGCTAACGGCC (11 * 1) + (2 * -3) = 5
• Stretches of similar regions are called HSPs (high scoring
segment pairs)
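A minimal sketch of ungapped extension to the right of a seed, continuing the DNA example above. The stopping rule used here (give up once the running score has fallen more than `max_drop` below the best score seen so far) is one common way to read "until the score drops below a threshold" and is an assumption of this example, as are the function name and the value `max_drop=5`. Extension to the left works the same way in the other direction and is omitted.

```python
def extend_right(query, subject, q_end, s_end, score,
                 match=1, mismatch=-3, max_drop=5):
    """Extend a seed to the right without gaps.

    q_end and s_end are the positions just past the seed in each sequence.
    Returns (best_score, number_of_residues_added).
    """
    best, best_len, running = score, 0, score
    k = 0
    while q_end + k < len(query) and s_end + k < len(subject):
        running += match if query[q_end + k] == subject[s_end + k] else mismatch
        k += 1
        if running > best:
            best, best_len = running, k
        elif best - running > max_drop:
            break  # the score has dropped too far; stop extending
    return best, best_len


if __name__ == "__main__":
    db = "GTGTCAGCTAACGGCCGTTACGATGCTAAAGCTATACGATTAGCG"
    query = "TCAGATCACGGCCCTTCGGACCTGAGG"
    # Seed: query[0:11] against db[3:14], word score 3.
    print(extend_right(query, db, 11, 14, 3))
```

For the slide's example the extension adds two matching residues (score 3 to 4 to 5) and then stops improving, so the segment ends at TCAGATCACGGCC.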
BLAST Notes
• may fail to find all HSPs
– may miss seeds if T is too stringent
• 10 to 50 times faster than Smith-Waterman
• large impact:
– NCBI’s BLAST server handles more than 50,000
queries a day
– most used bioinformatics program
• The T parameter is the most important for
the speed and quality of the search for
HSPs:
small T: more hits to expand, more False Positives
large T: fewer hits to expand, fewer False Positives
TCAGATCACGGCCCAACGGACCTGAGG
|||| | ||||||     ||    |
TCAGCTAACGGCCGTTACGATGCTAAA (14 * 1) + (13 * -3) = -25
BLAST ‘flavors’
• blastp compares an amino acid query
sequence against a protein sequence database
• blastn compares a nucleotide query sequence
against a nucleotide sequence database
• blastx compares the six-frame protein
translation products of a nucleotide query
sequence against a protein sequence database
• tblastn compares a protein query sequence
against a nucleotide sequence database
translated in six reading frames
• tblastx searches translated nucleotide
database using a translated nucleotide query
BLAST ‘flavors’ for nucleotide
sequences
• Megablast is specifically designed to
efficiently find long alignments between very
similar sequences and thus is the best tool to
use to find the identical match to your query
sequence (W=28)
• discontiguous MEGABLAST is better at
finding nucleotide sequences similar, but not
identical, to your nucleotide query. The
search focuses on the first and second codon positions.
[Figure: BLAST results list with numbered annotations]
1 - This portion of each description links to the sequence record for a particular hit.
2 - Score or bit score is a value calculated from the number of gaps and substitutions
associated with each aligned sequence. The higher the score, the more significant the
alignment. Each score links to the corresponding pairwise alignment between query sequence
and hit sequence (also referred to as subject sequence).
3 - E Value (Expect Value) describes the likelihood that a sequence with a similar score will
occur in the database by chance. The smaller the E Value, the more significant the alignment.
For example, the first alignment has a very low E value of e-117 meaning that a sequence with a
similar score is very unlikely to occur simply by chance.
4 - These links provide the user with direct access from BLAST results to related entries in
other databases. ‘L’ links to LocusLink records and ‘S’ links to structure records in NCBI's
Molecular Modeling DataBase.
When is a database hit
significant?
• Problem:
– Even unrelated sequences can be aligned (yielding a
low score) and thus can give a BLAST hit
– How do we know if a database hit is meaningful?
– When is an alignment score sufficiently high?
• Solution:
– Determine the range of alignment scores you would
expect to get for random reasons (i.e., when aligning
unrelated sequences).
– Compare actual scores to the distribution of random
scores.
– Is the real score much higher than you’d expect by
chance?
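The idea of comparing a real alignment score with scores expected by chance can be illustrated with a shuffling experiment. This is only a sketch of the principle: BLAST itself does not shuffle sequences but relies on precomputed distributions. The function names, the scoring values, and the number of shuffles are all assumptions made for the example.

```python
import random


def local_score(x, y, match=1, mismatch=-3, gap=5):
    """Best local alignment score (score only, two rows of memory)."""
    prev = [0] * (len(y) + 1)
    best = 0
    for i in range(1, len(x) + 1):
        cur = [0] * (len(y) + 1)
        for j in range(1, len(y) + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + s, cur[j - 1] - gap, prev[j] - gap, 0)
            best = max(best, cur[j])
        prev = cur
    return best


def empirical_p_value(query, subject, n_shuffles=200, seed=0):
    """Fraction of shuffled subjects scoring at least as well as the real one."""
    rng = random.Random(seed)
    real = local_score(query, subject)
    letters = list(subject)
    at_least = 0
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # same composition, random order
        if local_score(query, "".join(letters)) >= real:
            at_least += 1
    # Add-one correction so the estimate is never exactly zero.
    return real, (at_least + 1) / (n_shuffles + 1)


if __name__ == "__main__":
    q = "TCAGATCACGGCCCTTCGGACCTGAGG"
    s = "GTGTCAGCTAACGGCCGTTACGATGCTAAAGCTATACGATTAGCG"
    print(empirical_p_value(q, s))
```

A real score that is rarely matched by shuffled sequences is much higher than expected by chance; a score that shuffled sequences reach routinely is not evidence of relatedness.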
Database searching:
E-values in BLAST
BLAST uses precomputed distributions to
calculate the chance that the hit found could be
due to random reasons.
[Figure: E-value formula with labeled terms: score, correction for substitution matrix, correction for search space, size of query sequence, database size]
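The labeled terms correspond to the standard Karlin-Altschul expression E = K · m · n · e^(-λS), where S is the raw score, λ corrects for the substitution matrix, K corrects for the search space, m is the query length and n the database size. The sketch below computes it directly; the values passed for `lam` and `K` are placeholder numbers for illustration only, since the real values depend on the scoring system in use.

```python
import math


def e_value(raw_score, query_len, db_len, lam, K):
    """Expected number of chance hits with at least this raw score."""
    return K * query_len * db_len * math.exp(-lam * raw_score)


def bit_score(raw_score, lam, K):
    """Normalized score, so that E = m * n * 2**(-bit_score)."""
    return (lam * raw_score - math.log(K)) / math.log(2)


if __name__ == "__main__":
    # lam and K are illustrative placeholders, not values for any real search.
    lam, K = 0.318, 0.134
    m, n = 362, 129_199_355
    for s in (40, 80, 160):
        print(s, round(bit_score(s, lam, K), 1), e_value(s, m, n, lam, K))
```

The same raw score becomes less significant as the database grows, because E scales linearly with n.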
A word of caution:
E-values in BLAST
BLAST tends to overestimate the
significance of its matches.
E-values from BLAST are fine for
identifying sure hits. One should be
careful using BLAST’s E-values to judge
if a marginal hit can be trusted. You may
want to use E-value thresholds of 10^-4 to 10^-5.
Low complexity regions
Plasmodium falciparum
>SERA_PLAFG (P13823):
MKSYISLFFILCVIFNKNVIKCTGESQTGNTGGGQAGNTVGDQAGSTGGSPQGSTGASQPGSSEPSNPVSSGHSVSTVSV
SQTSTSSEKQDTIQVKSALLKDYMGLKVTGPCNENFIMFLVPHIYIDVDTEDTNIELRTTLKETNNAISFESNSGSLEKK
KYVKLPSNGTTGEQGSSTGTVRGDTEPISDSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSESLPANGPDSPTVKP
PRNLQNICETGKNFKLVVYIKENTLIIKWKVYGETKDTTENNKVDVRKYLINEKETPFTSILIHAYKEHNGTNLIESKNY
ALGSDIPEKCDTLASNCFLSGNFNIEKCFQCALLVEKENKNDVCYKYLSEDIVSNFKEIKAETEDDDEDDYTEYKLTESI
DNILVKMFKTNENNDKSELIKLEEVDDSLKLELMNYCSLLKDVDTTGTLDNYGMGNEMDIFNNLKRLLIYHSEENINTLK
NKFRNAAVCLKNVDDWIVNKRGLVLPELNYDLEYFNEHLYNDKNSPEDKDNKGKGVVHVDTTLEKEDTLSYDNSDNMFCN
KEYCNRLKDENNCISNLQVEDQGNCDTSWIFASKYHLETIRCMKGYEPTKISALYVANCYKGEHKDRCDEGSSPMEFLQI
IEDYGFLPAESNYPYNYVKVGEQCPKVEDHWMNLWDNGKILHNKNEPNSLDGKGYTAYESERFHDNMDAFVKIIKTEVMN
KGSVIAYIKAENVMGYEFSGKKVQNLCGDDTADHAVNIVGYGNYVNSEGEKKSYWIVRNSWGPYWGDEGYFKVDMYGPTH
CHFNFIHSVVIFNVDLPMNNKTTKKESKIYDYYLKASPEFYHNLYFKNFNVGKKNLFSEKEDNENNKKLGNNYIIFGQDT
AGSGQSGKESNTALESAGTSNEVSERVHVYHILKHIKDGKIRMGMRKYIDTQDVNKKHSCTRSYAFNPENYEKCVNLCNV
• Do these regions help proteins evolve fast?
• Are they the result of recombination events?
• Are they replication mistakes?
They make database
searches difficult!
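One simple way to flag low-complexity stretches such as the long serine run in the sequence above is to measure the Shannon entropy of a sliding window and mask windows that use very few distinct residues. This is a rough sketch of the idea only; it is not the SEG or DUST filter that BLAST programs use, and the window size and entropy cutoff are arbitrary assumptions.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Entropy in bits of the residue composition of a window."""
    counts = Counter(window)
    n = len(window)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def mask_low_complexity(seq, window=12, min_entropy=2.2):
    """Lowercase every residue covered by a window with entropy below the cutoff."""
    masked = [False] * len(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < min_entropy:
            for k in range(start, start + window):
                masked[k] = True
    return "".join(c.lower() if m else c for c, m in zip(seq, masked))


if __name__ == "__main__":
    fragment = "MKSYISLFFILC" + "S" * 20 + "ESLPANGPDSPT"
    print(mask_low_complexity(fragment))
```

Masked (lowercase) residues can then be skipped when generating seed words, so that a run of identical residues does not produce a flood of meaningless hits.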
“Small letters” denote low-complexity sequence
fragments that are ignored
PSI (Position Specific Iterated)
BLAST
• basic idea
– use results from BLAST query to construct a
HMM (without insertions and deletions)
– search database with this HMM
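An HMM without insertions and deletions amounts to a position-specific scoring profile: one column of residue scores per alignment position. The sketch below builds such a profile from ungapped, equal-length hits and slides it along a sequence. It is a simplified illustration, not PSI-BLAST's actual procedure: the uniform background frequencies, the flat pseudocount, and the function names are all assumptions, and sequence weighting is ignored.

```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(aligned_hits, pseudocount=1.0):
    """Log-odds score per position and residue from ungapped aligned hits."""
    length = len(aligned_hits[0])
    background = 1.0 / len(ALPHABET)  # assumption: uniform background
    profile = []
    for col in range(length):
        counts = {a: pseudocount for a in ALPHABET}
        for hit in aligned_hits:
            counts[hit[col]] += 1
        total = sum(counts.values())
        profile.append(
            {a: math.log2((counts[a] / total) / background) for a in ALPHABET}
        )
    return profile


def best_window(profile, sequence):
    """Return (score, start) of the best-scoring ungapped window."""
    width = len(profile)
    best = (float("-inf"), -1)
    for start in range(len(sequence) - width + 1):
        score = sum(profile[k][sequence[start + k]] for k in range(width))
        best = max(best, (score, start))
    return best


if __name__ == "__main__":
    hits = ["MQIFVK", "MQIFVR", "MKIFVK"]  # toy hits from a first search
    profile = build_profile(hits)
    print(best_window(profile, "GGSMQIFVKTL"))
```

In an iterated search, the new hits found with the profile would be added to the alignment and the profile rebuilt before the next round.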
PSI-BLAST iteration
[Figure: query sequence → BLAST search → database hits → HMM (rows A, C, D, ... Y) → HMM search → database hits → iterate]
Orthology/paralogy
[Figure: gene tree of TLR, TLR1 and TLR2]
Operational definition of orthology
Bi-directional best hit:
• Blast gene A in genome 1 against genome 2:
gene B is best hit
• Blast gene B against genome 1: if gene A is
best hit
→ A and B are orthologous
A number of other criteria are also in use (some of
which are based on phylogeny)
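The bi-directional best hit rule is easy to express once the best hit of every gene is known in each direction. The sketch below assumes the search results are already available as (query, subject, score) rows, for example parsed from tabular BLAST output; the function names and the gene identifiers are hypothetical.

```python
def best_hits(rows):
    """Map each query gene to its highest-scoring subject gene."""
    best = {}
    for query, subject, score in rows:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(genome1_vs_genome2, genome2_vs_genome1):
    """Pairs (A, B) where B is A's best hit and A is B's best hit."""
    forward = best_hits(genome1_vs_genome2)
    backward = best_hits(genome2_vs_genome1)
    return sorted((a, b) for a, b in forward.items() if backward.get(b) == a)


if __name__ == "__main__":
    # Hypothetical gene names and scores, for illustration only.
    g1_vs_g2 = [("A1", "B1", 250.0), ("A1", "B2", 90.0), ("A2", "B1", 120.0)]
    g2_vs_g1 = [("B1", "A1", 245.0), ("B1", "A2", 118.0), ("B2", "A1", 88.0)]
    print(reciprocal_best_hits(g1_vs_g2, g2_vs_g1))
```

In this toy example only A1 and B1 are each other's best hit; A2 also hits B1 best, but B1 prefers A1, so A2 is left without a putative ortholog.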
Impact of using PSI-BLAST
Purple sea urchin
genome is available
(November 2006)
Does sea urchin have Toll-like receptors?