Sequence database similarity search
- Input: sequence query
- Output: list of similar sequences (“hits”) found in the database
- Sequence database similarity search implies pairwise alignments of the query to all entries in the database.
- A straightforward dynamic programming algorithm is too slow for this: its running time grows with the product of the query length and the total length of the database.
- A faster search can be realized by searching for “words”: stretches of similar oligomers in two sequences (the query and a subject sequence from the database).
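To make the cost of the straightforward approach concrete, here is a minimal sketch of a local alignment score computed by dynamic programming (Smith-Waterman style) and applied to every database entry. The scoring values (match 2, mismatch -1, linear gap -2) and the function names are illustrative assumptions, not taken from the slides.

```python
def smith_waterman_score(query, subject, match=2, mismatch=-1, gap=-2):
    """Best local alignment score with a linear gap penalty.

    Fills len(query) * len(subject) cells, which is what makes a full
    database scan slow.
    """
    prev = [0] * (len(subject) + 1)
    best = 0
    for q in query:
        curr = [0]
        for j, s in enumerate(subject, start=1):
            diag = prev[j - 1] + (match if q == s else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def scan_database(query, database):
    """Score the query against every entry and sort hits by score."""
    scores = {name: smith_waterman_score(query, seq) for name, seq in database.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    toy_database = {"a": "TTACGTTT", "b": "GGGG"}  # hypothetical entries
    print(scan_database("ACGT", toy_database))
```

Every cell of every matrix is computed, even for entries that share nothing with the query. Word-based methods avoid most of this work by looking only where short matches occur.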
FASTA
(Pearson & Lipman, 1988; Proc. Natl. Acad. Sci. USA 85)
[Figure: FIG. 1 of Pearson & Lipman (1988), dot-matrix panels A to D]
FIG. 1. Identification of sequence similarities by FASTA. The four steps used by the FASTA program to calculate the initial and optimal similarity scores between two sequences are shown. (A) Identify regions of identity. (B) Scan the regions using a scoring ...
(A) A search for “words”, for instance with at least 4 consecutive matches. Then any diagonal region in the matrix can be scored using some formula that takes into account the number of words and the distances between them. The best (say, 10) diagonal regions are selected at this step.
(B) The best 10 regions are rescored using a substitution matrix, allowing conservative substitutions and shorter runs of identities to contribute to the similarity score. The subregions with maximal scores for each of the best diagonals (“initial regions”) are identified.
(C) The algorithm joins some of the compatible initial regions, computing an optimal combination.
(D) Within some band centered around the highest scoring initial region, the alignment is recalculated using a dynamic programming algorithm.
Excerpt from the same page of the paper:
... only the band around each initial region but also potential sequence alignments for some distance before and after the initial region. Starting at the end of the initial region, an optimization (6) proceeds in the reverse direction until all possible alignment scores have gone to zero. The location of the maximal local similarity score in the reverse direction is then used to start a second optimization that proceeds in the forward direction. An optimal path starting from the forward maximum is then displayed (5). The local homologies can be displayed as sequence alignments (see Fig. 2B) or on a two-dimensional graphic matrix style plot (see Figs. 2A and 3).
Statistical Significance. The rapid sequence comparison algorithms we have developed also provide additional tools for evaluating the statistical significance of an alignment. There are approximately 5000 protein sequences, with 1.1 million amino acid residues, in the NBRF protein sequence library, and any computer program that searches the library by calculating a similarity score for each sequence in the library will find a highest scoring sequence, regardless of whether the alignment between the query and library sequence is biologically meaningful or not. Accompanying the previous version of FASTP was a program for the evaluation of statistical significance, RDF, which compares one sequence with randomly permuted versions of the potentially related sequence.
We have written a new version of RDF (RDF2) that has several improvements. (i) RDF2 calculates three scores for each shuffled sequence: one from the best single initial region (as found by FASTP), a second from the joined initial regions (used by FASTA), and a third from the optimized diagonal.
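The first FASTA step, finding identical words and grouping them by diagonal, can be sketched as follows. This is a simplified illustration: diagonals are ranked by word count alone, whereas the actual program also accounts for the distances between words. The function names and the word length k are assumptions.

```python
from collections import defaultdict


def word_positions(seq, k):
    """Map every k-letter word in seq to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def diagonal_hits(query, subject, k=2):
    """Count identical k-letter words on each diagonal d = subject_pos - query_pos."""
    index = word_positions(query, k)
    counts = defaultdict(int)
    for j in range(len(subject) - k + 1):
        for i in index.get(subject[j:j + k], ()):
            counts[j - i] += 1
    return dict(counts)


def best_diagonals(query, subject, k=2, top=10):
    """Return the `top` diagonals with the most word hits as (diagonal, count)."""
    counts = diagonal_hits(query, subject, k)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:top]


if __name__ == "__main__":
    print(best_diagonals("ACGTACGT", "TTACGTACGT"))
```

Only the query is indexed. The subject is then scanned once, so the work is proportional to the number of word hits rather than to the full matrix.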
BLAST: Basic Local Alignment Search Tool
BLAST is the most popular program for sequence database similarity search.
First publication: Altschul et al. (1990).
Main strategy:
- Search a subject sequence from the database for “words” of size at least W that score (S) at least T when compared to a word in the query.
- If such a word is found, the BLAST algorithm attempts to extend it and improve the score S.
- The algorithm is designed for local alignments: if further extension does not improve S, the alignment region between the query and the subject sequence (“sequence hit”) with the maximal S is returned to the user.
- The result of BLAST is a list of hits, ordered according to their significance (E-values).
[Figure: a word of size W with score S ≥ T between Seq1 and Seq2]
Say, searching with a query:
...FDRIGDGETKLVTPVPT...
“w-mers”: words that score at least T when compared to some word (e.g. VTP) in the query.
With W=3; T=11 and BLOSUM62 matrix, w-mer scores calculated for VTP:
VTP 16
ITP 15
LTP 13
MTP 13
ATP 12
TTP 12
CTP 11
FTP 11
YTP 11
VSP 12
VAP 11
VNP 11
VVP 11
Subject ...VDQHGAPPEQRITPRQQ...
contains ITP (S=15) => the algorithm proceeds with the extension phase
(e.g. alignment by dynamic programming)
Query ...FDRIGDGETKLVTPVPT...
Sbjct ...VDQHGAPPEQRITPRQQ...
Score improved?
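The w-mer list for VTP can be reproduced by enumerating all 8000 three-letter words and keeping those that score at least T. The sketch below embeds only the three BLOSUM62 rows needed for V, T and P (typed in by hand, so worth checking against a full matrix). The function name `neighborhood` is an assumption.

```python
from itertools import product

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# BLOSUM62 rows for the three query residues only, in AMINO_ACIDS order.
BLOSUM62_ROWS = {
    "V": [0, -3, -3, -3, -1, -2, -2, -3, -3, 3, 1, -2, 1, -1, -2, -2, 0, -3, -1, 4],
    "T": [0, -1, 0, -1, -1, -1, -1, -2, -2, -1, -1, -1, -1, -2, -1, 1, 5, -2, -2, 0],
    "P": [-1, -2, -2, -1, -3, -1, -1, -2, -2, -3, -3, -1, -2, -4, 7, -1, -1, -4, -3, -2],
}


def neighborhood(word, threshold):
    """Return {candidate: score} for all words scoring >= threshold against `word`."""
    scores = {}
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        score = sum(
            BLOSUM62_ROWS[q][AMINO_ACIDS.index(c)] for q, c in zip(word, candidate)
        )
        if score >= threshold:
            scores["".join(candidate)] = score
    return scores


words = neighborhood("VTP", 11)
for w, s in sorted(words.items(), key=lambda item: (-item[1], item[0])):
    print(w, s)
```

Besides the single-substitution words listed above, the exhaustive enumeration also finds ISP (3 + 1 + 7 = 11), a double substitution that still reaches the threshold.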
Word extension search in the original BLAST algorithm (Altschul et al., 1990)
[Figure: score plotted against extension length for a word hit]
HSP: high-scoring segment pair
X: significance decay
S: minimum score to return a hit in the output
T: word threshold
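A minimal sketch of ungapped word extension follows, reading X as the largest drop below the best score seen so far that is tolerated before extension stops. The match/mismatch values (+5/-4) and the function names are illustrative assumptions. A protein search would use a substitution matrix instead.

```python
def extend_right(query, subject, q_start, s_start, x_drop, match=5, mismatch=-4):
    """Extend without gaps to the right of (q_start, s_start).

    Returns (best_score, best_length). Extension stops once the running
    score falls more than x_drop below the best score seen so far.
    """
    score = best = best_len = length = 0
    while q_start + length < len(query) and s_start + length < len(subject):
        same = query[q_start + length] == subject[s_start + length]
        score += match if same else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > x_drop:
            break
    return best, best_len


def extend_hit(query, subject, q_pos, s_pos, word_len, x_drop=10, match=5, mismatch=-4):
    """Extend a word hit in both directions; return (score, (query_start, query_end))."""
    seed = sum(
        match if a == b else mismatch
        for a, b in zip(query[q_pos:q_pos + word_len], subject[s_pos:s_pos + word_len])
    )
    right, r_len = extend_right(query, subject, q_pos + word_len, s_pos + word_len,
                                x_drop, match, mismatch)
    # Extending to the left is the same as extending to the right on reversed prefixes.
    left, l_len = extend_right(query[:q_pos][::-1], subject[:s_pos][::-1], 0, 0,
                               x_drop, match, mismatch)
    return seed + left + right, (q_pos - l_len, q_pos + word_len + r_len)


if __name__ == "__main__":
    print(extend_hit("GGACGTACGG", "CCACGTACCC", 3, 3, 3))
```

The segment returned is the one with the maximal score, not the point where extension stopped, which matches the idea of reporting the HSP with maximal S.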
The statistics of pairwise alignments
Expected number (E-value) of ungapped HSPs with score at least S in the alignment of sequences with sufficiently large lengths m and n:
$E = K m n \, e^{-\lambda S}$,
where K and λ depend on the scoring system and monomer frequencies.
Normalized raw score
$S' = (\lambda S - \ln K) / \ln 2$
is a “bit score” characterizing HSP significance: $E = m n \, 2^{-S'}$
(not dependent on the scoring system).
For gapped local alignments the statistics can be determined from large-scale comparisons of quasi-random sequences.
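The two forms of the E-value can be checked numerically. In this sketch the values K = 0.134 and λ = 0.318 are assumed for illustration only (parameters of roughly this size are reported for ungapped BLOSUM62 scoring). The sequence lengths are hypothetical.

```python
import math


def e_value(raw_score, m, n, K, lam):
    """E = K * m * n * exp(-lambda * S)."""
    return K * m * n * math.exp(-lam * raw_score)


def bit_score(raw_score, K, lam):
    """S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)


def e_value_from_bits(bits, m, n):
    """E = m * n * 2^(-S'), independent of the scoring system."""
    return m * n * 2.0 ** (-bits)


if __name__ == "__main__":
    K, lam = 0.134, 0.318        # assumed illustrative parameters
    m, n = 300, 1_000_000        # hypothetical query and database lengths
    S = 50
    print(e_value(S, m, n, K, lam))
    print(e_value_from_bits(bit_score(S, K, lam), m, n))  # same value
```

Substituting S' into the second formula gives back K m n exp(-λS), which is why bit scores from different scoring systems can be compared directly.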
Global alignments: no general statistical theory. Significance can be
determined by generating a large number of alignments of permuted
sequences (the same lengths and monomer frequencies as those of
sequences in question).
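The permutation approach for global alignments can be sketched as below. One sequence is shuffled repeatedly, which preserves its length and monomer frequencies, and the observed score is compared with the scores of the shuffled alignments. The scoring values (match 1, mismatch -1, linear gap -2), the number of permutations and the function names are assumptions.

```python
import random


def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score with a linear gap penalty."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        curr = [i * gap]
        for j, y in enumerate(b, start=1):
            curr.append(max(prev[j - 1] + (match if x == y else mismatch),
                            prev[j] + gap,
                            curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def permutation_p_value(a, b, n_perm=200, seed=0):
    """Return (observed_score, empirical p-value) from shuffles of b."""
    rng = random.Random(seed)
    observed = global_score(a, b)
    letters = list(b)
    at_least_as_good = 0
    for _ in range(n_perm):
        rng.shuffle(letters)  # same length and composition as b
        if global_score(a, "".join(letters)) >= observed:
            at_least_as_good += 1
    # Add-one correction so the estimate is never exactly zero.
    return observed, (at_least_as_good + 1) / (n_perm + 1)


if __name__ == "__main__":
    print(permutation_p_value("ACGTACGTACGTTTGCA", "ACGTACGTACGATTGCA"))
```

The resolution of the estimate is limited by the number of permutations: with 200 shuffles the smallest reportable p-value is 1/201.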
Gapped BLAST
(Altschul et al., 1997)
Two-hit approach: initial search for two non-overlapping hits of score at least T, within a distance A of one another on a diagonal in sequence space.
[Figure: two word hits of size W, each with S ≥ T, within distance A on the same diagonal]
Ungapped extension: if the ungapped extension scores better than some threshold Sg (S’ ≥ Sg), gapped extension is triggered. Sg is chosen, e.g., so that not more than one gapped extension is invoked per 50 database sequences, corresponding to Sg = 22 bits.
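The two-hit criterion can be sketched by remembering the most recent hit on each diagonal. The input is a list of (query position, subject position) word hits. The function name, the tuple layout of the result and the handling of overlapping hits (the later one is ignored) are assumptions made for this illustration.

```python
def two_hit_seeds(hits, word_len, window):
    """Find pairs of non-overlapping word hits on the same diagonal.

    hits: iterable of (query_pos, subject_pos) start positions, in any order.
    Returns a list of (diagonal, first_subject_pos, second_subject_pos) for
    pairs whose start positions are at most `window` apart (the distance A).
    """
    last_on_diagonal = {}
    seeds = []
    for q_pos, s_pos in sorted(hits, key=lambda hit: hit[1]):
        diagonal = s_pos - q_pos
        previous = last_on_diagonal.get(diagonal)
        if previous is not None:
            distance = s_pos - previous
            if distance < word_len:
                continue  # overlaps the previous hit, so ignore it
            if distance <= window:
                seeds.append((diagonal, previous, s_pos))
        last_on_diagonal[diagonal] = s_pos
    return seeds


if __name__ == "__main__":
    example_hits = [(0, 10), (1, 11), (5, 15), (40, 50), (3, 30)]  # hypothetical hits
    print(two_hit_seeds(example_hits, word_len=3, window=20))
```

Requiring two nearby hits lets the word threshold T be lowered for sensitivity while most isolated hits are still discarded before any extension is attempted.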
BMC Bioinformatics 2009, 10:421
http://www.biomedcentral.com/1471-2105/10/4
BLAST+
(Camacho et al., 2009)
[Figure 1: flowchart of a BLAST search in three phases]
Setup: read query, read options, mask query, build lookup table.
Scanning: find word matches, gap-free extensions, gapped extensions, save hits (decision points: "Matches?", "More sequence?").
Trace-back: calculate improved score and insertions/deletions.
Figure 1. Schematic of a BLAST search. The first phase is "setup". The query is read, low-complexity or other filtering might be ...
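The "mask query" and "build lookup table" steps of the setup phase, followed by the "find word matches" step of scanning, might look like this in miniature. The sketch assumes that masked residues are marked by lowercase letters and that only exact word matches are looked up. Neither detail is taken from the slide, and the function names are hypothetical.

```python
from collections import defaultdict


def build_lookup_table(query, word_len):
    """Index all query words, skipping any word that touches a masked (lowercase) letter."""
    table = defaultdict(list)
    for i in range(len(query) - word_len + 1):
        word = query[i:i + word_len]
        if word.isupper():
            table[word].append(i)
    return dict(table)


def find_word_matches(table, subject, word_len):
    """Scan the subject once and return (query_pos, subject_pos) for every word hit."""
    hits = []
    for j in range(len(subject) - word_len + 1):
        for i in table.get(subject[j:j + word_len], ()):
            hits.append((i, j))
    return hits


if __name__ == "__main__":
    table = build_lookup_table("ACGTacgtACGT", 3)  # lowercase = masked region
    print(table)
    print(find_word_matches(table, "TTACGT", 3))
```

Building the table once per query means each database sequence is scanned in a single pass, and masked regions never seed a hit.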
Exercises
(Sequence databases, sequence alignment, sequence database similarity search)
1. The NCBI reference sequence of human beta-globin mRNA has the accession NM_000518. What is the accession number of the encoded protein? How many amino acids does it contain?
2. A plant transcription factor, TOE1 (TARGET OF EARLY ACTIVATION TAGGED 1), was recently suggested to regulate the development of specialized organs called nodules in legume plants, e.g. in soybean. Using the TOE1 sequence from the model organism Arabidopsis thaliana (NP_001189625), search for similar proteins in soybean (Glycine max) using the BLAST program. How many hits to Glycine max proteins does the BLAST search yield? What are the name and accession number of the protein returned as the best G. max hit? Is its similarity to the TOE1 query significant? Give the E-value of this alignment, the number of identical amino acids and the number of conservative substitutions (with positive scores).
3. The following DNA sequence fragment, containing a mutation, was isolated from a patient:
tttgctccccgcgcgctgtttttctcagtgactttcagcgggcggaaaag
In which gene is the mutation located? On which chromosome? How many nucleotides are changed?
4. A special option of BLAST for pairwise alignment of two sequences (bl2seq) is sometimes a quick way to determine the similarity between two closely related sequences.
(a) For instance, determine the percentage identity in the alignment of two Zika virus genomes: that of an isolate from Suriname, October 2015 (accession KU312312), and the reference genome (NC_012532).
- Also, using bl2seq for proteins, determine the identities and positives in the alignment of the polyproteins encoded by these two genomes. Locate the amino acid positions of the gaps in the alignment: which residues are inserted (or deleted) in the Suriname isolate as compared to the reference genome?
- Are there insertions or deletions in the Suriname isolate polyprotein as compared to the polyprotein of Zika virus from the French Polynesia outbreak in 2013 (KJ776791)?
(b) Is this BLAST option also optimal for the proteins of assignment 2: A. thaliana TOE1 (NP_001189625) and its homolog from G. max? What is the alternative algorithm/program in this case? Determine the percentage of similarity between these two proteins.
5. It has been reported that a transcript annotated as a long non-coding RNA in the mouse genome encodes a peptide of 34 amino acids with the following sequence:
MAEKESTSPHLIVPILLLVGWIVGCIIVIYIVFF
It was also suggested that a transcript annotated as a long non-coding RNA in the human genome (accession NR_037902) might also contain a small open reading frame (ORF) encoding a similar peptide. Determine the nucleotide positions of this ORF in the human transcript, the sequence of the peptide and its length.
6. Using the ENTREZ Gene database, determine the differences between the alternative splicing isoforms of the human microtubule-associated protein tau (MAPT, GeneID 4137). How many exons does the tau gene contain according to the RefSeqGene data? How many exons do the alternative transcripts lack?
7. Calculate pairwise alignments of two homologous segments (PA segments, accessions CY046942 and EF626633) of influenza A and B viruses using different algorithms:
(a) Make the global optimal alignment with the needle algorithm (www.ebi.ac.uk/Tools/emboss/align/);
(b) Using the BLAST option for two sequences (bl2seq), align these two nucleotide sequences with the blastn algorithm;
(c) Use bl2seq again, but with tblastx: alignment of translated nucleotide sequences.
What are the main differences between the alignment results? Try to explain the origins of these differences. What are the advantages and disadvantages of each of these approaches in this case?
Send your solutions to [email protected]