Exercises for the lecture “Softwarewerkzeuge der Bioinformatik” (Software Tools of Bioinformatics), SS03
BLAST-Tutorial
A) Short introduction:
Setting                          Filter  Matrix    E value  Gap costs
default                          on      BLOSUM62  10       11, 1
quick search                     off     PAM30     1000     11, 1
special conditions:
  search for sequence families   on      BLOSUM62  1-10     11, 1
  BLAST without gaps             on      BLOSUM62  10       12, 1
1. Filter: The filter masks segments that have low compositional complexity.
Filtering can thus remove statistically significant but biologically
uninteresting hits from the output list. Switch it off for fast queries and
for queries where a large number of hits is important.
2. Exchange matrices: The BLOSUM62 matrix provides "all-round" qualities
compared to other matrices and tends to give the best results with default
settings. Even when low protein homology is expected, BLOSUM62 is a good
choice.
3. Gaps represent insertions and deletions that occurred during evolution.
To limit the number of gaps introduced while aligning the sequences, each
gap incurs a penalty. Gap penalties are split into a penalty for opening a
gap and a penalty for extending an existing gap. BLAST 2 inserts gaps
automatically. It is nevertheless possible to run an ungapped BLAST 2
alignment by choosing the highest penalty for opening a gap. This is useful
for alignments with fixed sequence length.
4. Alignment scores:
a. S value: Also known as the raw score. This value is the sum of the
scores of all residue comparisons between the two sequences, taking into
account the amino acid exchange matrix and the penalties for opening and
extending gaps.
b. P value: The P value tells how significant the S value is and offers an
efficient way of distinguishing true homologies from chance
similarities. If many random sequences reach the same or a higher S value,
then the P value is high as well and indicates a low significance for the
S value of the sequence.
c. E value: The E value additionally considers the size of the database.
In contrast to the P value, the E value gives the expected number of
sequences that would score at least as high as the alignment by chance.
The E value is derived by multiplying the P value by the number of
sequences in the chosen database. A lower value indicates a higher
significance for the score.
d. Choosing 10 for the E value means that you accept hits for which around
10 sequences in the database would be expected to reach at least the same
score by chance.
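The raw score and the E value described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not BLAST's implementation: `toy_score` is a hypothetical stand-in for a real exchange matrix such as BLOSUM62, the gap cost convention (existence cost plus gap length times extension cost) is assumed, and `expected_hits` follows the simplified description given in this tutorial (P value times number of database sequences).

```python
def toy_score(a, b):
    """Hypothetical exchange scores: +4 for identical residues, -1 otherwise."""
    return 4 if a == b else -1


def gap_penalty(length, existence=11, extension=1):
    """Assumed affine gap cost: one opening cost plus a cost per gapped position."""
    return existence + length * extension


def raw_score(aligned_a, aligned_b, score_pair=toy_score, existence=11, extension=1):
    """Raw score S of an alignment given as two equal-length strings ('-' = gap)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    total = 0
    gap_len = 0       # length of the gap currently being read
    gap_side = None   # which sequence carries the current gap
    for a, b in zip(aligned_a, aligned_b):
        side = "a" if a == "-" else "b" if b == "-" else None
        if side != gap_side and gap_len:
            # The current gap ended (or switched sides): charge it once.
            total -= gap_penalty(gap_len, existence, extension)
            gap_len = 0
        gap_side = side
        if side is None:
            total += score_pair(a, b)
        else:
            gap_len += 1
    if gap_len:
        total -= gap_penalty(gap_len, existence, extension)
    return total


def expected_hits(p_value, n_sequences):
    """E value as described in the text: P value times database size."""
    return p_value * n_sequences


# Five identical pairs (5 * 4 = 20) minus one gap of length 2 (11 + 2 = 13).
print(raw_score("MKT--LV", "MKTAALV"))  # → 7
print(expected_hits(0.001, 10000))
```

With the toy scores, raising the existence cost makes every gap more expensive, which is the mechanism behind the "BLAST without gaps" setting.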
B) A simple query:
1. Go to http://www.ncbi.nlm.nih.gov/BLAST/
2. Under Protein, choose Protein-protein BLAST (blastp).
3. This is the BLASTP web interface. To paste a sequence, we first need to
retrieve a protein sequence from SRS. Find the sequence of the protein with
the accession number P00042. Copy the sequence and paste it into the
"Search" box. Make sure the sequence is shown in FASTA format. (Hint: use
"View" in SRS.)
4. Choose a database; here, use swissprot. What are the other databases good for?
5. Clicking on BLAST! launches the query.
6. A new window now opens in which you can customize the output. For now,
simply use the defaults and click on Format! to process it.
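To check that a pasted sequence really is in FASTA format (a header line starting with ">" followed by sequence lines), a small parser along these lines can help. It is only a sketch: the function name `parse_fasta` and the example record are hypothetical, and the sequence shown is a placeholder, not the real P00042 entry.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is None:
            raise ValueError("sequence data found before a '>' header line")
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Placeholder record for illustration only.
example = """>example_protein placeholder record
MKTAYIAKQR
QISFVKSHFS
"""

for name, sequence in parse_fasta(example):
    print(name, len(sequence))
```

If the function raises a `ValueError`, the pasted text is missing its header line and should be copied again from the FASTA view.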
C) An extended query:
1. First we need the protein sequence with the description number MJ0577,
which is from the organism Methanococcus jannaschii.
2. Copy and paste the sequence in FASTA format into the "Search" box.
3. Choose a database. Which database would you prefer for a detailed alignment?
4. Enter "1" for the E value in the Expect box. Reducing the E value from 10
to 1 leads to a much stricter search that reports only more significant
hits. This may need somewhat more calculation time.
5. Leave the Low complexity filter checked so that the search hides
biologically unimportant but statistically significant hits. The other
filters are not yet developed properly, so leave them unchecked for now.
6. Choose the BLOSUM62 matrix and use a gap existence cost of 11 and a gap
extension cost of 1 (these should be the defaults).
7. Define the output in the Format category. For now, leave everything at
the defaults.
8. Click on BLAST! to process.
9. Again, just click on Format! to leave all views at their defaults. This
might take some time now. Feel free to look around the previous page and
ask the tutor as many questions as you can find.
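The settings chosen in the steps above can also be written down as a command line. The sketch below assumes the NCBI BLAST+ command-line program `blastp`, which this tutorial itself does not use (it works with the web interface); the helper `build_blastp_command` and the file name `mj0577.fasta` are hypothetical. The function only assembles the argument list and does not run anything.

```python
def build_blastp_command(query_file, database, evalue=1.0, matrix="BLOSUM62",
                         gap_open=11, gap_extend=1, low_complexity_filter=True):
    """Assemble a blastp argument list that mirrors the web form settings."""
    return [
        "blastp",
        "-query", query_file,            # FASTA file with the query sequence
        "-db", database,                 # database to search
        "-evalue", str(evalue),          # Expect threshold (step 4)
        "-matrix", matrix,               # exchange matrix (step 6)
        "-gapopen", str(gap_open),       # gap existence cost (step 6)
        "-gapextend", str(gap_extend),   # gap extension cost (step 6)
        "-seg", "yes" if low_complexity_filter else "no",  # filter (step 5)
    ]


# Hypothetical file name; the database name is the one used in part B.
command = build_blastp_command("mj0577.fasta", "swissprot")
print(" ".join(command))
```

Keeping the settings in one place like this makes it easy to repeat a search later with a single parameter changed, for example the E value.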
E) Interpretation of the results:
The graphical overview of the alignments shows at most 50 sequences.
Clicking a line takes you to the corresponding pairwise alignment. The
color of each line indicates the score. Scrolling down shows the sequences
in descending order of their degree of homology. The list contains 100
sequences, according to the value next to the field Descriptions. Further
down you will find all 50 pairwise alignments (the number in the field
Alignments).
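How the result page is assembled from the hit list can be sketched as follows. The `Hit` type, the function `build_report`, and the example hits are hypothetical; the limits of 100 descriptions and 50 alignments are the values mentioned above, and ordering by score is an assumption about how "descending homology" is determined.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    name: str
    score: float
    evalue: float


def build_report(hits, expect=10.0, max_descriptions=100, max_alignments=50):
    """Keep hits within the Expect threshold, best score first, then truncate."""
    kept = [hit for hit in hits if hit.evalue <= expect]
    kept.sort(key=lambda hit: hit.score, reverse=True)
    return kept[:max_descriptions], kept[:max_alignments]


# Made-up hits for illustration only.
hits = [
    Hit("seq_a", 52.0, 0.5),
    Hit("seq_b", 310.0, 1e-80),
    Hit("seq_c", 30.0, 25.0),   # above the Expect threshold, dropped
]
descriptions, alignments = build_report(hits, expect=10.0)
print([hit.name for hit in descriptions])  # → ['seq_b', 'seq_a']
```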
PSI-BLAST Tutorial
A) Short introduction:
The additional sensitivity of this program compared with BLAST derives
from a profile, which is either generated automatically or supplied
manually as a PSSM (position-specific scoring matrix). This profile
contains the frequencies with which specific amino acids appear at
specific positions in the protein sequence. These frequencies are derived
from a multiple sequence alignment of the highest-scoring sequences that
pass a threshold in the first iteration of the PSI-BLAST search. Highly
conserved positions therefore get a higher score than the amino acid
exchange matrix alone would give them. PSI-BLAST (position-specific
iterative BLAST) can be used to look for distant members of a protein
family whose relationship is not apparent from direct sequence
comparisons. You can also use PSI-BLAST with hypothetical proteins to
infer their function, even if they are not annotated in any database. The
interfaces of PSI-BLAST and BLAST are identical; PSI-BLAST just offers
some additional options.
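The idea of a position-specific profile can be illustrated with a short Python sketch. It is a simplified illustration, not the PSI-BLAST algorithm: the function names, the toy alignment, the uniform background frequency of 1/20, and the small pseudocount are all assumptions made for this example.

```python
import math
from collections import Counter


def column_frequencies(alignment):
    """Relative amino acid frequencies for each alignment column (gaps ignored)."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    profile = []
    for column in zip(*alignment):
        counts = Counter(res for res in column if res != "-")
        total = sum(counts.values())
        profile.append({res: n / total for res, n in counts.items()} if total else {})
    return profile


def position_score(freqs, residue, background=1 / 20, pseudocount=0.01):
    """Log-odds score of a residue at one position (assumed uniform background)."""
    return math.log2((freqs.get(residue, 0.0) + pseudocount) / background)


# Toy alignment: column 0 is fully conserved, column 1 is not.
alignment = ["MKLV", "MKIV", "MRLV"]
profile = column_frequencies(alignment)
print(profile[0])
print(position_score(profile[0], "M") > position_score(profile[1], "K"))
```

The fully conserved first column scores its residue higher than the mixed second column scores its most common residue, which is the effect the profile adds on top of a general exchange matrix.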
B) An example:
1. Proceed as in part C of the BLAST tutorial, but choose PHI- and PSI-BLAST.
2. Insert the protein sequence of MJ0577 from the organism Methanococcus
jannaschii into the Search box.
3. Pick the most suitable database.
4. For the E value (Expect) under Options, choose 1 instead of 10.
5. Leave all other settings at their defaults and go to the PSI-BLAST settings.
6. Under Format for PSI-BLAST, choose "with inclusion threshold" and use a
threshold of 0.001.
7. Launch the first iteration.
8. The results shown come from a regular BLAST search, so there should not be
any difference from the previous exercise with BLAST.
9. Launching another iteration will change the results. Compare the results and
find out what changed. What happens after you launch yet another iteration?
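Comparing two iterations by hand is easier with a small helper such as the one sketched below. The function names and the hit lists are hypothetical, and treating a hit as included when its E value is at or below the threshold of 0.001 is an assumption about how the inclusion threshold is applied.

```python
def included(hits, threshold=0.001):
    """Names of hits whose E value is at or below the inclusion threshold."""
    return {name for name, evalue in hits.items() if evalue <= threshold}


def compare_iterations(previous, current, threshold=0.001):
    """Report which hits were gained, lost, or kept between two iterations."""
    before = included(previous, threshold)
    after = included(current, threshold)
    return {
        "new": sorted(after - before),
        "lost": sorted(before - after),
        "kept": sorted(before & after),
    }


# Made-up E values for illustration only.
iteration_1 = {"seq_a": 1e-30, "seq_b": 5e-4, "seq_c": 0.2}
iteration_2 = {"seq_a": 1e-45, "seq_b": 2e-6, "seq_c": 8e-4, "seq_d": 1e-5}
print(compare_iterations(iteration_1, iteration_2))
```

When an iteration reports no new hits, the search has converged and further iterations will not change the results.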