Exercises for the lecture “Softwarewerkzeuge der Bioinformatik” (Software Tools for Bioinformatics), SS03

BLAST Tutorial

A) Short introduction:

                                 Filter   Matrix     E value   Gap costs
default                          on       BLOSUM62   10        11, 1
quick search                     off      PAM30      1000      11, 1
special conditions:
  Search for sequence families   on       BLOSUM62   1-10      11, 1
  BLAST without gaps             on       BLOSUM62   10        12, 1

1. Filter: The filter masks off segments that have low compositional complexity. Filtering can thus eliminate statistically significant but biologically uninteresting reports from the output list. Switch it off for fast queries and for queries where a large number of reported hits matters.

2. Exchange matrices: The BLOSUM62 matrix provides “all-round” qualities compared to other matrices and tends to give the best results as the default. Even when low protein homology is expected, BLOSUM62 seems to be a good choice.

3. Gaps: Gaps represent insertions and deletions that happened during evolution. To limit the insertion of gaps while aligning the sequences, each gap incurs a penalty. Gap scores are separated into a penalty for opening a gap and a penalty for extending an existing gap. BLAST 2 inserts gaps automatically. It is nevertheless possible to launch an ungapped BLAST 2 alignment by choosing the highest penalty for opening a gap. This is interesting for alignments with fixed sequence length.

4. Alignment scores:
   a. S value: Also known as the raw score. This value is the sum of the scores of all residue comparisons between the two sequences, taking into account the amino acid exchange matrix and the penalties for gap opening and gap extension.
   b. P value: The P value tells how significant the S value is and offers an efficient way of distinguishing true homologies from chance similarities. If many random sequences would reach the same or a higher S value, the P value is high as well, which indicates low significance for the S value of the sequence.
   c. E value: The E value additionally considers the size of the database. In contrast to the P value, the E value gives the expected number of sequences that would be found by chance with a score at least as high as that of the scored alignment. The E value is derived by multiplying the P value by the number of sequences in the chosen database. A lower value indicates higher significance for the score.
   d. Choosing 10 for the E value means that you expect around 10 randomly found sequences in the database to have at least the same score.

B) A simple query:

1. Go to http://www.ncbi.nlm.nih.gov/BLAST/
2. Under Protein, choose Protein-protein BLAST (blastp).
3. This is the BLASTP web interface. To have a sequence to paste, we first need to retrieve a protein sequence from SRS. Find the sequence of the protein with the AccNumber P00042. Copy the sequence and paste it into the “Search” box. Make sure the sequence is shown in FASTA format. (Hint: use “View” in SRS.)
4. Choose a database, here swissprot. What are the other databases good for?
5. Clicking on BLAST! launches the query.
6. A new window now appears in which you can customize the output. For now simply use the defaults and click on Format! to process it.

C) An extended query:

1. First we need the protein sequence with the Description-Number MJ0577 from the organism Methanococcus jannaschii.
2. Copy and paste the sequence in FASTA format into the “Search” box.
3. Choose a database. Which database would you prefer for a detailed alignment?
4. Choose “1” for the E value in the Expect box. Reducing the E value from 10 to 1 makes the search much more stringent, so that only hits with more significant scores are reported. This may need some more calculation time.
5. Leave the Low complexity filter marked so that the alignment hides biologically unimportant but statistically significant hits. The other filters are not yet developed properly, so leave them unmarked for now.
6. Choose the BLOSUM62 matrix and use a gap existence cost of 11 and a gap extension cost of 1 (this should be the default).
7. Define the output in the Format category. For now just leave everything at the defaults.
8. Click on BLAST! to process.
9. Again just click on Format! to leave all views at their defaults. This might take some time now. Feel free to explore the previous page and ask the tutor as many questions as you like.

E) Interpretation of the results:

The graphical overview of the alignment shows at most 50 sequences. Clicking a line leads you to the corresponding pairwise alignment. The color of the lines indicates the score. Scrolling down shows the sequences in descending order of homology. You will find a list of 100 sequences, according to the value next to the field Descriptions. Further down you will find all 50 pairwise alignments (the number in the field Alignments).

PSI-BLAST Tutorial

A) Short introduction:

The additional sensitivity of this program compared to BLAST derives from a profile, which is either generated automatically or supplied manually as a PSSM (position-specific scoring matrix). This profile contains the frequencies with which specific amino acids appear at specific positions in the protein sequence. These frequencies are derived from a multiple sequence alignment of the highest-scoring sequences from the first iteration of the PSI-BLAST search that pass a threshold. Highly conserved positions therefore get a higher score than they would from the amino acid exchange matrix alone.

PSI-BLAST (position-specific iterative BLAST) can be used when one is looking for distant members of a protein family whose relationship does not become apparent from direct sequence comparisons. You can also use PSI-BLAST with hypothetical proteins in order to infer their function, even if they are not annotated in any database. The interfaces of PSI-BLAST and BLAST are identical. PSI-BLAST just offers some further options.

B) An example:

1. Proceed as in part C of the BLAST Tutorial, but choose PHI- and PSI-BLAST.
2. Insert the protein sequence of MJ0577 from the organism Methanococcus jannaschii into the Search box.
3. Pick the most suitable database.
4. For the E value (Expect) under Options, choose 1 instead of 10.
5. Leave all other settings at their defaults and go to the PSI-BLAST settings.
6. Under Format for PSI-BLAST, choose “with inclusion threshold” and use a threshold of 0.001.
7. Launch the first iteration.
8. The results shown are derived from a regular BLAST search, so there should be no difference from the previous exercise with BLAST.
9. Launching another iteration will change the results. Compare the results and find out what changed. What happens after you launch another iteration?
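
To make the profile idea from the PSI-BLAST introduction more concrete, the following minimal Python sketch counts how often each amino acid appears at each position of a small multiple sequence alignment. The function name column_frequencies and the toy alignment are hypothetical and serve only as illustration. PSI-BLAST itself additionally applies sequence weighting and pseudocounts and converts the frequencies into scores, none of which is modeled here.

```python
from collections import Counter


def column_frequencies(alignment):
    """Return one dict per alignment column mapping residue -> relative frequency.

    Gap characters ('-') are ignored when counting (a simplifying assumption).
    """
    if not alignment:
        raise ValueError("alignment must contain at least one sequence")
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    profile = []
    for pos in range(length):
        # Collect the residues found in this column, skipping gaps.
        residues = [seq[pos] for seq in alignment if seq[pos] != "-"]
        counts = Counter(residues)
        total = sum(counts.values())
        # An all-gap column yields an empty frequency table.
        profile.append({aa: n / total for aa, n in counts.items()} if total else {})
    return profile


# Hypothetical toy alignment of four short sequences (not real data).
toy_alignment = ["MKVLA", "MKILA", "MRVLG", "MKV-A"]
profile = column_frequencies(toy_alignment)

for position, freqs in enumerate(profile, start=1):
    # List the most frequent residue first for each position.
    print(position, sorted(freqs.items(), key=lambda item: -item[1]))
```

In this toy alignment, position 1 contains only M and is fully conserved, whereas position 2 is split between K and R. A profile-based score can therefore reward a match at position 1 more strongly than a general exchange matrix would.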