Introduction to Bioinformatics Tutorial no. 2
BLAST
BLAST – Outline
• Sequence Alignment
• Complexity and indexing
• BLASTN and BLASTP
    • Basic parameters
    • PAM and BLOSUM matrices
    • Affine gap model
    • E Values (once again)
• Advanced BLAST
    • Databases
    • BLAST options
    • BLAST output
    • Taxonomic BLAST
    • Pairwise BLAST
BLAST Variations
Name      Query type           Database
blastn    Genomic              Genomic
blastp    Protein              Protein
blastx    Translated genomic   Protein
tblastn   Protein              Translated genomic
tblastx   Translated genomic   Translated genomic

Genomic translations test all 6 possible reading frames:
3 codon frames × 2 strands (forward and reverse complement)
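The six translations can be sketched in a few lines of Python. This is only an illustration of the idea, not the code BLAST itself uses; the function names (`six_frames`, `translate`, `reverse_complement`) are made up for this example, and the standard genetic code is assumed.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    """Return the reverse complement; letters other than ACGT are kept as-is."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons (e.g. with N) become X."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Return the 6 translations: frames +1..+3 and -1..-3."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    # First 18 nucleotides of query3 from the questions below.
    for name, protein in six_frames("ATGTCTGCTCCACAAGCC").items():
        print(name, protein)
```

A stop codon is shown as `*`, so a frame that is full of `*` characters is unlikely to be the coding one.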
BLASTN Databases
nr       GenBank, EMBL, DDBJ, PDB and NCBI reference sequences (RefSeq)
htgs     High-throughput genomic sequences (draft)
pat      Patented nucleotide sequences
mito     Mitochondrial sequences
vector   Vector subset of GenBank
month    GenBank, EMBL, DDBJ, PDB from the last 30 days
chrom    Contigs and chromosomes from RefSeq
BLASTP Databases
nr          GenBank CDS translations, RefSeq, PDB, SWISS-PROT, PIR, PRF
swissprot   SWISS-PROT
pat         Patented protein sequences
pdb         Protein Data Bank
month       GenBank CDS translations, PDB, SWISS-PROT, PIR, PRF from the last 30 days
BLASTN/P Options (1)
• Only search part of the database, using NCBI Entrez query format
• Search a specific organism
• Remove low-information content, e.g. short repeats or regions rich in only 2 nucleotides
• Remove known human repeats (LINEs, SINEs)
BLASTN/P Options (2)
• Threshold for the significance of results
• Use an index based on words of 7, 11 or 15 nucleotides
• Costs to open and extend a gap, and score for a nucleotide match or mismatch
• Allowed gap scores (open/extend): 10/1, 10/2, 11/1, 8/2, 9/2
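To make the word-index option concrete, here is a minimal sketch of indexing a sequence by words of fixed size and looking up exact word matches (seeds) for a query. It only illustrates the principle; the names `build_word_index` and `seed_hits` are invented for this example, and real BLAST goes on to extend each seed into a scored alignment.

```python
from collections import defaultdict


def build_word_index(seq, w=11):
    """Map every word of length w in seq to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - w + 1):
        index[seq[i:i + w]].append(i)
    return index


def seed_hits(query, index, w=11):
    """Return (query_pos, subject_pos) pairs where a query word occurs in the index."""
    return [
        (q, s)
        for q in range(len(query) - w + 1)
        for s in index.get(query[q:q + w], [])
    ]


if __name__ == "__main__":
    subject = "ACGTACGTTTGACCA"
    index = build_word_index(subject, w=4)
    print(seed_hits("TTGACC", index, w=4))
```

A smaller word size finds more seeds (more sensitive, slower); a larger one finds fewer (faster, less sensitive).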
BLASTP Options
• Scoring matrix: PAM, etc.
• Costs to open and extend a gap
• Search for a motif (PSI-BLAST)
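The open/extend gap costs above describe an affine gap model. A small sketch of how such a penalty grows with gap length follows; it assumes the convention in which a gap of length k costs open + k × extend (some textbooks charge open + (k − 1) × extend instead), and the function name `gap_cost` is made up for this example.

```python
def gap_cost(length, open_cost=11, extend_cost=1):
    """Affine penalty for one gap of the given length (assumed: open + length * extend)."""
    if length <= 0:
        return 0
    return open_cost + extend_cost * length


if __name__ == "__main__":
    for k in (1, 2, 5):
        print(k, gap_cost(k), gap_cost(k, open_cost=10, extend_cost=2))
```

With 11/1, one gap of length 5 costs 16, while five separate gaps of length 1 cost 60, which is why the model favors a few long gaps over many short ones.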
BLASTN/P Formatting (1)
• Show colored bar chart
• Other (less important) options on what to show
• Number of sequences listed
• Number of alignments shown
BLASTN/P Formatting (2)
• How to display alignments
• Only show results which match an Entrez search or are from a specific organism
• Only show results with E values in this range
BLASTN Results
• Query sequence representation
• Matched areas of database sequences
BLAST Output Header
• Request ID for later retrieval
• Query sequence details
• Database details
• Tax BLAST
BLAST Alignments (1)
• Sequence identifier
• Sequence description
• Score and E value
BLAST Alignments (2)
• Normalized score of alignment
• Expected number of such hits (2e-11 = 2 × 10^-11)
• Number of insertions / deletions
• Number of exact matches
• Number of matches with positive score
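The exact-match and insertion/deletion counts can be recomputed from the two aligned rows of an alignment. The sketch below assumes gaps are written as `-` and uses an invented helper name, `alignment_stats`; counting matches with positive score is left out because it needs a scoring matrix.

```python
def alignment_stats(query_row, subject_row):
    """Return (identities, gaps, alignment_length) for two equally long aligned rows."""
    if len(query_row) != len(subject_row):
        raise ValueError("aligned rows must have the same length")
    pairs = list(zip(query_row, subject_row))
    identities = sum(q == s and q != "-" for q, s in pairs)
    gaps = sum(q == "-" or s == "-" for q, s in pairs)
    return identities, gaps, len(pairs)


if __name__ == "__main__":
    ident, gaps, length = alignment_stats("ACG-TA", "ACGGTT")
    print(f"Identities = {ident}/{length}, Gaps = {gaps}/{length}")
```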
BLAST Alignments (3)
• Insertion / deletion
• Exact match
• Query sequence
• Mismatch with positive score
• Matched sequence
• Position within sequence
• Masked low complexity region
Expectation Values
• Increases linearly with length of query sequence
• Increases linearly with length of database
• Decreases exponentially with score of alignment
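These three dependencies can be checked numerically. The sketch below assumes the usual relation between the E value and the normalized (bit) score S', E = m × n × 2^(−S'), where m is the query length and n the database length; the function name `expect_value` is invented for this example, and real BLAST additionally applies length corrections that are ignored here.

```python
def expect_value(bit_score, query_len, db_len):
    """E = m * n * 2**(-S'), with S' the bit score (no edge-effect corrections)."""
    return query_len * db_len * 2.0 ** (-bit_score)


if __name__ == "__main__":
    base = expect_value(40, 500, 10**9)
    print(base)
    print(expect_value(40, 1000, 10**9) / base)  # query twice as long
    print(expect_value(41, 500, 10**9) / base)   # one more bit of score
```

Doubling the query or database length doubles E, while each extra bit of score halves it.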
Tax BLAST
• Lineage of organism with strongest hit
• Shared ancestry in taxonomic tree
• Score of organism’s strongest hit
• Number of organism hits
BLAST2SEQ
This tool produces the alignment of two given sequences using the BLAST engine for local alignment.
• Type of program
• Scoring matrix
• Scoring scheme
• Gap model, Expect Value, advanced options
• Sequences
• GO!
Questions
You have two query sequences: query1 and query2:
>query1
CCGTCCGTCCGTCGTCCTCCTCGCTTGCGGGGCGCCGGGCCCGTCCTCGAGCCCCCNNNNNCCGTCCGGC
CGCGTCGGGGCCTCGCCGCGCTCTACCTACCTACCTGGTTGATCCTGCCAGTAGCATATGCTTGTCTCAA
AGATTAAGCCATGCATGTCTAAGTACGCACGGCCGGTACAGTGAAACTGCGAATGGCTCATTAAATCAGT
TATGGTTCCTTTGGTCGCTCGCTCCTCTCCTACTTGGATAACTGTGGTAATTCTAGAGCTAATACATGCC
GACGGGCGCTGACCCCCTTCGCGGGGGGGATGCGTGCATTTATCAGATCAAAACCAACCCGGTCAGCCCC
TCTCCGGCCCCGGCCGGGGGGCGGGCCGCGGCGGCTTTGGTGACTCTAGATAACCTCGGGCCGATCGCAC
GCCCCCCGTGGCGGCGACGACCCATTCGAACGTCTGCCCTATCAACTTTCGATGGTAGTCGCCGTGCCTA
CCATGGTGACCACGGGTGACGGGGAATCAGGGTTCGATTCCGGAGAGGGAGCCTGAGAAACGGCTACCAC
ATCCAAGGAAGGCAGCAGGCGCGCAAATTACCCACTCCCGACCCGGGGAGGTAGTGACGAAAAATAACAA
TACAGGACTCTTTCGAGGCCCTGTAATTGGAATGAGTCCACTTTAAATCCTTTAACGAGGATCCATTGGA
GGGCAAGTCTGGTGCCAGCAGCCGCGGTAATTCCAGCTCCAATAGCGTATATTAAAGTTGCTGCAGTTAA
AAAGCTCGTAGTTGGATCTTGGGAGCGGGCGGGCGGTCCGCCGCGAGGCGAGCCACCGCCCGTCCCCGCC
CCTTGCCTCTCGGCGCCCCCTCGATGCTCTTAGCTGAGTGTCCCGCGGGGCCCGAAGCGTTTACTTTGAA
AAAATTAGAGTGTTCAAAGCAGGCCCGAGCCGCCTGGATACCGCAGCTAGGAATAATGGAATAGGACCGC
GGTTCTATTTTGTTGGTTTTCGGAACTGAGGCCATGATTAAGAGGGACGGCCGGGGGCATTCGTATTGCG
CCGCTAGAGGTGAAATTCTTGGACCGGCGCAAGACGGACCAGAGCGAAAGCATTTGCCAAGAATGTTTTC
ATTAATCAAGAACGAAAGTCGGAGGTTCGAAGACGATCAGATACCGTCGTAGTTCCGACCATAAACGATG
CCGACCGGCGATGCGGCGGCGTTATTCCCATGACCCGCCGGGCAGCTTCCGGGAAACCAAAGTCTTTGGG
TTCCGGGGGGAGTATGGTTGCAAAGCTGAAACTTAAAGGAATTGACGGAAGGGCACCACCAGGAGTGGAG
CCTGCGGCTTAATTTGACTCAACACGGGAAACCTCACCCGGCCCGGACACGGACAGGATTGACAGATTGA
TAGCTCTTTCTCGATTCCGTGGGTGGTGGTGCATGGCCGTTCTTAGTTGGTGGAGCGATTTGTCTGGTTA
ATTCCGATAACGAACGAGACTCTGGCATGCTAACTAGTTACGCGACCCCCGAGCGGTCGGCGTCCCCCAA
CTTCTTAGAGGGACAAGTGGCGTTCAGCCACCCGAGATTGAGCAATAACAGGTCTGTGATGCCCTTAGAT
GTCCGGGGCTGCACGCGCGCTACACTGACTGGCTCAGCGTGTGCCTACCCTACGCCGGCAGGCGCGGGTA
ACCCGTTGAACCCCATTCGTGATGGGGATCGGGGATTGCAATTATTCCCCATGAACGAGGAATTCCCAGT
AAGTGCGGGTCATAAGCTTGCGTTGATTAAGTCCCTGCCCTTTGTACACACCGCCCGTCGCTACTACCGA
TTGGATGGTTTAGTGAGGCCCTCGGATCGGCCCCGCCGGGGTCGGCCCACGGCCTGGCGGAGCGCTGAGA
AGACGGTCGAA
>query2
TACGAACGCTGGCGGCATGCTAATACATGCAAGTCGAACGAGACCTTCGGGTCTAGTGGCGCACGGGTGG
CTAACGCGTGGGAATCTGCCCTTGGGTTCGGAATAACTTCGGGAAACTGAAGCTAATACCGGATGATGAC
GAAAGTCCAAAGATTTATCGCCCAGGGATGAGCCCGCGTAGGATTAGCTAGTTGGTGGGGTAAAGGCTCA
CCAAGGCAACGATCCTTAGCTGGTCTGAGAGGATGATCAGCCACACTGGGACTGAGACACGGCCCAGACT
CCTACGGGAGGCAGCAGTAGGGAATATTGGACAATGGGCGAAAGCCTGATCCAGCAATGCCGCGTGAGTG
ATGAAGGCCTTAGGGTTGTAAAGCTCTTTTACCCGAGATGATAATGACAGTATCGGGAGAATAAGCTCCG
GCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGAGCTAGCGTTGTTCGGAATTACTGGGCGTAAAG
CGCACGTAGGCGGCGATTTAAGTCAGAGGTGAAAGCCCGGGCTCAACCCCGAACTGCCTTTGAGACTGGA
TTGCTAGAATCTTGGAGAGGCGAGTGGAATTCCGAGTGTAGAGGTGAAATTCGTAGATATTCGGAAGAAC
ACCAGTGCGAAGGCGGCTCGCTGGACAAGTATTGACGCTGAGGTGCGAAAGCGTGGGGAGCAAACAGGAT
TAGATACCCTGGTAGTCCACGCCGTAAACGATGATAACTAGCTGCCGGGGCACATGGTGTTTCGGTGGCG
CACGTAACGCATTAAGTTATCCGCCTGGGGAGTACGGTCGCAAGATTAAAACTCAAAGGAATTGACGGGG
GCCTGCACAAGCGGTGGAGCATGTGGTTTAATTCGAAGCAACGCGCAGAACCTTACCAGCGTTTGACATC
CTCATCGCGGATTTCAGAGATGATTTCCTTCAGTTCGGCTGGATGAGTGACAGGTGCTGCATGGCTGTCG
TCAGCTCGTGTCGTGAGATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTCGCCTTTAGTTGCCAGCAT
TTAGTTGGGTACTCTAAAGGAACCGCCGGTGATAAGCCGGAGAAGGTGGGGATGACGTCAAGTCCTCATG
GCCCTTACGCGCTGGGCTACACACGTGCTACAATGGCGACTACAGTGGGCTGCAACCGTGCGAGCGGTAG
CTAATCTCCAAAAGTCGTCTCAGTTCGGATTGTTCTCTGCAACTCGAGAGCATGAAGGCGGAATCGCTAG
TAATCGCGGATCAGCATGCCGCGGTGAATACGTTCCCAGGCCTTGTACACACCGCCCGTCACACCATGGG
ATTTGGATTCACCCGAAGGCACTGCGTTAACCCGCAAGGGAGACAGGTGACCACGGTGGGTTTAGAGACT
GGGGTGAA
Using BLASTN
• Find out what each of these sequences codes for.
• To which organism is each sequence related?
• Do these sequences code for proteins?
• Suppose the information needed to answer the previous questions were not available to you. Could you suggest a way to answer them anyway?
• Answer: BLASTX
• Look carefully at the E value column of the first 50 results of each query. What can you learn about these sequences? Are these sequences generally conserved across other organisms?
5 last answers
• Use bl2seq to align the two query sequences. What can you say about the relation between them? Based on what you found above, does this last result make sense?
You have two more query sequences: query3 and query4:
>query3
ATGTCTGCTCCACAAGCCAAGATTTTGTCTCAAGCTCCAACTGAATTGGAATTACAAGTT
GCTCAAGCTTTCGTTGAATTGGAAAATTCTTCTCCAGAATTGAAAGCTGAGTTGAGACCT
TTGCAATTCAAGTCCATCAGAGAAGT
>query4
GTATGTTATTAATTTGAATCTAAACTTAAGAATAATGGAGAGTAACAAAGGAAAAAAGTG
TGAACGGGACGATACCAGAATGTTTCAATCTAGAAAAGTATAAAAGATAAGGACTAGGAC
TCAAATGTATTTGGCTGACTATCGCCTGAACCTTGATGCTAAGCAAATACCATATCTTCA
AGAAAAAGCCTACTCCAGTGTTTAAGAAGAAGGGAACGATTTACTAGATCATGCTATACG
CAGTAAGGTTCTGATAGTTAATTACAATCGGTCCAAGTTCTAAGCGGTGTCGTCCATGCA
TATATCATTTACAAGTTACTGGCGTCAACTCTTCAAATATTCAAAATATCACCTAATCAA
ACTTACTAACATTTTCCTTTTTTGTTTTCCTTCTTTTATAG
Now use BLASTX
• What protein does each of these sequences code for?
• Are these proteins conserved in other organisms?
Query   Result
3       A conserved protein component of the small (40S) ribosomal subunit of S. cerevisiae.
4       No protein (E value 3.2)
• You are told that the sequences were extracted from the same gene. How could you explain the above results?
• Answer: query4 is extracted from a non-coding region (intron) and thus doesn’t code for any protein.
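The intron explanation can be cross-checked directly on query4 without BLAST. The sketch below looks for three textbook signals of nuclear introns in S. cerevisiae: a GT at the 5' end, an AG at the 3' end, and the TACTAAC branch-point motif. The helper name `intron_signals` is invented for this example, and finding these signals is suggestive rather than proof.

```python
# query4, copied from the question above.
QUERY4 = (
    "GTATGTTATTAATTTGAATCTAAACTTAAGAATAATGGAGAGTAACAAAGGAAAAAAGTG"
    "TGAACGGGACGATACCAGAATGTTTCAATCTAGAAAAGTATAAAAGATAAGGACTAGGAC"
    "TCAAATGTATTTGGCTGACTATCGCCTGAACCTTGATGCTAAGCAAATACCATATCTTCA"
    "AGAAAAAGCCTACTCCAGTGTTTAAGAAGAAGGGAACGATTTACTAGATCATGCTATACG"
    "CAGTAAGGTTCTGATAGTTAATTACAATCGGTCCAAGTTCTAAGCGGTGTCGTCCATGCA"
    "TATATCATTTACAAGTTACTGGCGTCAACTCTTCAAATATTCAAAATATCACCTAATCAA"
    "ACTTACTAACATTTTCCTTTTTTGTTTTCCTTCTTTTATAG"
)


def intron_signals(seq):
    """Report simple sequence signals that are typical of a spliceosomal intron."""
    seq = seq.upper()
    return {
        "starts_with_GT": seq.startswith("GT"),
        "ends_with_AG": seq.endswith("AG"),
        "has_TACTAAC": "TACTAAC" in seq,
    }


if __name__ == "__main__":
    print(intron_signals(QUERY4))
```

Applied to query3, which begins with the start codon ATG, the same check does not report a GT at the 5' end.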