Slide 1 - Tel Aviv University

Workshop OUTLINE
Part 1:
• Introduction and motivation
• How does BLAST work?
Part 2:
• BLAST programs
• Sequence databases
• Work Steps
• Extract and analyze results
BLAST programs
• All types of searches are possible
Query:
DNA
Protein
Database:
DNA
Protein
blastn – nuc vs. nuc
blastp – prot vs. prot
blastx – translated query vs. protein database
tblastn – protein vs. translated nuc. DB
tblastx – translated query vs. translated database
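The query/database combinations above can be written down as a small lookup table. The Python sketch below is only an illustration: the function name `choose_programs` and the type labels "dna" and "protein" are hypothetical and not part of BLAST itself.

```python
# Which BLAST program fits a given query type and database type.
# Keys are (query, database); values list every program from the slide
# that accepts that combination.
PROGRAMS = {
    ("dna", "dna"): ["blastn", "tblastx"],  # tblastx translates both sides
    ("protein", "protein"): ["blastp"],
    ("dna", "protein"): ["blastx"],         # the query is translated
    ("protein", "dna"): ["tblastn"],        # the database is translated
}


def choose_programs(query_type: str, db_type: str) -> list[str]:
    """Return the BLAST programs usable for this query/database pair."""
    key = (query_type.lower(), db_type.lower())
    if key not in PROGRAMS:
        raise ValueError(f"unknown sequence types: {key}")
    return PROGRAMS[key]


if __name__ == "__main__":
    for query in ("dna", "protein"):
        for database in ("dna", "protein"):
            print(query, "vs", database, "->", choose_programs(query, database))
```

Note that a DNA query against a DNA database has two options: blastn compares the nucleotides directly, while tblastx compares the translations of both.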
BLAST programs
Amino acid sequence –
most suitable for homology search
• The database and the query can be either nucleotides or
amino acids!
• We prefer amino acid sequences:
- Amino acid sequences are more conserved.
- 20-letter alphabet: two random sequences share 5% identity on
average (compared with 25% for DNA sequences).
- Protein comparison matrices are more sensitive.
- Protein databases are smaller, so there are fewer random hits.
- We want to draw conclusions about the structure, for which proteins are
much more relevant.
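The 5% and 25% figures follow from the alphabet size alone if every letter is assumed to be equally likely: two random letters match with probability 1/20 for amino acids and 1/4 for nucleotides. The sketch below checks this with a small simulation. The uniform-frequency assumption is a simplification (real residue frequencies are not uniform), and the function names are illustrative.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
NUCLEOTIDES = "ACGT"


def expected_identity(alphabet: str) -> float:
    """Chance that two random letters match, assuming uniform frequencies."""
    return 1.0 / len(set(alphabet))


def simulated_identity(alphabet: str, length: int = 100_000, seed: int = 0) -> float:
    """Fraction of matching positions between two random sequences."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    first = rng.choices(alphabet, k=length)
    second = rng.choices(alphabet, k=length)
    matches = sum(a == b for a, b in zip(first, second))
    return matches / length


if __name__ == "__main__":
    for name, alphabet in (("protein", AMINO_ACIDS), ("DNA", NUCLEOTIDES)):
        print(name, expected_identity(alphabet), round(simulated_identity(alphabet), 3))
```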
General Issues
• Where? (to find homologues)
• Structural templates- search against the PDB
• Sequence homologues- search against SwissProt or
UniProt (recommended!)
• How many?
• As many as possible, as long as the MSA looks good
(next week…)
General Issues
• How long? (length of homologues)
• Fragments- short homologues (shorter than 50-60% of the
query’s length) = bad alignment
• Ensure your sequences contain the wanted domain(s)
• The N- and C-termini tend to vary in length between homologues
• How close? (distance from query sequence)
• All too close- no information
• Too many too far- bad alignment
• Ensure that you have a balanced collection!
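One way to check whether a collection is balanced is to count how many homologues fall into each range of percent identity to the query. The sketch below does this; the bin edges (30, 50, 70, 90) are arbitrary assumptions chosen for illustration, not values from the workshop, and the identities are taken as given.

```python
from collections import Counter

# Bin edges in percent identity to the query. These cut-offs are
# illustrative assumptions, not recommendations from the workshop.
EDGES = (30, 50, 70, 90)


def bin_label(pct: float, edges=EDGES) -> str:
    """Name the identity range that a percent-identity value falls into."""
    lower = 0
    for edge in edges:
        if pct < edge:
            return f"{lower}-{edge}%"
        lower = edge
    return f"{lower}-100%"


def identity_histogram(identities, edges=EDGES) -> dict:
    """Count homologues per percent-identity range."""
    return dict(Counter(bin_label(p, edges) for p in identities))


if __name__ == "__main__":
    # Hypothetical percent identities of collected homologues to the query.
    collected = [98, 97, 96, 95, 94, 62, 41]
    print(identity_histogram(collected))
```

A histogram dominated by the top range suggests the "all too close" case, while one dominated by the lowest range suggests "too many too far".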
General Issues
• From whom? (which species the sequence belongs to)
• Don’t care, all homologues are welcome
• Orthologues/paralogues may be helpful
• Sequences from distant/close species provide different
types of information
• Which method? (BLAST/PSI-BLAST)
• Depends on the protein, available homologues, the goal
in mind…
Sequence databases
Where do we want to search?
DNA sequences
• ESTs- expressed sequence tags with no annotated coding sequences. The largest pool of sequence
data for many organisms (NCBI)
• NR- All GenBank + EMBL + DDBJ + PDB sequences. No longer "non-redundant" due to computational cost.
• Genomes of specific organisms
• RefSeq- mRNA or genomic- an annotated collection from the NCBI
Reference Sequence Project.
• EMBL- Europe's primary nucleotide sequence resource (EBI)
• ….
Sequence databases
Where do we want to search?
Protein databases:
• PDB- the sequences of proteins for which structures are available
• NR (non-redundant)- Non-redundant GenBank CDS translations + PDB +
SwissProt + PIR + PRF, excluding those in env_nr
• RefSeq- sequences from NCBI Reference Sequence project.
• Proteins of a specific organism
• UniProt – SwissProt or TrEMBL
• ….
Sequence databases
Where do we want to search?
UniProt
• UniProt is a collaboration between the
European Bioinformatics Institute (EBI), the
Swiss Institute of Bioinformatics (SIB) and
the Protein Information Resource (PIR).
• In 2002, the three institutes decided to pool
their resources and expertise and formed the
UniProt Consortium.
Sequence databases
Where do we want to search?
UniProt
• The world's most comprehensive catalog of information on
proteins- Sequence, function & more…
• Comprised mainly of the databases:
– SwissProt – 366226 last year, 412525 protein entries now –
high quality annotation, non-redundant & cross-referenced to many
other databases.
– TrEMBL - 5708298 last year, 7341751 protein entries now –
computer translation of the genetic information from the EMBL
Nucleotide Sequence Database; many proteins are poorly
annotated since only automatic annotation is generated
Overall work steps
1. Run the search
   1. Select database
   2. E-value threshold
   3. BLAST or PSI-BLAST- how many rounds?
2. Take out sequences
   1. HSP or full sequences
   2. Can (should!) filter out redundant sequences and sequences that
   are too short (fragments)
3. Usually- align sequences- choose alignment program
4. View alignment with BioEdit or another program
5. Calculate trees, conservation scores (ConSeq) etc…
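The workshop runs the searches on web servers, but step 1 (database, E-value threshold, number of PSI-BLAST rounds) can also be sketched as command lines for a local installation of the NCBI BLAST+ suite. The snippet below only builds and prints the commands; it does not run them. It assumes BLAST+ is installed and that a formatted protein database is available; the database name `swissprot`, the file names and the default values are placeholders.

```python
import shlex


def blastp_command(query, db, evalue=1e-3, max_hits=500, out="blast_hits.tsv"):
    """Build a blastp command line (BLAST+); tabular output, not executed."""
    return [
        "blastp",
        "-query", query,                     # FASTA file with the query protein
        "-db", db,                           # step 1.1: database
        "-evalue", str(evalue),              # step 1.2: E-value threshold
        "-max_target_seqs", str(max_hits),   # how many hits to report
        "-outfmt", "6",                      # tab-separated hit table
        "-out", out,
    ]


def psiblast_command(query, db, rounds=3, evalue=1e-3,
                     inclusion=0.002, out="psiblast_hits.tsv"):
    """Build a psiblast command line (BLAST+); not executed."""
    return [
        "psiblast",
        "-query", query,
        "-db", db,
        "-num_iterations", str(rounds),         # step 1.3: number of rounds
        "-evalue", str(evalue),
        "-inclusion_ethresh", str(inclusion),   # threshold for inclusion in the PSSM
        "-outfmt", "6",
        "-out", out,
    ]


if __name__ == "__main__":
    print(shlex.join(blastp_command("query.fasta", "swissprot")))
    print(shlex.join(psiblast_command("query.fasta", "swissprot", rounds=3)))
```

The printed lines can be pasted into a shell or passed to `subprocess.run` once the placeholders are replaced with real files.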
Overall work steps
Multiple Sequence Alignment (MSA)
• Perform alignment of a large collection of sequences
• Many algorithms, leading ones:
1. ClustalW
2. MUSCLE
3. T-COFFEE
Overall work steps
Examining BAliBASE 2005…
MUSCLE is superior!
Edgar, R.C., 2004
BLAST NCBI
BLAST NCBI
The well-known server
http://blast.ncbi.nlm.nih.gov/Blast.cgi
• All program types
• Many databases to choose from, both nucleotide and
protein
• 12 genome-specific databases
• Can also look for conserved domain, SNPs and more…
BLAST NCBI
http://www.ncbi.nlm.nih.gov/blast/Blast.cgi
BLASTp
BLAST NCBI
http://www.ncbi.nlm.nih.gov/blast/Blast.cgi
Query
Sequence
Database
Run
BLASTp
BLAST NCBI
As many as possible
Evalue
Matrix
BLAST NCBI
http://www.ncbi.nlm.nih.gov/blast/Blast.cgi
Mark all
Mark only
wanted
BLAST NCBI
http://www.ncbi.nlm.nih.gov/blast/Blast.cgi
BLAST EBI
BLAST EBI
http://www.ebi.ac.uk/blastall/index.html
Many databases,
including UniProt
Get maximum
number of alignments!
Insert
sequence
RUN
BLAST EBI
http://www.ebi.ac.uk/blastall/index.html
Mark all or
wanted
Send sequences
to ClustalW
Get sequences
PSI-BLAST
PSI-BLAST NCBI
http://www.ncbi.nlm.nih.gov/blast/Blast.cgi
Query
Sequence
Database
Run
PSI-BLAST
PSI-BLAST NCBI
http://www.ncbi.nlm.nih.gov/blast/Blast.cgi
Pre-calculated PSSM
Threshold for inclusion
in PSSM
PSI-BLAST NCBI
http://www.ncbi.nlm.nih.gov/blast/Blast.cgi
Run next round
Not found in
previous round
Include sequence
in the PSSM
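To make the role of the PSSM more concrete, the sketch below builds a position-specific scoring matrix from a set of already aligned sequences and uses it to score a candidate segment. It is a deliberately simplified stand-in: it assumes a uniform background, a fixed pseudocount and no sequence weighting, whereas PSI-BLAST itself uses a more elaborate procedure. All names are illustrative.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have equal length")
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    pssm = []
    for col in range(length):
        # Gaps and unknown letters are ignored when counting.
        counts = Counter(seq[col] for seq in aligned if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score_segment(pssm, segment):
    """Sum the column scores of an ungapped segment of the same length."""
    return sum(column[aa] for column, aa in zip(pssm, segment))


if __name__ == "__main__":
    included = ["MKV", "MKL", "MRV"]  # hypothetical sequences included in the PSSM
    pssm = build_pssm(included)
    print(round(score_segment(pssm, "MKV"), 2), round(score_segment(pssm, "AAA"), 2))
```

Each sequence that is marked for inclusion changes the column counts, which is why the choice of included sequences and the inclusion threshold shape the next round.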
PSI-BLAST EBI
http://www.ebi.ac.uk/blastpgp/
Database
Number of
iterations
Query
Sequence
Run
(PSI-)BLAST on ConSeq, extract
sequence & align
PSI-BLAST on ConSeq
The ConSeq webserver
• Calculates evolutionary conservation scores that
are then displayed on the sequence.
• Requires a Multiple Sequence Alignment (MSA)- if
not provided, can create one automatically
• Runs (PSI-)BLAST, extracts hits from the BLAST
results, filters according to e-value and aligns the
sequences.
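As a rough illustration of what a per-position conservation score is, the sketch below scores each MSA column by Shannon entropy. This is a much simpler measure than the one ConSeq computes, which takes the evolutionary relationships between the sequences into account; it is shown here only as an assumption-laden stand-in, and the function name is illustrative.

```python
import math
from collections import Counter


def column_conservation(msa):
    """Score each column: 1.0 = fully conserved, lower = more variable."""
    length = len(msa[0])
    if any(len(seq) != length for seq in msa):
        raise ValueError("aligned sequences must have equal length")
    max_entropy = math.log2(20)  # 20 equally frequent amino acids
    scores = []
    for column in zip(*msa):
        residues = [c for c in column if c not in "-."]  # ignore gaps
        if not residues:
            scores.append(0.0)  # an all-gap column carries no information
            continue
        n = len(residues)
        entropy = -sum((k / n) * math.log2(k / n) for k in Counter(residues).values())
        scores.append(1.0 - entropy / max_entropy)
    return scores


if __name__ == "__main__":
    msa = ["MKV", "MKL", "MRV"]  # hypothetical three-column alignment
    print([round(s, 2) for s in column_conservation(msa)])
```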
PSI-BLAST on ConSeq
The ConSeq webserver-http://conseq.tau.ac.il/
Query
sequence
Email
PSI-BLAST on ConSeq
The ConSeq webserver-http://conseq.tau.ac.il/
Alignment
algorithm
Database- SwissProt
or UniProt
No. of
homologues
Iterations
E-value
PSI-BLAST on ConSeq
The ConSeq webserver-http://conseq.tau.ac.il/
All BLAST hits
MSA
NCBI vs. EBI vs. ConSeq
Summary of web servers:
1. PSI-BLAST at NCBI
- Can control PSSM, included sequences & threshold
- All types of BLAST programs
- Not against the full UniProt; against SwissProt or NR
- Against RefSeq and NT
- Full sequences downloaded like BLAST
- Number of sequences up to 2000
NCBI vs. EBI vs. ConSeq
Summary of web servers:
2. BLAST at EBI –
- Against UniProt or EMBL, not NR or specific genomes
- Can’t control PSSM- just get last round
- Download and align only full sequences
- The number of presented sequences is limited to 500
- blastN, blastP, tblastN, tblastX
NCBI vs. EBI vs. ConSeq
Summary of web servers:
3. BLAST at ConSeq –
• Get HSPs, not entire sequences!!!
• Only blastP
• Search uniprot/swissprot
• Still, can’t control all options… such as redundancy and
minimal length of HSP
(PSI-)BLAST via Max-Planck
(PSI-)BLAST via Max-Planck
Run (PSI-) BLAST
Send HSP or full
sequences to an
alignment program
Forward HSP to filtration
via “BLAMMER”
Download filtered
sequences
Align the sequences via
program of choice
(PSI-)BLAST via Max-Planck
BLAST at Max-Planck
http://toolkit.tuebingen.mpg.de/sections/search
• Databases- SwissProt, TrEMBL, NR, env, PDB or any
combination for proteins, but only NT for DNA.
• All BLAST programs
• Main advantage- you can easily extract and filter the
HSPs, on top of full sequences.
The Query Protein
Name: Dihydrodipicolinate reductase
Enzyme reaction:
Molecular process: Lysine biosynthesis (early stages)
Organism: E. coli
Sequence length: 273 aa
The Query Protein
Query:
DAPB_ECOLI
>DAPB_ECOLI
MHDANIRVAIAGAGGRMGRQLIQAALALEGVQLGAALEREGSSLLGSDAGELAGAG
KTGVTVQSSLDAVKDDFDVFIDFTRPEGTLNHLAFCRQHGKGMVIGTTGFDEAGKQ
AIRDAAADIAIVFAANFSVGVNVMLKLLEKAAKVMGDYTDIEIIEAHHRHKVDAPSGTA
LAMGEAIAHALDKDLKDCAVYSREGHTGERVPGTIGFATVRAGDIVGEHTAMFADIGE
RLEITHKASSRMTFANGAVRSALWLSGKESGLFDMRDVLDLNNL
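Before submitting a query it is worth checking that the pasted sequence is complete. The sketch below parses the query in FASTA format and reports its length, which should agree with the 273 aa stated for the protein. It assumes the header line uses the standard `>` marker; `parse_fasta` is an illustrative helper, not a library function.

```python
QUERY_FASTA = """>DAPB_ECOLI
MHDANIRVAIAGAGGRMGRQLIQAALALEGVQLGAALEREGSSLLGSDAGELAGAG
KTGVTVQSSLDAVKDDFDVFIDFTRPEGTLNHLAFCRQHGKGMVIGTTGFDEAGKQ
AIRDAAADIAIVFAANFSVGVNVMLKLLEKAAKVMGDYTDIEIIEAHHRHKVDAPSGTA
LAMGEAIAHALDKDLKDCAVYSREGHTGERVPGTIGFATVRAGDIVGEHTAMFADIGE
RLEITHKASSRMTFANGAVRSALWLSGKESGLFDMRDVLDLNNL
"""


def parse_fasta(text):
    """Return {name: sequence} for every record in a FASTA string."""
    parts_by_name = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # first word of the header is the name
            parts_by_name[name] = []
        elif name is None:
            raise ValueError("sequence data found before the first header")
        else:
            parts_by_name[name].append(line)
    return {n: "".join(parts) for n, parts in parts_by_name.items()}


if __name__ == "__main__":
    for name, sequence in parse_fasta(QUERY_FASTA).items():
        print(name, len(sequence), "aa")
```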
(PSI-)BLAST via Max-Planck
http://toolkit.tuebingen.mpg.de/psi_blast/
Upload sequence
or MSA
Choose database or
databases (selecting a
few using CTRL)
(PSI-)BLAST via Max-Planck
Save PSI-BLAST result
(PSI-)BLAST via Max-Planck
E-value threshold can be assessed using the distribution
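One simple way to look at the distribution is to count hits per order of magnitude of E-value; a gap between a cluster of strong hits and the weaker ones suggests where to place the threshold. The sketch below is a minimal text histogram. The E-values are made up for illustration, and clamping E-values reported as 0 to a tiny positive number is an assumption made so that the logarithm is defined.

```python
import math
from collections import Counter


def evalue_histogram(evalues, smallest=1e-200):
    """Return sorted (exponent, count) pairs, one per order of magnitude."""
    bins = Counter()
    for evalue in evalues:
        # E-values reported as 0 are clamped so that log10 is defined.
        exponent = math.floor(math.log10(max(evalue, smallest)))
        bins[exponent] += 1
    return sorted(bins.items())


if __name__ == "__main__":
    # Hypothetical E-values from a search result table.
    hits = [3e-50, 5e-50, 7e-48, 2e-10, 0.5, 4.0]
    for exponent, count in evalue_histogram(hits):
        print(f"1e{exponent:>4} {'#' * count}")
```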
Filter Results via Max-Planck
Forward results
to BLAMMER
Filter Results via Max-Planck
BLAMMER
http://toolkit.tuebingen.mpg.de/blammer/
• BLAMMER is meant to create MSAs from BLAST results; we will use it
just to filter the results and then align them via MUSCLE or
another known MSA program.
• Filter according to:
• E-value
• Min. coverage- min. percent of the query protein
• Max. redundancy- remove sequences that are too similar
• Max. number of homologues- if wanted
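The four filtering criteria can be sketched in a few lines of Python. This is not BLAMMER's implementation, only an illustration of the idea under stated assumptions: each hit is represented by its residues placed in query coordinates (`row`, with `-` where nothing aligns), redundancy is measured as identity over positions both hits cover, and all default thresholds are placeholders.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    name: str
    evalue: float
    row: str  # hit residues in query coordinates, '-' where nothing aligns


def coverage(hit: Hit) -> float:
    """Fraction of the query covered by this hit."""
    return sum(c != "-" for c in hit.row) / len(hit.row)


def pair_identity(a: str, b: str) -> float:
    """Identity over the positions that both rows cover."""
    shared = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not shared:
        return 0.0
    return sum(x == y for x, y in shared) / len(shared)


def filter_hits(hits, max_evalue=1e-3, min_coverage=0.6,
                max_identity=0.95, max_hits=None):
    """Keep significant, long enough, non-redundant hits, best E-value first."""
    kept = []
    for hit in sorted(hits, key=lambda h: h.evalue):
        if hit.evalue > max_evalue or coverage(hit) < min_coverage:
            continue
        if any(pair_identity(hit.row, k.row) > max_identity for k in kept):
            continue  # too similar to a hit that was already kept
        kept.append(hit)
        if max_hits is not None and len(kept) >= max_hits:
            break
    return kept


if __name__ == "__main__":
    hits = [
        Hit("a", 1e-40, "MKVLAAGIVG"),
        Hit("b", 1e-35, "MKVLAAGIVG"),  # identical to "a": redundant
        Hit("c", 1e-20, "MRVLSAGLVG"),
        Hit("d", 1e-30, "MKV-------"),  # fragment: low coverage
        Hit("e", 5.0, "MKILAAGIVG"),    # not significant
    ]
    print([h.name for h in filter_hits(hits)])
```

Processing the hits in order of E-value means that, of two redundant sequences, the one with the better E-value is the one that survives.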
Filter Results via Max-Planck
http://toolkit.tuebingen.mpg.de/blammer
Forwarded PSI-BLAST result
Filtering
parameters
Filter Results via Max-Planck
Save & then
re-align!
Align the BLAST sequences
Align via Max-Planck
http://toolkit.tuebingen.mpg.de/sections/alignment
Align via Max-Planck
1. Forward BLAST to MUSCLE, MAFFT etc...
Choose program
Use hits or full
sequences
Align via Max-Planck
2. Filter via BLAMMER and then ALIGN:
Upload the results
of the
BLAMMER –
downloaded file
Align via Max-Planck
Alignment results:
Save the alignment
Alignment viewing & editing
BioEdit
• http://www.mbio.ncsu.edu/BioEdit/BioEdit.html
• Easy-to-use sequence alignment editor
• View and manipulate alignments of up to 20,000 sequences.
• Four modes of manual alignment: select and slide, dynamic grab
and drag, gap insert and delete by mouse click, and on-screen
typing which behaves like a text editor.
• Reads and writes Genbank, Fasta, Phylip 3.2, Phylip 4, and
NBRF/PIR formats. Also reads GCG and Clustal formats
Alignment viewing & editing
Easiest using BioEdit
http://www.mbio.ncsu.edu/BioEdit/bioedit.html
Alignment viewing & editing
Easiest using BioEdit
• Find a specific sequence: “Edit-> search -> in titles”
• Erase\add sequences: “Edit-> cut\paste\delete sequence”
• “Sequence Identity matrix” under “Alignment” is useful for a rough evaluation of distances within the alignment.
• After taking out sequences, “Minimize Alignment” under
“Alignment” removes unnecessary gaps.
• Can save an image using:
“File -> Graphic View” & then “Edit -> Copy page as BITMAP”
http://www.mbio.ncsu.edu/BioEdit/bioedit.html
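The two alignment operations mentioned above, an identity matrix and the removal of gap-only columns after sequences are taken out, are easy to reproduce outside the editor. The sketch below is an illustration under stated assumptions: identity is computed over columns where neither sequence has a gap, which may differ from how BioEdit counts gaps, and the function names are made up.

```python
def pairwise_identity(a: str, b: str) -> float:
    """Identity over alignment columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)


def identity_matrix(msa):
    """Square matrix of pairwise identities between aligned sequences."""
    return [[pairwise_identity(a, b) for b in msa] for a in msa]


def minimize_alignment(msa):
    """Drop columns that contain only gaps in every sequence."""
    keep = [i for i, column in enumerate(zip(*msa))
            if any(c != "-" for c in column)]
    return ["".join(seq[i] for i in keep) for seq in msa]


if __name__ == "__main__":
    # Hypothetical alignment left over after a sequence was deleted:
    # the third column now contains only gaps.
    msa = ["MK-VL", "MR-VL", "MK-IL"]
    compact = minimize_alignment(msa)
    print(compact)
    for row in identity_matrix(compact):
        print([round(value, 2) for value in row])
```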
No “Miracle solution”
Each sequence is a different story, so
adjust parameters:
• BLAST- E-value, substitution matrix, gap penalties,
database, minimum length, redundancy level, fragment
overlap…
• PSI-BLAST- BLAST parameters + PSSM inclusion
threshold (or choose manually), number of rounds…
• Try using HSP or full sequences, different MSA programs…
THANKS
Some slides were taken from previous
presentations by members of the Pupko
lab and Prof. Beni Chor