Slide 1 - Tel Aviv University

Workshop OUTLINE

Part 1:
• Introduction and motivation
• How does BLAST work?
Part 2:
• BLAST programs
• Sequence databases
• Work steps
• Extract and analyze results

BLAST programs
• All types of searches are possible: the query and the database can each be DNA or protein.
• blastn: nucleotide query vs. nucleotide database
• blastp: protein query vs. protein database
• blastx: translated nucleotide query vs. protein database
• tblastn: protein query vs. translated nucleotide database
• tblastx: translated nucleotide query vs. translated nucleotide database

BLAST programs
Amino acid sequence: most suitable for homology search
• The database and the query can be either nucleotides or amino acids!
• We prefer amino acid sequences:
  - Amino acid sequences are more conserved.
  - 20-letter alphabet: two random sequences share 5% identity on average (compared with 25% for DNA sequences).
  - Protein comparison matrices are more sensitive.
  - Protein databases are smaller, so there are fewer random hits.
  - We want to draw conclusions about structure, and proteins are much more relevant for that.

General Issues
• Where? (to find homologues)
  • Structural templates: search against the PDB
  • Sequence homologues: search against SwissProt or UniProt (recommended!)
• How many?
  • As many as possible, as long as the MSA looks good (next week…)

General Issues
• How long? (length of homologues)
  • Fragments: short homologues (less than 50-60% of the query's length) lead to a bad alignment
  • Ensure your sequences contain the wanted domain(s)
  • The N- and C-termini tend to vary in length between homologues
• How close? (distance from the query sequence)
  • All too close: no information
  • Too many too far: bad alignment
  • Ensure that you have a balanced collection!

General Issues
• From whom? (which species the sequence belongs to)
  • Doesn't matter, all homologues are welcome
  • Orthologues/paralogues may be helpful
  • Sequences from distant/close species provide different types of information
• Which method? (BLAST/PSI-BLAST)
  • Depends on the protein, the available homologues, the goal in mind…

Sequence databases
Where do we want to search?
DNA sequences:
• ESTs: expressed sequence tags, with no annotated coding sequences; the largest pool of sequence data for many organisms (NCBI)
• NR: all GenBank + EMBL + DDBJ + PDB sequences. No longer "non-redundant" due to computational cost.
• Genomes of specific organisms
• RefSeq: mRNA or genomic sequences; an annotated collection from the NCBI Reference Sequence Project
• EMBL: Europe's primary nucleotide sequence resource (EBI)
• …

Sequence databases
Where do we want to search?
Protein databases:
• PDB: the sequences of proteins for which structures are available
• NR (non-redundant): non-redundant GenBank CDS translations + PDB + SwissProt + PIR + PRF, excluding those in env_nr
• RefSeq: sequences from the NCBI Reference Sequence Project
• Proteins of specific organisms
• UniProt: SwissProt or TrEMBL
• …

Sequence databases
Where do we want to search? UniProt
• UniProt is a collaboration between the European Bioinformatics Institute (EBI), the Swiss Institute of Bioinformatics (SIB) and the Protein Information Resource (PIR).
• In 2002, the three institutes decided to pool their resources and expertise and formed the UniProt Consortium.

Sequence databases
Where do we want to search? UniProt
• The world's most comprehensive catalog of information on proteins: sequence, function & more…
• Comprised mainly of two databases:
  - SwissProt: 366226 protein entries last year, 412525 now. High-quality annotation, non-redundant & cross-referenced to many other databases.
  - TrEMBL: 5708298 protein entries last year, 7341751 now. Computer translation of the genetic information from the EMBL Nucleotide Sequence Database, so many proteins are poorly annotated, since only automatic annotation is generated.

Overall work steps
1. Run the search
   1. Select database
   2. E-value threshold
   3. BLAST or PSI-BLAST: how many rounds?
2. Take out sequences
   1. HSPs or full sequences
   2. Can (should!) filter out redundant sequences and sequences that are too short (fragments)
3. Usually: align the sequences (choose an alignment program)
4. View the alignment with BioEdit or another program
5. Calculate trees, conservation scores (ConSeq) etc…

Overall work steps
Multiple Sequence Alignment (MSA)
• Perform alignment of a large collection of sequences
• Many algorithms; the leading ones:
  1. ClustalW
  2. MUSCLE
  3. T-COFFEE

Overall work steps
Examining BaliBase 2005… MUSCLE is superior! (Edgar, R.C., 2004)

BLAST NCBI
The well-known server: http://blast.ncbi.nlm.nih.gov/Blast.cgi
• All program types
• Many databases to choose from, both nucleotide and protein
• 12 genome-specific databases
• Can also look for conserved domains, SNPs and more…

BLAST NCBI (screenshots, http://www.ncbi.nlm.nih.gov/blast/Blast.cgi)
• Choose BLASTp
• Enter the query sequence, choose the database, run BLASTp
• Options: number of sequences (as many as possible), E-value, matrix
• Results: mark all hits, or mark only the wanted ones

BLAST EBI (screenshots, http://www.ebi.ac.uk/blastall/index.html)
• Many databases, including UniProt
• Get the maximum number of alignments!
• Insert the sequence and run
• Results: mark all hits or only the wanted ones, then send the sequences to ClustalW or get the sequences

PSI-BLAST NCBI (screenshots, http://www.ncbi.nlm.nih.gov/blast/Blast.cgi)
• Enter the query sequence, choose the database, run PSI-BLAST
• Options: pre-calculated PSSM, threshold for inclusion in the PSSM
• Results: hits not found in the previous round are marked; choose which sequences to include in the PSSM, then run the next round

PSI-BLAST EBI (screenshot, http://www.ebi.ac.uk/blastpgp/)
• Database, number of iterations, query sequence

Run (PSI-)BLAST on ConSeq, extract sequences & align

PSI-BLAST on ConSeq
The ConSeq webserver
• Calculates evolutionary conservation scores that are then displayed on the sequence.
• Requires a Multiple Sequence Alignment (MSA); if not provided, it can create one automatically.
• Runs (PSI-)BLAST, extracts hits from the BLAST results, filters according to E-value and aligns the sequences.

PSI-BLAST on ConSeq (screenshots, the ConSeq webserver: http://conseq.tau.ac.il/)
• Input: query sequence, email
• Options: alignment algorithm, database (SwissProt or UniProt), number of homologues, iterations, E-value
• Output: all BLAST hits, MSA

NCBI vs. EBI vs. ConSeq
Summary of web servers:
1. PSI-BLAST at NCBI
  - Can control the PSSM: included sequences & threshold
  - All types of BLAST programs
  - Cannot search UniProt; protein searches run against SwissProt or NR
  - Against RefSeq and NT
  - Full sequences are downloaded, as in BLAST
  - Number of sequences: up to 2000

NCBI vs. EBI vs. ConSeq
Summary of web servers:
2. BLAST at EBI
  - Against UniProt or EMBL, not NR or specific genomes
  - Can't control the PSSM: you just get the last round
  - Download and align only full sequences
  - The number of presented sequences is limited to 500
  - blastN, blastP, tblastN, tblastX

NCBI vs. EBI vs. ConSeq
Summary of web servers:
3. BLAST at ConSeq
  • Get HSPs, not entire sequences!!!
  • Only blastP
  • Searches UniProt/SwissProt
  • Still, can't control all options, such as redundancy and minimal length of HSP

(PSI-)BLAST via Max-Planck
Workflow:
• Run (PSI-)BLAST
• Either send the HSPs or full sequences directly to an alignment program,
• or forward the HSPs to filtration via "BLAMMER", download the filtered sequences, and align them with the program of choice

(PSI-)BLAST via Max-Planck
BLAST at Max-Planck: http://toolkit.tuebingen.mpg.de/sections/search
• Databases: SwissProt, TrEMBL, NR, env, PDB or any combination for proteins, but only NT for DNA.
• All BLAST programs
• Main advantage: you can easily extract and filter the HSPs, in addition to the full sequences.

The Query Protein
Name: Dihydrodipicolinate reductase
Enzyme reaction: (figure)
Molecular process: Lysine biosynthesis (early stages)
Organism: E. coli
Sequence length: 273 aa

The Query Protein
Query: DAPB_ECOLI
>DAPB_ECOLI
MHDANIRVAIAGAGGRMGRQLIQAALALEGVQLGAALEREGSSLLGSDAGELAGAG
KTGVTVQSSLDAVKDDFDVFIDFTRPEGTLNHLAFCRQHGKGMVIGTTGFDEAGKQ
AIRDAAADIAIVFAANFSVGVNVMLKLLEKAAKVMGDYTDIEIIEAHHRHKVDAPSGTA
LAMGEAIAHALDKDLKDCAVYSREGHTGERVPGTIGFATVRAGDIVGEHTAMFADIGE
RLEITHKASSRMTFANGAVRSALWLSGKESGLFDMRDVLDLNNL

(PSI-)BLAST via Max-Planck (screenshots, http://toolkit.tuebingen.mpg.de/psi_blast/)
• Upload a sequence or an MSA
• Choose the database or databases (select several using CTRL)
• Save the PSI-BLAST result
• The E-value threshold can be assessed using the distribution of E-values in the results

Filter Results via Max-Planck
Forward the results to BLAMMER

Filter Results via Max-Planck
BLAMMER: http://toolkit.tuebingen.mpg.de/blammer/
• BLAMMER is meant to create MSAs from BLAST results; we will use it only to filter the results, and then align them with MUSCLE or another well-known MSA program.
• Filter according to:
  • E-value
  • Min. coverage: minimum percentage of the query protein covered by the hit
  • Max. redundancy: filters out sequences that are too similar to one another
  • Max. number of homologues, if wanted

Filter Results via Max-Planck (screenshots, http://toolkit.tuebingen.mpg.de/blammer)
• Forwarded PSI-BLAST result
• Filtering parameters
• Save & then re-align!

Align the BLAST sequences

Align via Max-Planck
http://toolkit.tuebingen.mpg.de/sections/alignment
1. Forward the BLAST result to MUSCLE, MAFFT etc.: choose the program, and use hits or full sequences.
2. Filter via BLAMMER and then align: upload the file downloaded from BLAMMER.
Alignment results: save the alignment.

Alignment viewing & editing
BioEdit
• http://www.mbio.ncsu.edu/BioEdit/BioEdit.html
• Easy-to-use sequence alignment editor
• View and manipulate alignments of up to 20,000 sequences.
• Four modes of manual alignment: select and slide, dynamic grab and drag, gap insert and delete by mouse click, and on-screen typing, which behaves like a text editor.
• Reads and writes GenBank, FASTA, Phylip 3.2, Phylip 4, and NBRF/PIR formats. Also reads GCG and Clustal formats.

Alignment viewing & editing
Easiest using BioEdit (http://www.mbio.ncsu.edu/BioEdit/bioedit.html)
• Find a specific sequence: "Edit -> search -> in titles"
• Erase/add sequences: "Edit -> cut\paste\delete sequence"
• "Sequence Identity matrix" under "Alignment" is useful for a rough evaluation of distances within the alignment.
• After taking out sequences, "Minimize Alignment" under "Alignment" removes unnecessary gaps.
• Can save an image using "File -> Graphic View" & then "Edit -> Copy page as BITMAP"

No "miracle solution"
Each sequence is a different story, so adjust the parameters:
• BLAST: E-value, substitution matrix, gap penalties, database, minimum length, redundancy level, fragment overlap…
• PSI-BLAST: BLAST parameters + PSSM inclusion threshold (or choose manually), number of rounds…
• Try using HSPs or full sequences, different MSA programs…

THANKS
Some slides were taken from previous presentations by members of the Pupko lab and Prof. Beni Chor
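
Example: filtering BLAST hits before alignment
The filtering criteria listed on the BLAMMER slide (E-value, minimum coverage, maximum redundancy, maximum number of homologues) can be illustrated with a short Python sketch. This is not BLAMMER's actual algorithm or its defaults: the Hit record, the field names, the threshold values and the crude position-by-position identity measure are all assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One BLAST hit; a hypothetical record, not a real parser output."""
    name: str
    evalue: float
    aligned_length: int  # number of query residues covered by the hit
    sequence: str        # hit residues (used only for the toy identity below)


def identity(a: str, b: str) -> float:
    """Crude identity: fraction of equal positions over the shorter length.

    A real pipeline would compute identity from a pairwise alignment.
    """
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / n


def filter_hits(hits, query_length, max_evalue=1e-3, min_coverage=0.6,
                max_identity=0.95, max_hits=None):
    """Keep hits that pass the E-value, coverage and redundancy filters.

    All default thresholds are illustrative assumptions.
    """
    kept = []
    # Best (lowest) E-value first, so the better of two redundant hits is kept.
    for hit in sorted(hits, key=lambda h: h.evalue):
        if hit.evalue > max_evalue:
            continue  # not significant enough
        if hit.aligned_length / query_length < min_coverage:
            continue  # fragment: covers too little of the query
        if any(identity(hit.sequence, k.sequence) > max_identity for k in kept):
            continue  # redundant: too similar to a hit already kept
        kept.append(hit)
        if max_hits is not None and len(kept) >= max_hits:
            break
    return kept
```

To use the sketch, build a list of Hit records from the saved BLAST result, call filter_hits(hits, query_length=273) for the DAPB_ECOLI query, and pass the surviving sequences to MUSCLE or another MSA program. Loosening or tightening the thresholds is exactly the kind of per-protein adjustment the last slide recommends.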
