Review Concepts
20040715
Chuong Huynh, NCBI

Pairwise Sequence Alignments
• Purpose:
  • identification of sequences with significant similarity to one or more sequences in a sequence repository
  • identification of all homologous sequences in the repository
  • identification of domains with sequence similarity
• Terminology:
  • Global alignment
  • Local alignment

Terminology: Global Alignment
• Finds the optimal alignment over the entire length of the two compared sequences
• Unlikely to detect genes that have evolved by recombination (e.g. domain shuffling) or insertion/deletion of DNA
• Suitable for sequences of homologous molecules

Terminology: Local Alignment
• Finds short regions of similarity between a pair of sequences
• Compared sequences can receive high local similarity scores without needing high levels of similarity over their entire length
• Useful when looking for domains within proteins or for regions of genomic DNA that contain coding exons

An alignment that BLAST can’t find

  1 GAATATATGAAGACCAAGATTGCAGTCCTGCTGGCCTGAACCACGCTATTCTTGCTGTTG
    || | || || || |  || || ||   ||  |  ||| |||||| | | || | ||| |
  1 GAGTGTACGATGAGCCCGAGTGTAGCAGTGAAGATCTGGACCACGGTGTACTCGTTGTCG

 61 GTTACGGAACCGAGAATGGTAAAGACTACTGGATCATTAAGAACTCCTGGGGAGCCAGTT
    | || ||     ||  ||| ||  | |||||| || | ||||||  |||||  |     |
 61 GCTATGGTGTTAAGGGTGGGAAGAAGTACTGGCTCGTCAAGAACAGCTGGGCTGAATCCT

121 GGGGTGAACAAGGTTATTTCAGGCTTGCTCGTGGTAAAAAC
    |||| || ||||| ||  ||    |  | ||||  || |||
121 GGGGAGACCAAGGCTACATCCTTATGTCCCGTGACAACAAC

BLAST Selection Matrix

Choosing the Right BLAST Flavor for Proteins (Claverie & Notredame 2003)
What You Want to Do | The Right BLAST Flavor
Find out something about the function of the protein | Use blastp to compare your protein with other proteins contained in the databases.
Discover new genes encoding similar proteins | Use tblastn to compare your protein with DNA sequences translated into their 6 possible reading frames.

Choosing the Right BLAST Flavor for DNA (Claverie & Notredame 2003)
Question | Answer
Am I interested in non-coding DNA? | Yes: use blastn. Remember: blastn is only for closely related DNA sequences (more than 70% identical).
Do I want to discover new proteins? | Yes: use tblastx.
Do I want to discover proteins encoded in my query DNA sequences? | Yes: use blastx.
Am I unsure of the quality of my DNA? | Yes: use blastx, especially if you suspect your DNA sequence codes for a protein but may contain sequencing errors.

Choosing the Right BLAST Flavor for DNA Sequences (Claverie & Notredame 2003)
Usage | Query | Database | Program
Find very similar DNA sequence | DNA | DNA | blastn
Protein discovery and ESTs | Translated DNA | Translated DNA | tblastx
Analysis of query DNA sequence | Translated DNA | Protein | blastx

BLAST Tips
• It is faster and more accurate to BLAST proteins (blastp) than nucleotides.
• If in doubt, use blastp.
• When possible, restrict the search to the subset of the database you are interested in.
• Look around for the database you need, or create your own custom BLAST database. But how?
• When is the best time to use the BLAST server?

Asking Biological Problems with BLAST (Claverie & Notredame 2003)

Finding genes in a genome
• General (but more complicated) method: run gene prediction software or an ORF finder (for bacteria).
• Computational method using BLAST: cut your genome sequence into small (2-5 kb) overlapping sequences. Use blastx to BLAST each piece of the genome against NR (nonredundant protein database). Works better for sequences with no introns (bacteria).

Predicting protein function
• General (but more complicated) method: domain analysis or wet-lab experimentation.
• Computational method using BLAST: use blastp to BLAST your protein sequence against SWISS-PROT (future = UniProt). If you get a good hit (more than 25% identity) over the complete length of the protein, then your protein probably has the same function as the SWISS-PROT protein.

Predicting protein 3-D structure
• General (but more complicated) method: homology modeling, X-ray, or NMR analysis of the protein of interest.
• Computational method using BLAST: use blastp to BLAST your protein against PDB (protein structure database). If you get a hit with more than 25% identity, then your protein and the good hit(s) have a similar 3-D structure.

Finding protein family members
• General (but more complicated) method: clone new family members using PCR techniques.
• Computational method using BLAST: use blastp (or better, PSI-BLAST) and run against NR (nonredundant protein database). Once you have all members of the family, you can build a multiple sequence alignment and a phylogenetic tree.

BLAST and PSI-BLAST Servers on the Internet
Country | Program | URL
USA | BLAST/PSI-BLAST | http://www.ncbi.nlm.nih.gov/BLAST
USA | BLAST | http://genome.wustl.edu/gsc/BLAST
Europe | BLAST | http://www.ch.embnet.org/software/bBLAST.html
Europe | BLAST | http://www.ebi.ac.uk/blast2/
Japan | BLAST/PSI-BLAST | http://www.ddbj.nig.ac.jp/E-mail/homology.html

Common Mistake
Sequence 1: AAAAAABBBBBB
Sequence 2: AAAAAA
Sequence 3: BBBBBB
• Seq1 has domains A and B; Seq2 has domain A and Seq3 has domain B.
• Use Seq1 as the query sequence.
• What happens? Both hits may be highly significant (very low E-values) if domains A and B are long and well conserved.
• Seq1 is homologous to Seq2 and Seq3, but remember that Seq1 is not homologous to either of them over its entire length. Seq2 and Seq3 share no domain with each other.
• Just don’t depend on the E-value alone.
• “BLAST hits are not transitive, unless the alignments are overlapping”
• Most proteins have more than one domain, so be careful when looking at BLAST results: not all reported hits belong to the same big family.
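The transitivity rule in the Common Mistake example can be checked mechanically: two hits to the same query only suggest a shared domain if their alignments cover overlapping query positions. The following is a minimal sketch, not part of the original slides. The `Hit` record and the function names are hypothetical, and the coordinates are assumed to be the 1-based, inclusive query start and end positions that BLAST reports for each alignment.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """Hypothetical record for one BLAST hit (illustration only)."""
    subject: str
    q_start: int  # first query position covered by the alignment (1-based)
    q_end: int    # last query position covered by the alignment (inclusive)


def query_overlap(a: Hit, b: Hit) -> int:
    """Return how many query positions are covered by both alignments."""
    return max(0, min(a.q_end, b.q_end) - max(a.q_start, b.q_start) + 1)


def may_share_domain(a: Hit, b: Hit, min_overlap: int = 1) -> bool:
    """True only if the two alignments overlap on the query sequence."""
    return query_overlap(a, b) >= min_overlap


# Toy coordinates taken from the example: Seq1 = AAAAAABBBBBB,
# so domain A covers positions 1-6 and domain B covers positions 7-12.
seq2 = Hit("Seq2", 1, 6)
seq3 = Hit("Seq3", 7, 12)
print(may_share_domain(seq2, seq3))  # the alignments do not overlap
```

With these toy coordinates the two hits do not overlap, so nothing can be inferred about a relationship between Seq2 and Seq3, however significant each hit is on its own.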
Alternative Methods for Homology Searches
• Smith-Waterman (ssearch): slower but more accurate
• FASTA: slower than BLAST, but more accurate when comparing DNA
• BLAT: for locating cDNA in a genome or finding close proteins in a genome

Common Questions
• When I run the same query sequence through WU-BLAST and NCBI BLAST, I get different results. Both are based on the same algorithm but are different implementations. So why the difference?
Usually this is due to slight variation in the database version; differences in the BLAST program version also play a minor role. The results usually do not change dramatically, but they do change a bit.

Basic Gene Prediction Flow Chart
Obtain new genomic DNA sequence
1. Translate in all six reading frames and compare to protein sequence databases
2. Perform database similarity search of the expressed sequence tag (EST) database of the same organism, or of cDNA sequences if available
Use gene prediction program to locate genes
Analyze regulatory sequences in the gene

The Annotation Process
[Diagram: DNA sequence, analysis software, useful information, annotator]

Annotation Process
[Diagram: DNA sequence. Tools shown: RepeatMasker, Blastn, Fasta, gene finders, Blastx, Halfwise, tRNA scan, BlastP, Pfam, Prosite, Psort, SignalP, TMHMM. Features shown: repeats, promoters, rRNA, tRNA, genes, pseudo-genes]

How do I do large scale genome analysis?
• Read Koonin’s book on the NCBI Bookshelf

Demo - TaxPlot
TaxPlot is a tool for three-way comparisons of genomes on the basis of the protein sequences they encode.
http://www.ncbi.nlm.nih.gov/sutils/taxik2.cgi

Demo - VecScreen
http://www.ncbi.nlm.nih.gov/VecScreen/VecScreen.html
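Step 1 of the Basic Gene Prediction Flow Chart above, translating a DNA sequence in all six reading frames, can be sketched as follows. This is an illustration and not the code used by blastx or any NCBI tool. It assumes the standard genetic code, writes stop codons as "*", and writes any codon containing a character other than A, C, G or T as "X". The function names are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna: str) -> str:
    """Translate complete codons from position 0; unknown codons become X."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna: str) -> dict:
    """Map frame labels (+1, +2, +3, -1, -2, -3) to protein strings."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for frame, protein in six_frame_translations("ATGGCCTAA").items():
    print(frame, protein)
```

Each of the six protein strings can then be compared to a protein database, which is what blastx does internally for a DNA query.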