Download Hands On - Gene Prediction in Prokaryotes file

Gene Prediction in Prokaryotes

Gene prediction is the process of identifying the regions of genomic DNA that encode genes. This step follows the sequencing of a species' genome. In general, the process includes identifying protein-coding regions, RNA genes, and regulatory regions.

The three major categories of gene prediction algorithms are alignment-based, sequence-based, and content-based. Some algorithms are consensus-based, i.e. they combine the results of multiple programs.

Alignment-based algorithms search for orthologs of the query sequence. If an ortholog is found, one may infer that the queried gene probably has a similar structure and function. BLAST is widely used for this approach. However, this method does not work for a new gene that currently has no ortholog in the databases.

Sequence-based algorithms look for gene signals, i.e. certain patterns found in a gene. These include the start codon (ATG; AUG in mRNA), stop codons (TAA, TAG, TGA), promoter sequences, intron-exon boundaries (GT, AG), the Shine-Dalgarno sequence or ribosomal binding site (5’ AGGAGG), and polyA tails. ORF-Finder programs are good examples of sequence-based algorithms.

Content-based algorithms look for characteristics or patterns of coding regions that differ significantly from those found in non-coding regions. These include nucleotide frequency or codon frequency in a particular organism. Identifying CpG islands is an example of this method.

Gene prediction is comparatively easier in prokaryotes than in eukaryotes. The bacterial genome usually consists of a single circular chromosome and ranges in size from 1 million to 10 million bp; eukaryotic genomes are much larger and range in size from 10 Mbp to 670 Gbp. Prokaryotes have high gene density: 85% of the genome consists of coding genes.
Eukaryotes have low gene density: 3% of the genome codes for genes, and the very large spaces between genes contain repetitive sequences and transposable elements.

This module is devoted to exploring prokaryotic genes.

Identifying an ORF

The usual start codon in prokaryotes is ATG. However, GTG and TTG may also act as START signals. In a non-coding sequence, a START codon occurs about every 20 codons by chance, and the same holds for the three STOP codons (3 of the 64 possible codons). Any frame longer than 30 codons without a STOP is therefore a suspect ORF. The threshold is usually set at 50-60 codons. Furthermore, a ribosomal binding site (Shine-Dalgarno sequence) located upstream of the START codon helps verify the presence of an ORF (protein-coding region). Further confirmation of a protein-coding region can be obtained by translating the sequence into a protein sequence and searching protein databases for orthologs.

We will use several gene prediction programs to identify putative genes. Each gene may be translated into a protein sequence and searched against protein databases for orthologs. However, the emphasis in this document is on sequence-based algorithms:

1. ORF Finder program http://www.ncbi.nlm.nih.gov/genome/browse/
2. Easy Gene http://www.cbs.dtu.dk/services/EasyGene/
3. Neural Network Promoter Prediction http://www.fruitfly.org/seq_tools/promoter.html
4. Virtual Footprint http://www.prodoric.de/vfp/
5. Glimmer3 - NCBI and gene ID http://www.ncbi.nlm.nih.gov/genomes/MICROBES/glimmer_3.cgi

I. Find a known gene and acquire its DNA sequence

1. Open the IMG home page - http://img.jgi.doe.gov/cgi-bin/edu/main.cgi
2. Under Find Genome, select Genome Search.
3. Enter Cryptobacterium curtum in the Key Word search field; select Filter Genome Name. Click GO.
4. Search results will be displayed. Click Cryptobacterium curtum DSM 15641 under Genome Name.
5. The resulting page shows the details of Cryptobacterium curtum DSM 15641 in four sections: Overview; Statistics; Expression Studies; Genes.
6. Review the information and answer the following questions:
   i. What disease is caused by Cryptobacterium curtum? Hint: Look under Habitat in Overview.
   ii. How many protein-coding genes are displayed?
7. Click on the number (of genes) displayed and on the resulting page find Gene ID 644993009 - Fe-dependent oxidoreductase, alcohol dehydrogenase.
8. Click on the ID No. 644993009 to see the Gene Information. Under External Links, click on GI:256826541.
9. Detailed information on the gene will be displayed. Click on FASTA at the top and download the protein sequence displayed. Save the sequence.

II. Find ORFs with NCBI ORF Finder program

1. Open the NCBI Education Website http://www.ncbi.nlm.nih.gov/
2. Click on Nucleotide under NCBI Educational Resources.
3. Enter GenBank Acc. No. U33186 in the search box; click Search.
4. On the page displayed, click on FASTA to download the complete sequence of Mycobacterium fortuitum/Escherichia coli shuttle vector pSUM36.
5. Click on the ORF Finder program link http://www.ncbi.nlm.nih.gov/genome/browse/
6. Either paste the sequence of plasmid vector pSUM36 or enter the GenBank Accession No. U33186 for plasmid vector pSUM36. Click ORFFind.
7. All ORFs in six reading frames will be displayed. Examine the output and answer the following questions:
   1. What is a plasmid?
   2. What are reading frames; why six?
   3. What is the minimum length of an ORF?
   4. What is displayed when you click on an ORF or its graphic?
   5. How will you carry out a BLAST search on the ORF you are interested in?

III. Find genes with EasyGene

The EasyGene 1.2 server produces a list of predicted genes given a sequence of prokaryotic DNA. The current version contains models for 138 different organisms. The user can pick the one of these 138 organisms that is most closely related to the organism from which the query sequence was taken. These prokaryotic genomes provide training sets for the program. EasyGene will pick potential ORFs with scores above a specified threshold.

1. Open the NCBI Education Website http://www.ncbi.nlm.nih.gov/
2. Click on Nucleotide under NCBI Educational Resources.
3. Search for pBR322 Plasmid in NCBI Nucleotide search (Ref: Microbiology, 5th Ed., p. 334, Prescott, Harley, Klein).
4. On the page displayed, click on FASTA to download the complete sequences of BiFC vector pFAGN173 and Escherichia coli ETEC H10407 plasmid p52.
5. Connect to the EasyGene 1.2 server. http://www.cbs.dtu.dk/services/EasyGene/
6. Click Browse and point to the saved sequence of BiFC vector pFAGN173. (Alternatively, you may copy and paste the sequence into the text box.)
7. Select Escherichia coli K12, a closely related organism (from the 138 model organisms). Select the R-value of 2.
8. Click Submit.
9. Repeat steps 6-8 for Escherichia coli ETEC H10407 plasmid p52.

Review the outputs. How many ORFs were found in BiFC vector pFAGN173 and in Escherichia coli ETEC H10407 plasmid p52?

Exercise (Optional): Use the NCBI ORF Finder program to identify ORFs in BiFC vector pFAGN173. Compare the results with those from EasyGene. Do you see any differences?

IV. Neural Network Promoter Prediction

a. Berkeley Drosophila Genome Project (BDGP)

This program searches for -10 and -35 promoter sequences in a DNA sequence of a prokaryote.

1. Search for the Mycobacterium tuberculosis gene fas in the NCBI Gene database.
2. On the search page, click on FASTA under Reference Assembly, Genome and download the sequence.
3. Open the Neural Network Promoter Prediction site http://www.fruitfly.org/seq_tools/promoter.html This will open up the Berkeley Drosophila Genome Project web site.
4. Select Prokaryote.
5. Paste the sequence in the box and click Submit.
6. The program predicts two promoters. The transcription start is shown in larger font.

b. Virtual Footprint

This program searches for regulatory binding sites in a given DNA sequence.

1. Open Virtual Footprint http://www.prodoric.de/vfp/
2. Click Promoter Analysis. It analyzes the promoter region with several regulatory patterns.
3. Paste the raw sequence (remove the first line of the FASTA sequence) in the “Paste Raw Sequence” box.
4. Select the pattern “Sig 70 (-10) Escherichia coli” and click START to start searching.
5. Three binding sites are displayed by default. This number can be changed if needed.

V. Glimmer3

Glimmer (Gene Locator and Interpolated Markov ModelER) can be used to find genes in bacteria, archaea, and viruses. Glimmer3 uses interpolated Markov models (IMMs) to identify coding regions and distinguish them from noncoding DNA. Glimmer was developed at The Institute for Genomic Research (TIGR). It has been used to annotate the complete genomes of over 100 bacterial species. It can be accessed at the NCBI Glimmer site or at the geneid web site.

1. Open the NCBI Education Website http://www.ncbi.nlm.nih.gov/
2. Click on Genome under NCBI Educational Resources.
3. Click on Browse By Organism.
4. Search for Cryptobacterium curtum. On the next page, click on Cryptobacterium curtum (at the bottom of the screen).
5. On the page displayed, click on FASTA to download the complete genome of the Cryptobacterium curtum DSM 15641 chromosome.
6. Open Glimmer3 - http://www.ncbi.nlm.nih.gov/genomes/MICROBES/glimmer_3.cgi This will open up the Microbial Genome Annotation Tools page.
7. Select (Bacteria, Archaea) under Genetic code.
8. Select Circular under Topology.
9. Paste the sequence in the box and click Run Glimmer v3.02.
   Note: Since the Cryptobacterium curtum genome is 1.5 Mb long, only a part of the genome was pasted in the above exercise.
10. Repeat the above exercise with the Mycobacterium tuberculosis genome.
11. Open Glimmer3 - http://www.ncbi.nlm.nih.gov/genomes/MICROBES/glimmer_3.cgi This will open up the Microbial Genome Annotation Tools page.
12. Select 4 (Mycoplasma/Spiroplasma) under Genetic code.
13. Select Circular under Topology.
14. Paste the sequence in the box and click Run Glimmer v3.02.
15. The Glimmer3 output displays the predictions for the Mycobacterium tuberculosis genome.
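
Sketch: scanning six reading frames for ORFs

The ORF criteria described earlier in this handout (ATG, GTG, or TTG as START; TAA, TAG, or TGA as STOP; six reading frames; a minimum length threshold) can be sketched in a few lines of Python. This is only a minimal illustration under simplifying assumptions, not the algorithm used by NCBI ORF Finder, EasyGene, or Glimmer3. The function names (find_orfs, orfs_in_strand, prob_no_stop) and the default threshold of 50 codons are choices made for this example. It does not check for a Shine-Dalgarno sequence and treats the sequence as linear.

```python
"""Minimal six-frame ORF scan (illustrative; not the NCBI ORF Finder algorithm)."""

START_CODONS = {"ATG", "GTG", "TTG"}  # START signals named in this handout
STOP_CODONS = {"TAA", "TAG", "TGA"}   # STOP codons named in this handout
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_strand(seq, min_codons):
    """Yield (frame, start, end) for ORFs in the three frames of one strand.

    start and end are 0-based offsets into seq; end is exclusive and
    includes the STOP codon. Each ORF begins at the first START codon
    that follows the previous STOP in the same frame.
    """
    for frame in range(3):
        orf_start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if orf_start is None and codon in START_CODONS:
                orf_start = i
            elif orf_start is not None and codon in STOP_CODONS:
                # Length counts the START codon but not the STOP codon.
                if (i - orf_start) // 3 >= min_codons:
                    yield frame, orf_start, i + 3
                orf_start = None


def find_orfs(seq, min_codons=50):
    """Scan all six reading frames and return one dict per ORF found.

    Coordinates of "-" strand ORFs refer to the reverse-complemented sequence.
    """
    seq = seq.upper()
    results = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame, start, end in orfs_in_strand(s, min_codons):
            results.append({"strand": strand, "frame": frame + 1,
                            "start": start, "end": end,
                            "codons": (end - start) // 3 - 1})
    return results


def prob_no_stop(n_codons):
    """Chance that n random codons contain no STOP.

    Assumes all 64 codons are equally likely, which real genomes only
    approximate.
    """
    return (61 / 64) ** n_codons


if __name__ == "__main__":
    demo = "CC" + "ATG" + "GCT" * 5 + "TAA" + "GG"  # made-up test sequence
    for orf in find_orfs(demo, min_codons=3):
        print(orf)
    print(prob_no_stop(30), prob_no_stop(60))
```

Under the equal-frequency assumption, a run of 30 codons stays free of STOP codons by chance roughly a quarter of the time, while a run of 60 codons does so only about 6% of the time. This is one way to see why a 50-60 codon threshold gives fewer false ORFs than a 30 codon threshold.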
