Gene Prediction in Prokaryotes
Gene prediction is the process of identifying the regions of genomic DNA that encode genes.
This step follows the sequencing of a species' genome. In general, the process
includes identifying protein-coding regions, RNA genes, and regulatory regions.
The three major categories of gene prediction algorithms are alignment-based, sequence-based,
and content-based. Some algorithms are consensus-based, i.e., they combine the results of
multiple programs.
Alignment-based algorithms search for orthologs of the query sequence. If an
ortholog is found, one may infer that the gene being queried probably has a similar
structure and function. BLAST is widely used for this approach. However, this
method will not work for a novel gene that has no ortholog in the databases yet.
Sequence-based algorithms look for gene signals, i.e., certain patterns found in a gene. These
include the start codon (ATG), stop codons (TAA, TAG, TGA), promoter sequences, intron-exon
boundaries (GT, AG), the Shine-Dalgarno sequence or ribosomal binding site (5'-AGGAGG-3'), and
poly(A) tails. ORF Finder programs are good examples of sequence-based algorithms.
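As a rough illustration of signal searching, the sketch below looks for start codons that are preceded by a Shine-Dalgarno-like motif. It is only a minimal example: the function name, the 20-base upstream window, and the minimum of 4 matching bases out of 6 are assumptions chosen for illustration, not parameters of any program used in this module.

```python
START_CODONS = ("ATG", "GTG", "TTG")
SD_MOTIF = "AGGAGG"  # Shine-Dalgarno consensus quoted in the text


def find_starts_with_sd(dna, window=20, min_match=4):
    """Return (start_pos, sd_pos, matches) for each start codon that has an
    SD-like motif in the `window` bases upstream (0-based positions).

    `window` and `min_match` are illustrative assumptions.
    """
    dna = dna.upper()
    hits = []
    for i in range(len(dna) - 2):
        if dna[i:i + 3] not in START_CODONS:
            continue
        upstream_begin = max(0, i - window)
        upstream = dna[upstream_begin:i]
        best = None
        # Slide the 6-base motif over the upstream window; keep the best match.
        for j in range(len(upstream) - len(SD_MOTIF) + 1):
            kmer = upstream[j:j + len(SD_MOTIF)]
            score = sum(a == b for a, b in zip(kmer, SD_MOTIF))
            if score >= min_match and (best is None or score > best[1]):
                best = (upstream_begin + j, score)
        if best is not None:
            hits.append((i, best[0], best[1]))
    return hits
```

A start codon reported by this function is only a candidate; the signal alone does not prove that a gene is present.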
Content-based algorithms look for certain characteristics or patterns of coding regions which
are significantly different from those found in non-coding regions. These include nucleotide
frequency or codon usage frequency in a particular organism. Identifying CpG islands is an example
of this method.
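Two simple content measures can be computed directly from a sequence: the GC content and the ratio of observed to expected CpG dinucleotides, which is commonly used when looking for CpG islands. The sketch below is a minimal example; the function names are hypothetical, and the expected CpG count is taken as (number of C x number of G) / sequence length.

```python
def gc_content(dna):
    """Fraction of G and C among the A, C, G, T bases of the sequence."""
    dna = dna.upper()
    n = sum(dna.count(b) for b in "ACGT")
    return (dna.count("G") + dna.count("C")) / n if n else 0.0


def cpg_obs_exp(dna):
    """Observed/expected CpG ratio: (#CG * length) / (#C * #G)."""
    dna = dna.upper()
    c, g = dna.count("C"), dna.count("G")
    if c == 0 or g == 0:
        return 0.0
    # "CG" cannot overlap itself, so str.count gives the dinucleotide count.
    return dna.count("CG") * len(dna) / (c * g)
```

In practice such values are computed in a sliding window along the genome and compared between regions, rather than once for the whole sequence.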
Gene prediction is comparatively easier in prokaryotes than in eukaryotes. The bacterial genome
usually consists of a single circular chromosome and ranges in size from 1 million to 10
million bp; eukaryotic genomes are much larger and range in size from 10 Mbp to 670 Gbp.
Prokaryotes have high gene density: about 85% of the genome consists of coding genes. Eukaryotes
have low gene density: about 3% of the genome codes for genes, and the spaces between genes are
very large and rich in repetitive sequences and transposable elements. This module is devoted to
exploring prokaryotic genes.
Identifying an ORF. The usual start codon in prokaryotes is ATG. However, GTG and TTG may
also act as START signals. A STOP codon occurs about once every 20 codons by chance in a
noncoding sequence. Any frame longer than 30 codons without a STOP is a suspect ORF. The
threshold is usually set at 50-60 codons. Furthermore, a ribosomal binding site (Shine-Dalgarno
sequence) located upstream of the START codon supports the presence of an ORF (protein-coding
region). Further confirmation of a protein-coding region can be obtained by translating the
sequence into a protein sequence and searching a protein database for orthologs.
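The length thresholds can be motivated with a short calculation. Assuming a random sequence with equal base frequencies (a simplification; real genomes deviate from it), 3 of the 64 codons are stop codons, so a stop is expected about once every 21 codons, and the chance that a frame of n codons contains no stop is (61/64)^n. The function names below are hypothetical.

```python
STOP_FRACTION = 3 / 64  # TAA, TAG, TGA out of 64 codons, equal base frequencies assumed


def mean_codons_between_stops():
    """Expected spacing of stop codons in random sequence (about 21 codons)."""
    return 1 / STOP_FRACTION


def prob_stop_free(n_codons):
    """Probability that n_codons consecutive random codons contain no stop."""
    return (1 - STOP_FRACTION) ** n_codons


if __name__ == "__main__":
    for n in (30, 60, 100):
        print(n, round(prob_stop_free(n), 4))
```

Under this assumption, roughly 24% of random 30-codon frames are stop-free, but only about 6% of 60-codon frames, which is one way to see why longer thresholds give fewer false positives.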
We will use several gene prediction programs to identify putative genes. Each gene may be
translated into protein sequence and searched against protein databases for orthologs. However,
the emphasis, in this document, will be on sequence-based algorithms:
1. ORF Finder program
http://www.ncbi.nlm.nih.gov/genome/browse/
2. EasyGene
http://www.cbs.dtu.dk/services/EasyGene/
3. Neural Network Promoter Prediction
http://www.fruitfly.org/seq_tools/promoter.html
4. Virtual Footprint
http://www.prodoric.de/vfp/
5. Glimmer3 (NCBI and geneid)
http://www.ncbi.nlm.nih.gov/genomes/MICROBES/glimmer_3.cgi
I. Find a known gene and acquire its DNA sequence
1. Open IMG home page - http://img.jgi.doe.gov/cgi-bin/edu/main.cgi
2. Under Find Genome, select Genome Search.
3. Enter Cryptobacterium curtum in the Key Word search field; Select
Filter Genome Name. Click GO.
4. Search results will be displayed. Click Cryptobacterium curtum DSM
15641 under Genome Name.
5. The resulting page shows the details of Cryptobacterium curtum, DSM
15641 in four sections:
Overview; Statistics; Expression Studies; Genes.
6. Review the information and answer the following questions:
i. What disease is caused by Cryptobacterium curtum?
Hint: Look under Habitat in Overview.
ii. How many protein coding genes are displayed?
7. Click on the number (of genes) displayed and on the resulting page find Gene
ID 644993009 - Fe-dependent oxidoreductase, alcohol dehydrogenase.
8. Click on the ID No. 644993009 to see the Gene Information.
Under External Links, click on GI:256826541.
9. Detailed information on the gene will be displayed. Click on FASTA on the top
and Download the protein sequence displayed. Save the sequence.
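A saved FASTA file can also be read with a few lines of code, which is convenient when the same sequence is used in several of the exercises below. This is a minimal sketch using only the standard library; the function name is hypothetical, and the file name you pass to `open()` is whatever name you chose when saving.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records
```

For example, `parse_fasta(open(path).read())` returns one tuple per record, with the header line stripped of its leading ">" and the sequence lines joined into a single string.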
II. Find ORFs with NCBI ORF Finder program
1. Open NCBI Education Website
http://www.ncbi.nlm.nih.gov/
2. Click on Nucleotide under NCBI Educational Resources.
3. Enter GenBank Acc. No. U33186 in the search box; click search.
4. On the page displayed click on FASTA to download the complete sequence of
Mycobacterium fortuitum/Escherichia coli shuttle vector pSUM36
5. Click on the ORF Finder program link http://www.ncbi.nlm.nih.gov/genome/browse/
6. Either paste the sequence of Plasmid vector pSUM36 or enter the GenBank Accession
No. U33186 for Plasmid vector pSUM36. Click ORFFind.
7. All ORFs in six reading frames will be displayed.
Examine the output and answer the following questions:
1. What is a plasmid?
2. What are Reading Frames; why six?
3. What is the minimum length of an ORF?
4. What is displayed when you click on an ORF or its graphic?
5. How will you carry out BLAST search on the ORF you are interested in?
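To get a feel for what an ORF search does, the sketch below scans all six reading frames: three on the given strand and three on its reverse complement. It is a simplified illustration, not the algorithm of the NCBI ORF Finder; the function names and the default minimum length of 50 codons are assumptions, and only ORFs that run from a start codon (ATG, GTG, TTG) to the next in-frame stop are reported.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(dna):
    """Reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq, frame, min_codons):
    """ORFs in one frame as (begin, end) with end exclusive, stop included."""
    orfs = []
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon in START_CODONS:
            start = i  # first start codon after the previous stop
        elif start is not None and codon in STOP_CODONS:
            if (i - start) // 3 >= min_codons:  # codons before the stop
                orfs.append((start, i + 3))
            start = None
    return orfs


def find_orfs(dna, min_codons=50):
    """Return (strand, frame, begin, end, sequence) for ORFs in six frames.

    Coordinates on the "-" strand refer to the reverse-complemented sequence.
    """
    dna = dna.upper()
    results = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            for begin, end in orfs_in_frame(seq, frame, min_codons):
                results.append((strand, frame + 1, begin, end, seq[begin:end]))
    return results
```

Lowering `min_codons` returns more, mostly spurious, ORFs; raising it discards short genuine genes. Comparing the result with the ORF Finder output for the same sequence is a useful check.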
III. Find genes with EasyGene
The EasyGene 1.2 server produces a list of predicted genes given a sequence of
prokaryotic DNA. The current version contains models for 138 different organisms.
The user can pick one of these 138 organisms that is closely related to the one from
which the query sequence was taken. These prokaryotic genomes provide training sets for
the program. EasyGene will pick potential ORFs with scores above a specified threshold.
1. Open NCBI Education Website
http://www.ncbi.nlm.nih.gov/
2. Click on Nucleotide under NCBI Educational Resources.
3. Search for pBR322 Plasmid in NCBI, Nucleotide search
(Ref: Microbiology, 5th Ed. P. 334, Prescott, Harley, Kline)
4. On the page displayed click on FASTA to download the complete sequences of
BiFC vector pFAGN173 and
Escherichia coli ETEC H10407 plasmid p52
5. Connect to EasyGene 1.2 server.
http://www.cbs.dtu.dk/services/EasyGene/
6. Click Browse and point to the saved sequence of BiFC vector pFAGN173.
(Alternatively, you may copy and paste the sequence in the text box).
7. Select Escherichia coli K12, a closely related organism (from 138 Model organisms).
Select the R-value of 2.
8. Click Submit.
9. Repeat steps 6-8 for Escherichia coli ETEC H10407 plasmid p52.
Review the outputs. How many ORFs were found in BiFC vector pFAGN173 and in Escherichia coli
ETEC H10407 plasmid p52?
Exercise (Optional):
Use NCBI ORF Finder program to identify ORFs in BiFC vector pFAGN173.
Compare the results with those from EasyGene. Do you see any differences?
IV. Neural Network Promoter Prediction
a. Berkeley Drosophila Genome Project (BDGP)
This program searches for -10 and -35 promoter sequences in a DNA sequence of a
prokaryote.
1. Search for Mycobacterium tuberculosis gene – fas in NCBI Gene database.
2. On the search page click on FASTA under Reference Assembly, Genome and download
the sequence.
3. Open the Neural Network Promoter Prediction site http://www.fruitfly.org/seq_tools/promoter.html
This will open the Berkeley Drosophila Genome Project web site.
4. Select Prokaryote.
5. Paste the sequence in the box and click Submit.
6. The program predicts two promoters. The transcription start is shown in larger font.
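For comparison with the server output, a much simpler consensus search can be sketched in a few lines. The example below is not the neural network used by the BDGP program. It only counts matches to the E. coli sigma-70 consensus hexamers (TTGACA at -35 and TATAAT at -10) separated by 16 to 18 bases; the function names and the score cutoff of 9 matches out of 12 are assumptions for illustration.

```python
MINUS_35 = "TTGACA"  # E. coli sigma-70 -35 consensus
MINUS_10 = "TATAAT"  # E. coli sigma-70 -10 consensus


def matches(kmer, consensus):
    """Number of positions at which kmer agrees with the consensus."""
    return sum(a == b for a, b in zip(kmer, consensus))


def scan_promoters(dna, min_score=9, spacers=range(16, 19)):
    """Return (pos_35, pos_10, spacer, score) for candidate promoters.

    Positions are 0-based starts of the two hexamers; score is the total
    number of matching bases (maximum 12). min_score is an assumed cutoff.
    """
    dna = dna.upper()
    hits = []
    for i in range(len(dna) - 5):
        s35 = matches(dna[i:i + 6], MINUS_35)
        for gap in spacers:
            j = i + 6 + gap
            if j + 6 > len(dna):
                break  # -10 hexamer would run past the end of the sequence
            score = s35 + matches(dna[j:j + 6], MINUS_10)
            if score >= min_score:
                hits.append((i, j, gap, score))
    return hits
```

A plain consensus count like this tends to report many weak candidates, which is one reason trained models such as the neural network are used instead.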
b. Virtual Footprint
This program searches for regulatory binding sites in a given DNA sequence.
1. Open Virtual Footprint
http://www.prodoric.de/vfp/
2. Click Promoter Analysis. It analyzes the promoter region with several
regulatory patterns.
3. Paste the raw sequence (remove the first line of the FASTA sequence) in
the “Paste Raw Sequence” box.
4. Select the pattern, “Sig 70 (-10) Escherichia coli” and click START to
start searching.
5. Three binding sites are displayed by default. This number can be changed
if needed.
V. Glimmer3
Glimmer (Gene Locator and Interpolated Markov ModelER) can be used to find genes in
bacteria, archaea, and viruses. Glimmer3 uses interpolated Markov models (IMMs) to
identify the coding regions and distinguish them from noncoding DNA. Glimmer was
developed at The Institute for Genomic Research (TIGR). It has been used to annotate the
complete genomes of over 100 bacterial species. It can be accessed at the NCBI Glimmer
site or at the geneid web site.
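The idea of scoring a sequence with Markov models can be illustrated with a much simpler version than the one Glimmer uses. The sketch below trains one fixed-order model on coding sequences and one on noncoding sequences, then scores a query by the log-odds of the two. It is an assumption-laden toy: it does not interpolate between orders and does not model codon position, both of which Glimmer's IMMs do; the function names, the order of 2, and the pseudocount are chosen for illustration only.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def train_markov(sequences, order=2, pseudocount=1.0):
    """Train a fixed-order Markov model; returns a function prob(context, base)."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in sequences:
        seq = seq.upper()
        for i in range(order, len(seq)):
            context, base = seq[i - order:i], seq[i]
            if base in ALPHABET:
                counts[context][base] += 1

    def prob(context, base):
        ctx = counts.get(context, {})
        total = sum(ctx.values()) + pseudocount * len(ALPHABET)
        # Unseen contexts fall back to a uniform 0.25 through the pseudocounts.
        return (ctx.get(base, 0.0) + pseudocount) / total

    return prob


def log_odds(seq, coding, noncoding, order=2):
    """Sum of log(P_coding / P_noncoding) over the bases of seq.

    Positive values favor the coding model, negative the noncoding model.
    """
    seq = seq.upper()
    score = 0.0
    for i in range(order, len(seq)):
        context, base = seq[i - order:i], seq[i]
        if base not in ALPHABET:
            continue
        score += math.log(coding(context, base) / noncoding(context, base))
    return score
```

With real data, the coding model would be trained on long ORFs or known genes from the genome itself and the noncoding model on intergenic sequence.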
1. Open NCBI Education Website
http://www.ncbi.nlm.nih.gov/
2. Click on Genome under NCBI Educational Resources.
3. Click on Browse By Organism
4. Search for Cryptobacterium curtum. On the next page, click on Cryptobacterium
curtum (at the bottom of the screen)
5. On the page displayed click on FASTA to download complete genome of
Cryptobacterium curtum DSM 15641 chromosome.
6. Open Glimmer3 - http://www.ncbi.nlm.nih.gov/genomes/MICROBES/glimmer_3.cgi
This will open up Microbial Genome Annotation Tools page.
7. Select (Bacteria, Archaea) under Genetic code.
8. Select Circular under Topology.
9. Paste the sequence in the box and click Run Glimmer v3.02.
Note: Since the Cryptobacterium curtum genome is 1.5 Mb long, only a part of the
genome was pasted in the above exercise.
10. Repeat the above exercise with the Mycobacterium tuberculosis genome.
11. Open Glimmer3 - http://www.ncbi.nlm.nih.gov/genomes/MICROBES/glimmer_3.cgi
This will open up Microbial Genome Annotation Tools page.
12. Select 4 (Mycoplasma/Spiroplasma) under Genetic code.
13. Select Circular under Topology.
14. Paste the sequence in the box and click Run Glimmer v3.02.
15. The Glimmer3 output displays the predicted genes in the Mycobacterium tuberculosis genome.