CS5238 Combinatorial methods in bioinformatics
Gene Finding (Draft Report, Aug 2002)
1.0 Team Objective
The objective of this first report is to give an overall review of gene-finding topics and
issues. We highlight some current issues and open problems in this area. In coming reports, our
team hopes to extend this survey and to research some of the noteworthy open problems.
2.0 Project Motivation
The size of GenBank is expected to exceed 3 billion base pairs upon the completion of the Human
Genome Project (HGP) in 2003. Well over 90% of the human genome is non-coding, that is, it is not
a template for a protein. Because of the size of the database, manually searching for the genes
that do code for proteins is not practical. Methods that help biologists focus their search are
needed. This is where gene finding comes into the picture. Gene identification and gene discovery
in new genomic sequences is one of the most time-consuming computational questions addressed by
bioinformatics scientists [1]. The accuracy of computational gene-finding methods has improved
significantly in the past few decades. Close-to-accurate prediction of the gene structures in an
extended genomic region can now be done before more detailed experimental studies.
3.0 Biological Background
- Gene expression: the biological process by which a DNA sequence gives rise to a protein
  (transcription followed by translation).
- Transcription: the process that produces an mRNA sequence from the DNA sequence. The enzyme
  performing the transcription is RNA polymerase.
- Translation: the process that synthesizes the protein from the mRNA with the help of
  ribosomes.
- Untranslated regions (UTR): the regions at both ends of the coding region which are
  transcribed into mRNA but do not code for the protein.
- Promoter region: a DNA sequence to which RNA polymerase binds prior to the initiation of
  transcription. It is usually found just upstream of the transcription start site of a gene.
  It controls the rate of gene expression.
- Reading frames: the 3 different ways to interpret a DNA strand as a series of codons,
  depending on where the coding starts (6 ways when both strands are considered).
- Open reading frame (ORF): a run of consecutive codons, usually beginning at a start codon,
  that is not interrupted by a stop codon.
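The reading-frame and ORF definitions above can be illustrated with a minimal Python sketch. It
assumes the standard genetic code (ATG as the start codon; TAA, TAG and TGA as stop codons) and an
input made only of the letters A, C, G and T. The function names are ours and are chosen for
illustration only.

```python
# Minimal sketch: reading frames and ORFs under the standard genetic code (assumption).
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the complementary strand, read in its own 5'-to-3' direction."""
    return seq.translate(COMPLEMENT)[::-1]


def codons(seq, frame):
    """Split seq into complete codons, starting at offset frame (0, 1 or 2)."""
    return [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]


def six_frames(seq):
    """Map a frame label ('+1'..'+3', '-1'..'-3') to the codons read in that frame."""
    rc = reverse_complement(seq)
    frames = {}
    for f in range(3):
        frames[f"+{f + 1}"] = codons(seq, f)
        frames[f"-{f + 1}"] = codons(rc, f)
    return frames


def first_orf(codon_list):
    """Return the codons of the first ATG...stop ORF (stop excluded), or [] if none."""
    orf = []
    for codon in codon_list:
        if not orf:
            if codon == START_CODON:
                orf = [codon]
        elif codon in STOP_CODONS:
            return orf
        else:
            orf.append(codon)
    return []  # no start codon, or no stop codon closing the ORF


if __name__ == "__main__":
    for label, frame_codons in six_frames("CATGAAATGA").items():
        print(label, frame_codons, first_orf(frame_codons))
```

Reading the same ten bases in frame +2 yields the codons ATG, AAA, TGA, whereas frames +1 and +3
contain no complete ORF, which is why a gene finder has to consider every frame on both strands.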
4.0 What is Gene Finding?
- Generate predictions of gene locations from the primary genomic sequence by computational
  means.
- Identify coding regions within stretches of DNA. Of most interest are protein-coding regions,
  the parts of genes that are not only transcribed into RNA but also translated into protein.
The task of gene finding is therefore basically to separate the coding regions, the non-coding
regions and the intergenic regions. The problem of gene finding can be formulated as follows:
Input: A sequence of DNA, X = x1 x2 x3 … xn, where each xi belongs to {A, C, G, T} and n is a
positive integer.
Output: A correct labeling of each element of X as belonging to a coding region, a non-coding
region or an intergenic region.
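As a rough illustration of this formulation, the sketch below represents the output as one label
per base of X. The interval format (start, end, label), the default label and the function name
are our own assumptions and are not part of the problem statement. We also assume that
"non-coding" refers to the non-coding parts inside a gene, such as introns and UTRs.

```python
# Minimal sketch of the input/output formulation: one label per base of X.
VALID_BASES = set("ACGT")
LABELS = ("coding", "non-coding", "intergenic")


def label_sequence(x, regions, default="intergenic"):
    """Label every position of x.

    regions is a list of (start, end, label) tuples using 0-based, half-open
    coordinates (an assumed convention); unlisted positions get the default label.
    """
    if not x or set(x) - VALID_BASES:
        raise ValueError("X must be a non-empty string over {A, C, G, T}")
    labels = [default] * len(x)
    for start, end, label in regions:
        if label not in LABELS:
            raise ValueError(f"unknown label: {label}")
        if not 0 <= start < end <= len(x):
            raise ValueError(f"region out of range: {(start, end)}")
        labels[start:end] = [label] * (end - start)
    return labels


if __name__ == "__main__":
    x = "ACGTACGTAC"
    for base, label in zip(x, label_sequence(x, [(2, 5, "coding"), (5, 7, "non-coding")])):
        print(base, label)
```

A gene-finding program can then be viewed as a function that produces such a labeling from X
alone, and its quality can be judged by comparing its labels with a trusted annotation.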
Please visit our website on http://soctf-proj-035/bioinfo
5.0 Gene Finding Problems
- DNA sequence signals have low information content (they are degenerate and highly unspecific).
- It is difficult to discriminate real signals (transcriptional, translational and splicing
  signals) from spurious ones.
- Sequencing errors.
- Not all organisms use the same genetic code. Variations occur mainly in mitochondrial and
  plastid genomes, but also in some nuclear genomes of protists and yeast.
- Overlapping genes: especially common in viruses. Genes may be encoded in different reading
  frames on the same or on the complementary DNA strand.
- Challenges in finding genes in prokaryotes
  o Higher gene density.
  o Absence of introns in their protein-coding genes.
  o Most ORFs longer than some reasonable threshold are likely to be genes.
  o Primary difficulties:
    - More than one protein-coding region per mRNA.
    - Very small genes will be missed.
    - The occurrence of overlapping long ORFs on opposite DNA strands often leads to
      ambiguities.
- Challenges in finding genes in eukaryotes
  o Intron-exon structure of genes, including alternative splicing.
  o RNA editing: known to occur in plant mitochondria and plastids, as well as in the
    C. elegans nuclear genome.
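The prokaryotic difficulties listed above (a length threshold that misses small genes, and long
ORFs that overlap on opposite strands) can be seen in a minimal long-ORF scan. The sketch below
assumes the standard start and stop codons, reports coordinates on the forward strand as 0-based
half-open intervals, and uses a default threshold of 100 codons that is purely illustrative.

```python
# Minimal sketch: scan both strands for ATG...stop ORFs above a length threshold.
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_strand(seq, min_codons):
    """Return (start, end) of ORFs with at least min_codons codons before the stop."""
    found = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None:
                if codon == "ATG":
                    start = i
            elif codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    found.append((start, i + 3))  # end includes the stop codon
                start = None
    return found


def find_long_orfs(seq, min_codons=100):
    """Return sorted (start, end, strand) hits in forward-strand coordinates."""
    n = len(seq)
    hits = [(s, e, "+") for s, e in orfs_in_strand(seq, min_codons)]
    # Map reverse-strand positions back onto the forward strand.
    hits += [(n - e, n - s, "-")
             for s, e in orfs_in_strand(reverse_complement(seq), min_codons)]
    return sorted(hits)


def opposite_strand_overlaps(hits):
    """Return pairs of hits on different strands whose intervals overlap."""
    pairs = []
    for i, a in enumerate(hits):
        for b in hits[i + 1:]:
            if a[2] != b[2] and a[0] < b[1] and b[0] < a[1]:
                pairs.append((a, b))
    return pairs


if __name__ == "__main__":
    print(find_long_orfs("ATGAAACCCTAG", min_codons=3))
    print(find_long_orfs("ATGAAACCCTAG", min_codons=4))  # the small ORF is missed
```

Raising min_codons removes spurious short ORFs but also discards genuine small genes, and any pair
returned by opposite_strand_overlaps needs further evidence to decide which strand is coding.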
6.0 Gene Finding Strategies
- A combinatorial approach, relying on a number of methods, can be used to increase the
  confidence with which gene structure is predicted.
- One has to decide when and how each particular method should be applied.
- Gene finding strategies can be grouped into 3 major categories [2]:
1. Content-based methods
   o Rely on the overall, bulk properties of a sequence in making a determination.
   o Characteristics considered:
     - How often particular codons are used
     - Periodicity of repeats
     - Compositional complexity of the sequence
2. Site-based methods
   o Rely on the presence or absence of a specific sequence, pattern or consensus.
   o Used to detect features such as donor and acceptor splice sites, binding sites for
     transcription factors, polyA tracts, and start and stop codons.
3. Comparative methods
   o Make determinations based on sequence homology.
   o Translated sequences are subjected to database searches against protein sequences to
     determine whether a previously characterized coding region corresponds to a region in
     the query sequence.
   o More restrictive, because most newly discovered genes do not have gene products that
     match anything in the protein databases.
   o The modular nature of proteins and the fact that there are only a limited number of
     protein motifs make it difficult to predict anything more than just exonic regions with
     this method.
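As one possible illustration of a content-based measure, the sketch below scores a window by its
codon usage: it compares codon frequencies estimated from known coding sequences with those of a
background set and sums the log-odds. The training sequences, the pseudocount and the function
names are hypothetical, and the input is assumed to contain only A, C, G and T. Real programs
combine such a score with other evidence.

```python
# Minimal sketch of a content-based score: codon-usage log-odds (illustrative only).
import math
from collections import Counter

ALL_CODONS = [a + b + c for a in "ACGT" for b in "ACGT" for c in "ACGT"]


def codon_counts(seqs):
    """Count in-frame codons (frame 0) over a collection of sequences."""
    counts = Counter()
    for s in seqs:
        for i in range(0, len(s) - 2, 3):
            counts[s[i:i + 3]] += 1
    return counts


def codon_frequencies(seqs, pseudocount=1.0):
    """Estimate codon frequencies; the pseudocount avoids zero probabilities."""
    counts = codon_counts(seqs)
    total = sum(counts[c] for c in ALL_CODONS) + pseudocount * len(ALL_CODONS)
    return {c: (counts[c] + pseudocount) / total for c in ALL_CODONS}


def coding_score(window, coding_freq, background_freq):
    """Sum of per-codon log-odds; positive values favour the coding model."""
    score = 0.0
    for i in range(0, len(window) - 2, 3):
        codon = window[i:i + 3]
        score += math.log(coding_freq[codon] / background_freq[codon])
    return score


if __name__ == "__main__":
    # Hypothetical toy training sets, far too small for real use.
    coding = codon_frequencies(["ATGGCTGCTGCTTAA"])
    background = codon_frequencies(["ATATATATATATATA"])
    print(coding_score("GCTGCT", coding, background))
    print(coding_score("ATATAT", coding, background))
```

Sliding such a window along each reading frame gives a coding-potential profile. The same idea
extends to the other bulk properties listed above, such as periodicity or compositional
complexity.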
7.0 Gene Finding Algorithms
- Hidden Markov Models (HMM)
  o A stochastic model that captures the statistical properties of observed real-world data.
  o An HMM describes a process in which some of the details are unknown, or hidden.
  o Henderson et al. [3] have described a new HMM system for segmenting uncharacterized human
    genomic DNA into exons, introns and intergenic regions. Three separate models were
    designed, one for each of the three types of human DNA (exons, introns and intergenic
    regions).
  o A brief review of the basics of HMMs is given in [4]. It presents a brief mathematical
    formulation of the HMM in the context of computational biology and reports an experiment
    on finding genes in human genomic DNA with the Baum-Welch and CML algorithms.
  o Critical comments on HMM:
    - It cannot predict the boundaries of exons accurately.
    - Modeling with an HMM is complicated in the sense that it requires good biological
      knowledge for selecting the states and transitions. An incorrect model gives rise to
      wrong predictions about the gene. Model selection is the most basic and fundamental
      building block of an HMM.
    - As an HMM consists of many unknown parameters, training the model is a time-consuming
      process and a good amount of training data is required. To make the process efficient,
      careful selection is recommended so that the model is sufficiently effective.
    - It is very difficult to understand the inter-state behavior of an HMM.
- Bayesian Model
  o In [5], Crowley et al. have described a statistical model for locating regulatory regions
    in genomic DNA. In addition to genes, chromosomal DNA contains sequences that serve as
    signals for turning gene expression on and off. These regions are called control regions.
    In this work the authors tried to develop an automatic Bayesian model for finding those
    regulatory regions in genomic DNA.
  o In [1], Pavlovic et al. have proposed a Bayesian network framework for combining gene
    predictions from multiple gene-finding systems using Hidden Input/Output Markov models.
    Improved prediction accuracy is the strength of the developed system, which is known as
    GeneHacker Plus.
- Linear discriminant analysis [6]
  o Applied to predict 5'-terminal, internal and 3'-terminal exons (coding exons) and introns
    in a fast way.
  o Shortcoming: the fixed threshold and the relatively simple algorithm limit adaptability
    and precision.
- Sequence similarity search, or lookup method [7]
  o Based on the concept that high similarity between 2 sequences implies similar function or
    similar 3D structure.
  o The method is used not only for gene detection, but also for detailed prediction of the
    exon–intron structure.
  o Shortcoming: it assumes that a related protein is already known, and the comparison is
    computationally very costly.
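To make the HMM idea from the first item above more concrete, the sketch below decodes the most
probable hidden-state path with the Viterbi algorithm. It is a deliberately simplified two-state
toy model (a GC-rich "exon" state and an AT-rich "intergenic" state) with made-up probabilities.
It is not the model of Henderson et al. [3], and a realistic gene finder would need many more
states and parameters estimated from training data.

```python
# Minimal sketch: Viterbi decoding for a toy two-state HMM (all parameters hypothetical).
import math

STATES = ("exon", "intergenic")
START_P = {"exon": 0.5, "intergenic": 0.5}
TRANS_P = {
    "exon": {"exon": 0.9, "intergenic": 0.1},
    "intergenic": {"exon": 0.1, "intergenic": 0.9},
}
EMIT_P = {
    "exon": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},
    "intergenic": {"A": 0.4, "C": 0.1, "G": 0.1, "T": 0.4},
}


def viterbi(obs, states=STATES, start_p=START_P, trans_p=TRANS_P, emit_p=EMIT_P):
    """Return the most probable state path for obs, computed in log space."""
    if not obs:
        return []
    scores = {s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}
    back = []  # back[t][s] = best predecessor of state s at position t + 1
    for symbol in obs[1:]:
        new_scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: scores[p] + math.log(trans_p[p][s]))
            pointers[s] = prev
            new_scores[s] = (scores[prev] + math.log(trans_p[prev][s])
                             + math.log(emit_p[s][symbol]))
        scores = new_scores
        back.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: scores[s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


if __name__ == "__main__":
    seq = "AAAAAAGGGGGG"
    for base, state in zip(seq, viterbi(seq)):
        print(base, state)
```

The sketch also hints at the critical comments above: the position of the predicted boundary
depends entirely on the chosen states and probabilities, which in practice must be selected with
biological knowledge and trained on a sufficient amount of data.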
8.0 Some Examples of Gene Finding Programs
1. GRAIL (Gene Recognition and Analysis Internet Link)
2. FGENEH/FGENES
3. MZEF (Michael Zhang’s Exon Finder)
4. GENSCAN
5. PROCRUSTES
6. GeneID
7. GeneParser
8. HMMgene
9.0 Strategies and Considerations for Gene-finding Programs
- Several major drawbacks that most gene identification programs share:
  o Most of these methods are trained on test data, so they work best at finding genes that
    are most similar to those in the training sets.
  o The methods have an absolute requirement to predict both a discrete beginning and an end
    of a gene, meaning that they may miscall a region that consists of either a partial gene
    or multiple genes.
  o The importance given to each individual factor in deciding whether a stretch of sequence
    is an intron or an exon can also influence outcomes, as the weighting of each criterion
    may be either biased or incorrect.
  o There is the unusual case of genes that are transcribed but not translated (so-called
    “noncoding RNA genes”).
- Some suggestions for a gene identification effort:
  o For incompletely assembled sequence contigs, use MZEF.
  o For nearly finished or finished data, use GENSCAN or HMMgene.
  o Users should supplement these predictions with results from at least one other predictive
    method, as consistency among methods can be used as a qualitative measure of the
    robustness of the results.
  o The use of comparative search methods, such as BLAST or FASTA, should be considered an
    absolute requirement, with users targeting both dbEST and the protein databases for
    homology-based clues.
  o PROCRUSTES should be used when some information regarding the putative gene product is
    known, particularly when the cloning efforts are part of a positional candidate strategy.
  o A combinatorial approach can be developed, e.g. GeneMachine.
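The suggestion to use consistency among methods as a qualitative measure of robustness can be
sketched as a per-base vote over the outputs of several programs. The label alphabet (E for exon,
I for intron, N for intergenic), the "?" marker for unresolved positions and the example
predictions are all hypothetical. Real programs report exon coordinates that would first have to
be converted into such per-base labels.

```python
# Minimal sketch: per-base consensus over the outputs of several gene finders.
from collections import Counter


def consensus_labels(predictions, min_agreement=2):
    """Return the majority label per position, or '?' when too few programs agree."""
    if not predictions:
        return []
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all predictions must cover the same sequence length")
    result = []
    for column in zip(*predictions):
        label, votes = Counter(column).most_common(1)[0]
        result.append(label if votes >= min_agreement else "?")
    return result


def agreement_fraction(predictions):
    """Fraction of positions on which every program gives the same label."""
    columns = list(zip(*predictions))
    if not columns:
        return 0.0
    return sum(len(set(c)) == 1 for c in columns) / len(columns)


if __name__ == "__main__":
    # Hypothetical outputs of three programs for the same six bases.
    program_a = "EEEIIN"
    program_b = "EEIIIN"
    program_c = "EENIIN"
    runs = [program_a, program_b, program_c]
    print("".join(consensus_labels(runs)))
    print(agreement_fraction(runs))
```

Positions marked "?" are the ones where the programs disagree and where additional evidence, such
as a BLAST hit against dbEST or the protein databases, would be most useful.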
10.0 Open Problems and Future Directions
Existing gene-finding programs still have several important limitations [3]:
1. Most programs only predict protein coding genes and not genes whose products function
exclusively at the RNA level.
2. No current method can deal effectively with overlapping genes in eukaryotes and
prediction of multiple genes in a single sequence is still difficult.
3. The problem of multiple protein products that correspond to a single gene through
alternative splicing, alternative transcription and/or alternative translation has not yet
been dealt with effectively.
4. Development of improved methods for identifying the promoter regions is an important
challenge for the next several years.
Recommended Readings
Papers and search results related to gene finding are compiled and listed on our project website.
The website has 4 sections:
- Paper list: papers related to gene finding
- General concepts on gene finding: material and lecture notes on gene finding
- Drafts: our draft reports and summaries of the papers in the paper list
- Useful links: useful links about gene finding
Please visit our project website to learn more about gene finding:
http://soctf-proj-035/bioinfo
References
[1] Pavlovic V, Garg A and Kasif S: A Bayesian Framework for Combining Gene Predictions.
Bioinformatics, 2001.
[2] Baxevanis AD: Predictive methods using DNA sequences. Bioinformatics: A Practical Guide to the
Analysis of Genes and Proteins, 2001.
[3] Henderson J, Salzberg S and Fasman KH: Finding Genes in Human DNA with a Hidden Markov
Model. J Comput Biol 1997, 4: 127-141.
[4] Smith K: Hidden Markov Models in Bioinformatics with Application to Gene Finding in Human DNA.
Machine Learning Project 2002: 308-761
[5] Burge CB and Karlin S: Finding the genes in genomic DNA. Curr. Opin. Struct. Biol. 1998, 346-354.
[6] Chen T, Zhang MQ: Pombe: A gene-finding and exon-intron structure prediction system for fission
yeast. Yeast Vol. 14 1998: 701-710
[7] Gelfand MS, Mironov AA, Pevzner PA: Gene recognition via spliced sequence alignment. Proc Natl
Acad Sci USA 1996: 9061-9066
[8] Rogic S, Mackworth AK and Ouellette F: Evaluation of Gene-Finding Program on Mammalian
Sequences. Genome Res Vol. 11 Issue 5 2001: 817-832