Overview of Gene Finding

CS5238 Combinatorial methods in bioinformatics
Gene Finding (Draft Report, Aug 2002)

1.0 Team Objective

The objective of this first report is to give an overall review of gene-finding topics and issues. We highlight some current issues and open problems in this area. In coming reports, our team hopes to extend this survey and to research some of the noteworthy open problems.

2.0 Project Motivation

The size of GenBank is expected to exceed 3 billion base pairs upon the completion of the Human Genome Project (HGP) in 2003. Roughly 90% of the human genome is non-coding, that is, it is not a template for a protein. Because of the size of the database, manually searching for the genes that do code for proteins is not practical. Methods that help biologists focus their search are needed. This is where gene finding comes into the picture.

Gene identification and gene discovery in new genomic sequences is one of the most time-consuming computational problems addressed by bioinformatics scientists [1]. The accuracy of computational gene-finding methods has improved significantly in the past few decades. Close-to-accurate predictions of gene structures over an extended genomic region can now be made before more detailed experimental studies are carried out.

3.0 Biological Background

- Gene expression: the biological process by which a DNA sequence generates a protein (transcription and translation).
- Transcription: the process that produces an mRNA sequence from the DNA sequence. The enzyme performing the transcription is RNA polymerase.
- Translation: the process that synthesizes the protein from the mRNA with the help of ribosomes.
- Untranslated regions (UTR): the regions at both ends of the DNA coding region that are transcribed into mRNA but do not code for the protein.
- Promoter region: a DNA sequence to which RNA polymerase binds prior to initiation of transcription. It is usually found just upstream of the transcription start site of a gene. It controls the rate of gene expression.
- Reading frames: the 3 different ways to interpret a DNA sequence on a given strand, depending on where the coding starts.
- Open reading frame (ORF): a sequence of consecutive codons that contains no stop codon.

4.0 What is Gene Finding?

- Generate predictions of gene locations from primary genomic sequence by computational means.
- Identify coding regions within stretches of DNA. Of most interest are protein-coding regions: the parts of genes that are not only transcribed into RNA but also translated into protein.

The task of gene finding is therefore basically to separate coding regions, non-coding regions and intergenic regions. The problem can be formulated as follows:

Input: A sequence of DNA, X = x1 x2 x3 … xn, where each xi belongs to {A, C, G, T} and n is a positive integer.
Output: A correct labeling of each element of X as belonging to a coding region, a non-coding region or an intergenic region.

Please visit our website on http://soctf-proj-035/bioinfo

5.0 Gene Finding Problems

- DNA sequence signals have low information content (they are degenerate and highly unspecific).
- It is difficult to discriminate real signals (transcriptional signals, translational signals and splicing signals) from spurious matches.
- Sequencing errors.
- Not all organisms use the same genetic code. Variations occur mainly in mitochondrial and plastid genomes, but also in some nuclear genomes of protists and yeast.
- Overlapping genes: especially common in viruses. Genes may be encoded in different reading frames on the same or on the complementary DNA strand.
- Challenges in finding genes in prokaryotes
  o Higher gene density.
  o Absence of introns in their protein-coding genes.
  o Most ORFs longer than some reasonable threshold are likely to be genes.
  o Primary difficulties:
    * More than one protein-coding region per mRNA.
    * Very small genes will be missed.
    * The occurrence of overlapping long ORFs on opposite DNA strands often leads to ambiguities.
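The notions of reading frame and ORF defined in the biological background, and the ambiguity caused by long ORFs on opposite strands, can be illustrated with a minimal Python sketch. It scans the three frames of each strand for maximal stop-free codon runs. This is only an illustration under stated assumptions, not a gene finder: it assumes the stop codons of the standard genetic code (which, as noted above, not all organisms share), and the names `find_orfs`, `stop_free_runs` and the `min_codons` threshold are hypothetical.

```python
# Assumption: stop codons of the standard genetic code; other codes differ.
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the complementary strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def stop_free_runs(seq, min_codons):
    """Yield (frame, start, end) for maximal stop-free codon runs.

    Only the three forward frames of `seq` are scanned; `start` and `end`
    are 0-based offsets into `seq`, with `end` exclusive.
    """
    for frame in range(3):
        run_start = frame
        pos = frame
        while pos + 3 <= len(seq):
            if seq[pos:pos + 3] in STOP_CODONS:
                if (pos - run_start) // 3 >= min_codons:
                    yield frame, run_start, pos
                run_start = pos + 3  # next run begins after the stop codon
            pos += 3
        # Run that reaches the end of the sequence without meeting a stop.
        if (pos - run_start) // 3 >= min_codons:
            yield frame, run_start, pos


def find_orfs(seq, min_codons=2):
    """List stop-free runs of at least `min_codons` codons in all six frames.

    Coordinates on the "-" strand refer to the reverse-complemented sequence.
    """
    seq = seq.upper()
    results = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame, start, end in stop_free_runs(s, min_codons):
            results.append((strand, frame, start, end, s[start:end]))
    return results


if __name__ == "__main__":
    for orf in find_orfs("ATGAAATAGCCC"):
        print(orf)
```

With the toy input above and `min_codons=4`, the only run reported is the stop-free reverse strand, which shows how a threshold filters candidates and why the threshold also causes very small genes to be missed.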
- Challenges in finding genes in eukaryotes
  o Intron-exon structure of genes, including different splicing.
  o RNA editing: known to occur in plant mitochondria and plastids, as well as in the C. elegans nuclear genome.

6.0 Gene Finding Strategies

- A combinatorial approach, relying on a number of methods, can be used to increase the confidence with which gene structure is predicted.
- One has to decide when and how each particular method should be applied.
- Gene finding strategies can be grouped into 3 major categories [2]:
  1. Content-based methods
     o Rely on the overall, bulk properties of a sequence in making a determination.
     o Characteristics considered:
       * How often particular codons are used
       * Periodicity of repeats
       * Compositional complexity of the sequence
  2. Site-based methods
     o Rely on the presence or absence of a specific sequence, pattern or consensus.
     o Used to detect features such as donor and acceptor splice sites, binding sites for transcription factors, polyA tracts, and start and stop codons.
  3. Comparative methods
     o Make determinations based on sequence homology.
     o Translated sequences are subjected to database searches against protein sequences to determine whether a previously characterized coding region corresponds to a region in the query sequence.
     o More restrictive, because most newly discovered genes do not have gene products that match anything in the protein databases.
     o The modular nature of proteins and the fact that there are only a limited number of protein motifs make it difficult to predict anything more than just exonic regions using this method.

7.0 Gene Finding Algorithms

- Hidden Markov Models (HMM)
  o A stochastic model that captures the statistical properties of observed real-world data.
  o An HMM describes a process in which some of the details are unknown, or hidden.
  o Henderson et al. [3] have described a new HMM system for segmenting uncharacterized human genome DNA into exons, introns and intergenic regions. Three separate models were designed, one for each of the three types of human DNA (exons, introns and intergenic regions).
  o A brief review of the basics of HMMs is given in [4]. It presents a brief mathematical formulation of the HMM (described above, paragraph 1) in the context of computational biology and reports an experiment on finding genes in human genomic DNA with the Baum-Welch and CML algorithms.
  o Critical comments on HMM:
    * Cannot predict accurate boundaries of exons.
    * Modeling with an HMM is complicated in the sense that it requires good biological knowledge for selecting states and transitions. An incorrect model gives rise to wrong predictions about the gene. Model selection is the most basic and fundamental building block of an HMM.
    * As an HMM consists of many unknown parameters, training the model is a time-consuming process and a good amount of training data is required. To make the process efficient, careful selection is recommended so that the model is sufficiently effective.
    * It is very difficult to understand the inter-state behavior of an HMM.
- Bayesian Model
  o In [5], Crowley et al. have described a statistical model for locating regulatory regions in genomic DNA. In addition to genes, chromosomal DNA contains sequences that serve as signals for turning gene expression on and off. These regions are called control regions. In this work the authors tried to develop an automatic Bayesian model for finding those regulatory regions in genomic DNA.
  o In [1], Pavlovic et al. have proposed a Bayesian network framework for combining gene predictions from multiple gene-finding systems using Hidden Input/Output Markov models. Improved prediction accuracy is the strength of the developed system, which is known as GeneHacker Plus.
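To make the HMM approach described above more concrete, the following Python sketch labels each base of a sequence with the most probable hidden state using the Viterbi algorithm. It is a toy illustration only, not the system of [3] or [4]: the two states, the start, transition and emission probabilities, and the function name `viterbi` are all hypothetical values chosen for the example, whereas real gene finders use many more states and parameters trained from data.

```python
import math

# Hypothetical two-state model: every number below is made up for illustration.
STATES = ("coding", "noncoding")
START = {"coding": 0.5, "noncoding": 0.5}
TRANS = {
    "coding": {"coding": 0.9, "noncoding": 0.1},
    "noncoding": {"coding": 0.1, "noncoding": 0.9},
}
EMIT = {
    "coding": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
    "noncoding": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
}


def viterbi(seq):
    """Return the most probable state label for each base of `seq`."""
    if not seq:
        return []
    # score[s]: best log-probability of any state path ending in s so far.
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_score, pointer = {}, {}
        for s in STATES:
            # Best predecessor state for arriving in s at this position.
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            pointer[s] = prev
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][base]))
        backpointers.append(pointer)
        score = new_score
    # Trace back from the best final state to recover the full labeling.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointer in reversed(backpointers):
        state = pointer[state]
        path.append(state)
    path.reverse()
    return path


if __name__ == "__main__":
    seq = "ATATATGCGCGCGC"
    for base, label in zip(seq, viterbi(seq)):
        print(base, label)
```

Log-probabilities are used so that long sequences do not underflow. The output is exactly a labeling of each element of the input sequence, which is the form of output asked for in the problem formulation in section 4.0, here reduced to two labels.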
- Linear discriminant analysis [6]:
  o Applied to predict 5'-terminal, internal and 3'-terminal exons (coding exons) and introns in a fast way.
  o Shortcoming: the fixed threshold and the relatively simple algorithm hurt adaptability and precision.
- Sequence similarity search, or lookup method [7]:
  o Based on the concept that high similarity between 2 sequences implies similar function or similar 3D structure.
  o The method is used not only for gene detection, but also for detailed prediction of the exon-intron structure.
  o Shortcoming: it assumes that a related protein is already known, and the comparison method is very costly.

8.0 Some Examples of Gene Finding Programs

1. GRAIL (Gene Recognition and Analysis Internet Link)
2. FGENEH/FGENES
3. MZEF (Michael Zhang’s Exon Finder)
4. GENSCAN
5. PROCRUSTES
6. GeneID
7. GeneParser
8. HMMgene

9.0 Strategies and Considerations for Gene-finding Program

- Several major drawbacks that most gene identification programs share:
  o Most of these methods are trained on test data, so they work best in finding genes most similar to those in the training sets.
  o The methods have an absolute requirement to predict both a discrete beginning and an end to a gene, meaning that they may miscall a region that consists of either a partial gene or multiple genes.
  o The importance given to each individual factor in deciding whether a stretch of sequence is an intron or an exon can also influence outcomes, as the weighting of each criterion may be either biased or incorrect.
  o There is the unusual case of genes that are transcribed but not translated (so-called “noncoding RNA genes”).
- Some suggestions for a gene identification effort:
  o Incompletely assembled sequence contigs: MZEF.
  o Nearly finished or finished data: GENSCAN or HMMgene.
  o Users should supplement these predictions with results from at least one other predictive method, as consistency among methods can be used as a qualitative measure of the robustness of the results.
  o Utilization of comparative search methods, such as BLAST or FASTA, should be considered an absolute requirement, with users targeting both dbEST and the protein databases for homology-based clues.
  o PROCRUSTES should be used when some information regarding the putative gene product is known, particularly when the cloning efforts are part of a positional candidate strategy.
  o A combinatorial approach can be developed, e.g. GeneMachine.

10.0 Open Problems and Future Directions

Existing gene-finding programs still have several important limitations [3]:

1. Most programs only predict protein-coding genes and not genes whose products function exclusively at the RNA level.
2. No current method can deal effectively with overlapping genes in eukaryotes, and prediction of multiple genes in a single sequence is still difficult.
3. The problem of multiple protein products that correspond to a single gene through alternative splicing, alternative transcription and/or alternative translation has not yet been dealt with effectively.
4. Development of improved methods for identifying promoter regions is an important challenge for the next several years.

Recommended Readings

Papers and search results related to gene finding are compiled and listed on our project website. There are 4 sections on the website:

- Paper list: papers related to gene finding
- General concept on gene finding: some material and lecture notes on gene finding
- Drafts: our draft report and summaries of the papers in the paper list
- Useful links: useful links about gene finding

Please visit our project website to know more about gene finding: http://soctf-proj-035/bioinfo

References

[1] Pavlovic V, Garg A and Kasif S: A Bayesian Framework for Combining Gene Predictions. Bioinformatics, 2001.
[2] Andreas DB: Predictive methods using DNA sequences. Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins, 2001.
[3] Henderson J, Salzberg S and Fasman KH: Finding Genes in Human DNA with a Hidden Markov Model. J Comput Biol 1997, 4: 127-141.
[4] Smith K: Hidden Markov Models in Bioinformatics with Application to Gene Finding in Human DNA. Machine Learning Project 2002: 308-761.
[5] Burge CB and Karlin S: Finding the genes in genomic DNA. Curr. Opin. Struct. Biol. 1998, 346-354.
[6] Chen T and Zhang MQ: Pombe: A gene-finding and exon-intron structure prediction system for fission yeast. Yeast Vol. 14 1998: 701-710.
[7] Gelfand MS, Mironov AA and Pevzner PA: Gene recognition via spliced sequence alignment. Proc Natl Acad Sci USA 1996: 9061-9066.
[8] Rogic S, Mackworth AK and Ouellette F: Evaluation of Gene-Finding Program on Mammalian Sequences. Genome Res Vol. 11 Issue 5 2001: 817-832.
