Genomics: Gene Prediction and Annotation

Kishor K. Shende
Information Officer, Bioinformatics Center, Barkatullah University, Bhopal

Gene Prediction Strategies

Prokaryotic Gene Architecture
[Figure: a promoter with its -35 and -10 elements, an initiation codon (ATG), the coding regions for Protein 1, Protein 2 and Protein 3, and a termination codon (TAA, TAG or TGA).]

Eukaryotic Gene Architecture
[Figure: gene regulatory sequences, an initiation codon (ATG), Exon-1, Intron-1 with its splicing sites, Exon-2, and a termination codon (TAA, TAG or TGA).]

Codon Usage Tables
- Each amino acid can be encoded by several codons.
- Each organism has a characteristic pattern of codon usage.

Problems in Gene Prediction
- Distinguishing pseudogenes from genes.
- Exon-intron structure in eukaryotes: exon flanking regions are not well conserved.
- Alternative splicing (shuffling of exons).
- Genes can overlap each other and occur on different strands of the DNA.

Gene Identification
1. Homology-based gene prediction
   - Sequence similarity search against gene databases using the BLAST and FASTA search tools
   - EST (Expressed Sequence Tag) similarity search
2. Ab initio gene prediction
   - Prokaryotes: ORF finding
   - Eukaryotes: promoter prediction, start and stop codon prediction, splice site prediction (exon-intron and intron-exon), polyA signal prediction

ORF Finding in Prokaryotes
Easier because:
- Small genomes have high gene density (Haemophilus influenzae: 85% genic).
- There are no introns, or only a few.
- Operons: one transcript, many genes.

Open Reading Frame (ORF): a contiguous set of codons that starts with a Met codon (ATG) and ends with a stop codon.
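The ORF definition above (a run of codons from ATG to the first in-frame stop codon) can be sketched in a few lines of Python. This is only a minimal illustration, not the method of any particular gene finder: the function names `reverse_complement` and `find_orfs` and the `min_codons` threshold are assumptions made for the example. Because genes can lie on either strand, the sketch scans three frames on the given strand and three on its reverse complement.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A<->T, C<->G)."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=2):
    """List ORFs as (strand, frame, start, end, orf_sequence).

    start and end are 0-based offsets on the scanned strand, end is
    exclusive and the stop codon is included. min_codons (a hypothetical
    cutoff) counts the codons before the stop codon, ATG included.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None  # position of the first ATG still waiting for a stop
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == START_CODON:
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3, s[start:i + 3]))
                    start = None
    return orfs


if __name__ == "__main__":
    # Toy sequence made up for the example.
    for orf in find_orfs("CCATGAAATAGGG"):
        print(orf)
```

In practice `min_codons` would be set much higher, since short ORFs occur by chance in any sequence.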
ORF Finding: Simplest Method
An ORF is a length of DNA sequence that contains a contiguous set of codons, each of which specifies an amino acid. There are six possible reading frames: three on the sense strand and three on the antisense strand.

[Figure: the sequence ATGCCATCAGTGCCATTGTA with its sense and antisense strands, the start codon marked, and reading-frame positions 1, 2 and 3 on each strand; a Central Dogma panel shows DNA, mRNA and protein.]

ORF Prediction: Based on the Position of the Start Codon and Stop Codon
- Start codon: AUG
- Stop codon: UGA, UAA or UAG
- The ORF between the start codon and the stop codon is the protein-coding region.
- A reading frame does not code for a protein when it contains many in-frame stop codons.

Example of ORF
There are six possible reading frames in each sequence, three for each direction of transcription.

Difficulties in ORF Prediction
1. Prokaryotes and viruses: multiple genes may be present on one mRNA, and genes may overlap, with two different proteins encoded in different reading frames of the same mRNA.
2. Eukaryotes: a protein-coding region (exon) is followed by a non-coding region (intron).
3. Differential mRNA splicing creates different mRNAs, and hence different proteins.
4. The genetic code of some organisms varies from the universal code.

Reliability of ORF Prediction: Characteristics of ORF Regions
1. An ordered list of specific codons that reflects the evolutionary origin of the gene and the constraints associated with gene expression.
2. A characteristic pattern of use of synonymous codons, i.e. codons that stand for the same amino acid.
3. In eukaryotes, strong preferences for codon pairs at intron-exon or exon-intron junctions.
4. Organisms with a high genomic GC content show a strong bias toward G and C in the third codon position.

Three Tests of an ORF
Three tests have been devised to verify that a predicted ORF is in fact likely to encode a protein.
- First test: based on an unusual type of sequence variation that is found in ORFs.
- Second test: the ORF is analyzed to determine whether its codons correspond to those used in other genes of the same organism.
- Third test: the ORF is translated into an amino acid sequence, and the resulting sequence is compared to the databases of existing sequences.

Repeated Sequence Elements and Nucleosome Structure
1. Eukaryotic DNA is wrapped around histone-protein complexes.
2. Some base pairs in the major or minor grooves of the DNA molecule face the nucleosome surface.
3. Other base pairs face the outside of the structure.
4. Nucleosomes located in promoter regions are remodeled in a manner that can make the binding sites for regulatory proteins more or less available.

Hidden Markov Model (HMM) of a Eukaryotic Internal Exon
Computational background: repeated sequence patterns have been found in the introns and exons and near the transcription start site of eukaryotic genes.
Bending pattern: bending is influenced by
1. A repeated pattern, i.e. (not T, A or G, G)
2. The AA/TT dinucleotide

Ab Initio Gene Prediction
Predictions are based on the observation that the DNA sequence of a gene is not random:
- A gene-coding sequence has start and stop codons.
- Each species has a characteristic pattern of synonymous codon usage.
- Non-coding ORFs are very short.
- A gene would correspond to the longest ORF.
These methods look for the characteristic features of genes and score them high.

Ab Initio Gene Prediction Methods
- GeneScan: Fourier transform of the DNA sequence to find characteristic patterns.
- GeneParser: predicts the most likely combination of exons and introns. Dynamic programming.
- GeneMark: mostly for prokaryotes, also for eukaryotes. Hidden Markov Models.
- Grail II: predicts exons, promoters and poly(A) sites. Neural network plus dynamic programming.

Gene Preference Score: An Important Indicator of a Coding Region
Observation: the frequencies of codons and codon pairs differ between coding and non-coding regions.
Given a sequence of codons $C = c_1 c_2 \dots c_n$ and assuming independence, the probability of finding $C$ in a coding region is
$P(C) = \prod_{i=1}^{n} f(c_i)$,
where $f(c_i)$ is the frequency of codon $c_i$ in coding regions. The probability of finding the sequence $C$ in non-coding regions is
$P_0(C) = \prod_{i=1}^{n} f_0(c_i)$,
where $f_0(c_i)$ is the frequency of codon $c_i$ in non-coding regions. The gene preference score is
$GPS = \log \frac{P(C)}{P_0(C)}$.

Confirming Gene Location Using EST Libraries
Expressed Sequence Tags (ESTs) are sequenced short segments of cDNA. They are organized in the database "UniGene". If a region matches ESTs with high statistical significance, then it is a gene or a pseudogene.

Gene Prediction Accuracy
- True positives (TP): nucleotides that are correctly predicted to be within the gene.
- Actual positives (AP): nucleotides that are located within the actual gene.
- Predicted positives (PP): nucleotides that are predicted to be in the gene.
Sensitivity = TP / AP
Specificity = TP / PP

Common Difficulties of Gene Prediction
- First and last exons are difficult to annotate because they contain UTRs.
- Smaller genes are not statistically significant, so they are thrown out.
- Algorithms are trained with sequences from known genes, which biases them against genes about which nothing is known.

Genome Analysis for Gene Prediction
- Genome: the sum of the genes and intergenic sequences of a haploid cell.
- The value of genome sequences lies in their annotation.
- Annotation: characterizing genomic features using computational and experimental methods.
- Genes, levels of annotation: gene prediction. Where are the genes? What do they encode? Which proteins and pathways are they involved in?

Flowchart: Gene Prediction Process
Input: genomic DNA sequence.
1. Translate in all six reading frames and compare to a protein sequence database.
2. Perform a database similarity search against the EST database of the same organism.
Try these first, using BLAST and FASTA, PSI-BLAST, PHI-BLAST and other BLAST/FASTA programs, and EST and cDNA database searches.
- Use a gene prediction program to locate genes (ORF finding; comparison with the genome of other organisms).
- Analyze the regulatory sequences in the gene (promoter, splicing sites, poly-A tail, 5' UTR, 3' UTR).

Practice: Gene Finding with Gene Finding Programs
1. GeneMark (http://exon.gatech.edu/GeneMark/)
2. Genscan (http://genes.mit.edu/GENSCAN.html)
3. Grail II (http://compbio.ornl.gov/Grail-1.3/)
4. Gene Finder in GlimmerM (http://www.tigr.org/tdb/glimmerm/glmr_form.html)

Other gene finding resources:
- HMMgene: prediction of genes in vertebrates and C. elegans
- Gene Discovery Page
- FramePlot: protein-coding region prediction tool for high GC-content bacteria
- tRNAscan-SE: search for transfer RNA genes in genomic sequence
- NETGENE: predicts splice sites in human genes
- ORF Finder
- BCM Gene Finder
- Grail
- Genemark
- Genie: a gene finder based on generalized Hidden Markov Models
- GENSCAN: predicts complete gene structures
- Splice Site Prediction by Neural Network
- Procrustes
- GenePrimer
- GenLang
- MZEF Gene Finder
- Webgene: tools for prediction and analysis of protein-coding gene structure
- MAR-Finder: nuclear matrix attachment region prediction
- Glimmer: bacterial/archaeal gene finder

Promoter Region, Transcription Factor and Signals
- TRANSFAC: transcription factor database
- TFD: Transcription Factor Database
- TransTerm: a translational signal database
- PLACE: a database of plant cis-acting regulatory DNA elements
- NNPP: promoter prediction by neural network
- FastM/ModelInspector
- TFSEARCH
- MatInd and MatInspector
- Transcription Element Search Software (TESS)
- CorePromoter (Core-Promoter Prediction Program)
- Gene Express: analysis of genomic regulatory sequences
- Signal Scan
- PromoterInspector
- Promoter Scan II
- Pol3scan
- TargetFinder: finds DNA-binding proteins
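Once one of the programs above has produced a set of predicted genes, the accuracy measures defined earlier (Sensitivity = TP / AP, Specificity = TP / PP) can be computed against a reference annotation. The following is a minimal sketch, assuming that genes are given as half-open (start, end) nucleotide intervals; the function names are illustrative and do not belong to any of the listed tools.

```python
def covered_positions(intervals):
    """Return the set of nucleotide positions covered by (start, end) intervals.

    Intervals are assumed to be 0-based and half-open: start <= pos < end.
    """
    positions = set()
    for start, end in intervals:
        positions.update(range(start, end))
    return positions


def nucleotide_accuracy(actual, predicted):
    """Return (sensitivity, specificity) at the nucleotide level.

    AP = nucleotides inside the actual genes
    PP = nucleotides inside the predicted genes
    TP = nucleotides inside both
    """
    ap = covered_positions(actual)
    pp = covered_positions(predicted)
    tp = len(ap & pp)
    sensitivity = tp / len(ap) if ap else 0.0
    specificity = tp / len(pp) if pp else 0.0
    return sensitivity, specificity


if __name__ == "__main__":
    # Made-up coordinates: the prediction overlaps half of the real gene.
    actual_genes = [(100, 200)]
    predicted_genes = [(150, 250)]
    sn, sp = nucleotide_accuracy(actual_genes, predicted_genes)
    print(f"Sensitivity = {sn:.2f}, Specificity = {sp:.2f}")
```

Storing positions in a set keeps the sketch short; for whole genomes an interval-based overlap calculation would be more economical.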
Overview: Gene Prediction Tools

GeneMark (http://exon.gatech.edu/GeneMark/)
Mark Borodovsky's Bioinformatics Group at the Georgia Institute of Technology, Atlanta, Georgia.

GeneMark.hmm for Prokaryotes (Version 2.4)
Reference: Lukashin A. and Borodovsky M., GeneMark.hmm: new solutions for gene finding, NAR, 1998, Vol. 26, No. 4, pp. 1107-1115.
For bacterial and archaeal gene prediction, you can use the parallel combination of the GeneMark and GeneMark.hmm programs.

Heuristic Approach for Gene Prediction in Prokaryotes
If the DNA sequence of interest belongs to a species whose name is not in the list of available models, use the Heuristic models option.

Self-Training Program of GeneMark
If the sequence is longer than 1 Mb, generate models with the self-training program GeneMarkS.

Gene Prediction in Eukaryotes
For eukaryotic gene prediction, use the parallel combination of GeneMark and GeneMark.hmm, and select the related organism from the list of available models.

Gene Prediction in EST and cDNA
To analyze ESTs and cDNAs.

Gene Prediction in Viruses
Viral gene prediction through the virus database "VIOLIN".

[Screenshots: GeneMark output]

New GENSCAN Web Server at MIT
[Screenshot: GENSCAN output]

GrailEXP
GrailEXP can:
1. Locate protein-coding genes within a DNA sequence
2. Locate EST/mRNA alignments
3. Locate certain types of promoters, polyadenylation sites, CpG islands, and repetitive elements

GrailEXP is a gene finder that combines:
1. An EST alignment utility
2. An exon prediction program
3. A promoter/polyA recognizer
4. A CpG island finder
5. A repeat masker

GrailEXP predicts exons, genes, promoters, polyA sites, CpG islands, EST similarities, and repetitive elements within a DNA sequence.

GlimmerM: http://www.tigr.org/tdb/glimmerm/glmr_form.html
A system for finding genes in microbial DNA, especially the genomes of bacteria and archaea. Glimmer (Gene Locator and Interpolated Markov Modeler) uses interpolated Markov models (IMMs) to identify the coding regions and distinguish them from noncoding DNA.
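To give a feel for how a Markov model can separate coding from noncoding DNA, here is a minimal Python sketch. It is a deliberate simplification and not Glimmer's algorithm: it uses a single fixed-order Markov chain with add-one smoothing, whereas an interpolated Markov model combines several orders, and that interpolation is not reproduced here. The function names, the default order of 2 and the training sequences are all assumptions made for the example.

```python
import math
from itertools import product

BASES = "ACGT"


def train_markov(sequences, order=2, pseudocount=1.0):
    """Estimate P(next base | previous `order` bases) from training sequences.

    A pseudocount is added to every cell so that no probability is zero.
    """
    contexts = ["".join(p) for p in product(BASES, repeat=order)]
    counts = {ctx: dict.fromkeys(BASES, 0.0) for ctx in contexts}
    for seq in sequences:
        seq = seq.upper()
        for i in range(order, len(seq)):
            ctx, base = seq[i - order:i], seq[i]
            if ctx in counts and base in counts[ctx]:  # skip N and other symbols
                counts[ctx][base] += 1
    model = {}
    for ctx, row in counts.items():
        total = sum(row.values()) + pseudocount * len(BASES)
        model[ctx] = {b: (row[b] + pseudocount) / total for b in BASES}
    return model


def log_odds(seq, coding_model, noncoding_model, order=2):
    """Sum of log(P_coding / P_noncoding) over the sequence.

    A positive score means the sequence looks more like the coding
    training set than the noncoding one.
    """
    seq = seq.upper()
    score = 0.0
    for i in range(order, len(seq)):
        ctx, base = seq[i - order:i], seq[i]
        if ctx in coding_model and base in coding_model[ctx]:
            score += math.log(coding_model[ctx][base] / noncoding_model[ctx][base])
    return score


if __name__ == "__main__":
    # Toy training data, made up for the example.
    coding = train_markov(["ATGGCGGCGGCGGCG"])
    noncoding = train_markov(["ATATATATATATAT"])
    print(log_odds("GCGGCGGCG", coding, noncoding))
    print(log_odds("ATATATAT", coding, noncoding))
```

The log-odds form mirrors the gene preference score idea: the same sequence is scored under a coding model and a noncoding model, and the logarithm of the ratio is reported.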
GlimmerHMM: for eukaryotic organisms.
GeneSplicer: a fast, flexible system for detecting splice sites in the genomic DNA of various eukaryotes.

[Screenshot: GlimmerM Gene Finder]
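As a very rough illustration of what splice site detection starts from, the sketch below lists candidate donor and acceptor sites using only the canonical dinucleotides (most introns begin with GT and end with AG). This is not how GeneSplicer works: real splice site detectors score the surrounding sequence with statistical models, and almost all of the candidates found this way are false positives. The function names and the `min_len` default are assumptions made for the example.

```python
import re


def splice_site_candidates(seq):
    """Return (donors, acceptors): start positions of every GT and every AG."""
    seq = seq.upper()
    # Zero-width lookahead finds overlapping occurrences as well.
    donors = [m.start() for m in re.finditer("(?=GT)", seq)]
    acceptors = [m.start() for m in re.finditer("(?=AG)", seq)]
    return donors, acceptors


def candidate_introns(seq, min_len=20, max_len=None):
    """Pair each GT with each downstream AG to form candidate introns.

    Returns (start, end) pairs such that seq[start:end] begins with GT
    and ends with AG. min_len is an illustrative cutoff, not a
    biological constant taken from these notes.
    """
    donors, acceptors = splice_site_candidates(seq)
    pairs = []
    for d in donors:
        for a in acceptors:
            length = a + 2 - d
            if a < d + 2 or length < min_len:
                continue
            if max_len is not None and length > max_len:
                continue
            pairs.append((d, a + 2))
    return pairs


if __name__ == "__main__":
    # Toy sequence made up for the example.
    toy = "CCGTAAAAAGCC"
    print(splice_site_candidates(toy))
    print(candidate_introns(toy, min_len=4))
```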