Gene Finding in Prokaryotes
Biological Motivation
Gene Finding
Anne R. Haake
Rhys Price Jones

Gene Finding: Why do it?
• Find and annotate all the genes within the large volume of DNA sequence data
  – How many genes in an organism? Homologies?
• Gain understanding of problems in basic science
  – e.g. gene regulation: what are the mechanisms involved in transcription, splicing, etc.?
• Different emphasis on these goals has some effect on the design of computational approaches for gene finding.

Gene Finding by Biological Methods
• Extract mRNA -> reverse transcribe to cDNA -> label cDNA -> probe a DNA library with the labeled cDNA -> gene found

Gene Finding by Computational Methods
• Dependent on good experimental data to build reliable predictive models
• Various aspects of gene structure/function provide information used in gene finding programs
[Figure 12.3]

The Informatics View of Genes
• Genes are character strings embedded in much larger strings called the genome
• Genes are composed of ordered elements associated with the fundamental genetic processes, including transcription, splicing, and translation.

Gene Finding
• Cells recognize genes from DNA sequence
  – find genes via their bioprocesses
• Not so easy for us:

CTAGCAGGGACCCCAGCGCCCGAGAGACCATGCAGAGGTCGCCT
CTGGAAAAGGCCAGCGTTGTCTCCAAACTTTTTTTCAGGTGAGA
AGGTGGCCAACCGAGCTTCGGAAAGACACGTGCCCACGAAAGAG
GAGGGCGTGTGTATGGGTTGGGTTGGGGTAAAGGAATAAGCAGT
TTTTAAAAAGATGCGCTATCATTCATTGTTTTGAAAGAAAATGT
GGGTATTGTAGAATAAAACAGAAAGCATTAAGAAGAGATGGAAG
AATGAACTGAAGCTGATTGAATAGAGAGCCACATCTACTTGCAA
CTGAAAAGTTAGAATCTCAAGACTCAAGTACGCTACTATGCACT
TGTTTTATTTCATTTTTCTAAGAAACTAAAAATACTTGTTAATA
AGTACCTANGTATGGTTTATTGGTTTTCCCCCTTCATGCCTTGG
ACACTTGATTGTCTTCTTGGCACATACAGGTGCCATGCCTGCAT
ATAGTAAGTGCTCAGAAAACATTTCTTGACTGAATTCAGCCAAC
AAAAATTTTGGGGTAGGTAGAAAATATATGCTTAAAGTATTTAT
TGTTATGAGACTGGATATAT...

Types of Genes
• Protein coding
  – most genes
• RNA genes
  – rRNA
  – tRNA
  – snRNA (small nuclear RNA)
  – snoRNA (small nucleolar RNA)

3 Major Categories of Information used in Gene Finding Programs
• Signals/features: a sequence pattern with functional significance, e.g. splice donor and acceptor sites, start and stop codons, promoter features such as TATA boxes, TF binding sites, CpG islands
• Content/composition: statistical properties of coding vs. non-coding regions
  – e.g. codon bias; length of ORFs in prokaryotes; GC content
• Similarity: compare the DNA sequence to known sequences in a database
  – Not only known proteins but also ESTs, cDNAs

Looking for Protein Coding Genes
• Look for an ORF (begins with a start codon, ends with a stop codon, no internal stops!)
  – long (usually > 60-100 aa)
  – more likely if homologous to a “known” protein
• Look for basal signals
  – Transcription, splicing, translation
• Look for regulatory signals
  – Depends on organism
    • Prokaryotes vs eukaryotes
    • Vertebrates vs fungi

Easier problem: Gene Finding in Bacterial Genomes
Why?
• Dense genomes
• Short intergenic regions
• Uninterrupted ORFs
• Conserved signals
• Abundant comparative information
  – Complete genomes available for many

What do Prokaryotic Genes look like?
[Figure: a prokaryotic gene drawn 5’ to 3’, labeled: promoter region (maybe), ribosome binding site (maybe), start codon, open reading frame, stop codon, termination sequence (maybe)]

Prokaryotic Gene Expression
[Figure: Promoter, Cistron 1, Cistron 2, ... Cistron N, Terminator; transcription by RNA polymerase gives an mRNA (5’ to 3’) with an SD in the polycistronic message; translation by ribosome, tRNAs and protein factors gives polypeptides 1, 2, 3, each drawn N to C]
Slide modified from: http://biology.uky.edu/520/Lecture/lect8/lect8Notes.ppt

Open Reading Frame (ORF)
• Any stretch of DNA that potentially encodes a protein
• The identification of an ORF is the first indication that a segment of DNA may be part of a functional gene

Open Reading Frames
Sequence: A C G T A A C T G A C T A G G T G A A T
Frame 2:  CGT AAC TGA CTA GGT GAA
Frame 3:  GTA ACT GAC TAG GTG AAT
Each grouping of the nucleotides into consecutive triplets constitutes a reading frame. There are three different reading frames in the 5’->3’ direction and a further three in the reverse direction on the opposite strand. A sequence of triplets that contains no stop codon is an Open Reading Frame (ORF).

ORFs as gene candidates
• A gene candidate is an open reading frame that begins with a start codon (usually ATG, GTG or TTG, but this is species-dependent)
• Most prokaryotic genes code for proteins that are 60 or more amino acids in length
• The probability that a random sequence of n codons has no stop codons is (61/64)^n
• When n is 50, there is a probability of about 91% that the random sequence contains a stop codon
• When n is 100, this probability exceeds 99%

Codon Bias
• Genetic code is degenerate
  – Equivalent triplet codons code for the same amino acid
  – http://www.pangloss.com/seidel/Protocols/codon.html
• Codon usage varies
  – organism to organism
  – gene to gene
• Biological basis
  – Avoidance of codons similar to stop
  – Preference for codons that correspond to abundant tRNAs within the organism

Codon Bias: Gene Differences
        Gly GGG   Gly GGA   Gly GGT   Gly GGC
GAL4    0.21      0.17      0.38      0.24
ADH1    0         0         0.93      0.07
Slide modified from: http://biology.uky.edu/520/Lecture/lect8/lect8Notes.ppt

Codon Bias: Organism differences
• Yeast genome: arg specified by AGA 48% of the time (other five equivalent codons ~10% each)
• Fruit fly genome: arg specified by CGC 33% of the time (other five ~13% each)
• Complete set of codon usage biases can be found at: http://www.kazusa.or.jp/codon/

GC content
• GC relative to AT is a distinguishing factor of bacterial genomes
• Varies dramatically across species
  – Serves as a means to identify bacterial species
• For various biological reasons
  – Mutational bias of particular DNA polymerases
  – DNA repair mechanisms
  – Horizontal gene transfer (transformation, transduction, conjugation)

GC Content
• GC content may be different in recently acquired genes than elsewhere
• This can lead to variations in the frequency of codon usage within coding regions
  – There may be significant differences in codon bias within different genes of a single bacterium’s genome

Ribosome Binding Sites
• The RBS, also known as a Shine-Dalgarno sequence (species-dependent), should bind well with the 3’ end of 16S rRNA (part of the ribosome)
• Usually found within 4-18 nucleotides of the start codon of a true gene

Shine-Dalgarno Sequence
• A nucleotide sequence (consensus = AGGAGG) that is present in the 5' untranslated region of prokaryotic mRNAs.
• This sequence serves as a binding site for ribosomes and is thought to influence the reading frame.
• If a subsequence aligning well with the Shine-Dalgarno sequence is found within 4-18 nucleotides of an ORF’s start codon, that improves the ORF’s candidacy.

Bacterial Promoter
-35 region: T82 T84 G78 A65 C54 A45 ... (16-18 bp) ... -10 region: T80 A95 T45 A60 A50 T96 ... (A,G) at +1
(the number after each base is its percent occurrence at that position)
Not so simple: remember, these are consensus sequences

Termination Sequences
• 3’-U tail
• Stem/loop
  – Inverted repeat immediately preceding the runs of uracil
[Figure: termination sequence]
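
The ORF criteria from the slides (a start codon, no internal stop, a minimum length of about 60 amino acids, six reading frames) can be sketched in Python roughly as follows. This is a minimal illustration, not a real gene finder: the function names, the returned tuple layout, and the choice to report coordinates on the scanned strand are assumptions made for the example, and the start codon set is species-dependent.

```python
# Hypothetical helper names; a naive six-frame ORF scan for illustration only.
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODONS = {"ATG", "GTG", "TTG"}  # assumption: usual bacterial starts
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the opposite strand, read 5'->3'. Non-ACGT symbols are kept as-is."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=60):
    """List candidate ORFs as (strand, frame, start, end).

    start/end are 0-based indices on the scanned strand ("-" means the
    reverse complement); end is exclusive and includes the stop codon.
    min_codons counts codons from the start codon up to, but not
    including, the stop codon.
    """
    seq = seq.upper()
    orfs = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None  # index of the first start codon since the last stop
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon in START_CODONS:
                    start = i
                elif codon in STOP_CODONS:
                    if start is not None and (i - start) // 3 >= min_codons:
                        orfs.append((strand, frame, start, i + 3))
                    start = None
    return orfs
```

For example, `find_orfs("ATG" + "GCT" * 60 + "TAA")` reports a single forward-strand ORF spanning the whole 186-nucleotide string. Taking the first start codon after a stop is a simplification; real programs weigh alternative starts using signals such as the ribosome binding site.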
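
The stop-codon argument behind the minimum ORF length can be checked numerically. The sketch below assumes, as the slide's formula does, that all 64 codons are equally likely and independent, so 61 of 64 are non-stop; the function names are made up for the example.

```python
def prob_no_stop(n_codons):
    """Probability that n random codons contain no stop codon: (61/64)^n."""
    return (61 / 64) ** n_codons


def prob_has_stop(n_codons):
    """Probability that at least one of n random codons is a stop codon."""
    return 1.0 - prob_no_stop(n_codons)


if __name__ == "__main__":
    for n in (50, 60, 100):
        print(n, round(prob_has_stop(n), 4))
```

Under these assumptions `prob_has_stop(50)` is about 0.909 and `prob_has_stop(100)` is above 0.99, which is why long stop-free stretches are unlikely to arise by chance. Real genomes are not uniform random sequences, so the figures are only a rough guide.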
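
The content measures discussed above (GC content and the relative use of synonymous codons, as in the Gly table for GAL4 and ADH1) are simple counts. A minimal sketch, assuming the input is an in-frame coding sequence given as a plain string; the function names are hypothetical:

```python
from collections import Counter

GLY_CODONS = ["GGG", "GGA", "GGT", "GGC"]  # the four glycine codons


def gc_content(seq):
    """Fraction of G and C among unambiguous bases (N and others ignored)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def synonymous_usage(cds, codons):
    """Relative frequency of each codon in `codons` within an in-frame CDS.

    Frequencies are normalized over the listed synonymous codons only,
    so they sum to 1 when at least one of them occurs.
    """
    cds = cds.upper()
    counts = Counter(cds[i:i + 3] for i in range(0, len(cds) - 2, 3))
    total = sum(counts[c] for c in codons)
    return {c: (counts[c] / total if total else 0.0) for c in codons}
```

For instance, `synonymous_usage("GGTGGTGGTGGC", GLY_CODONS)` gives 0.75 for GGT and 0.25 for GGC. Comparing such profiles between a candidate ORF and known genes of the same organism is one way the codon-bias signal could be used; how that comparison is scored is left open here.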
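
The Shine-Dalgarno check (a good match to AGGAGG within 4-18 nucleotides of the start codon) can be sketched as an ungapped scan upstream of a candidate start. Several choices here are assumptions for illustration: the distance is measured as the gap between the 3' end of the motif and the first base of the start codon, the score is a plain count of matching positions, and `best_sd_match` is a made-up name.

```python
SD_CONSENSUS = "AGGAGG"


def best_sd_match(seq, start, min_gap=4, max_gap=18):
    """Best ungapped match to the SD consensus upstream of a start codon.

    seq   : DNA string on the coding strand
    start : 0-based index of the first base of the start codon
    Returns (score, gap), where score is the number of matching positions
    (0-6) and gap is the number of bases between motif and start codon,
    or None if there is not enough upstream sequence.
    """
    seq = seq.upper()
    k = len(SD_CONSENSUS)
    best = None
    for gap in range(min_gap, max_gap + 1):
        pos = start - gap - k  # where the motif would begin for this gap
        if pos < 0:
            break
        window = seq[pos:pos + k]
        score = sum(a == b for a, b in zip(window, SD_CONSENSUS))
        if best is None or score > best[0]:
            best = (score, gap)
    return best
```

With `seq = "CCCC" + "AGGAGG" + "CCCCCCC" + "ATG" + "AAA"` and `start = 17`, the function returns `(6, 7)`: a perfect match seven bases upstream. A more faithful model would score base pairing against the 3' end of the organism's own 16S rRNA, since the consensus is species-dependent.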