Gene Finding

Biological Background

The Central Dogma
DNA → (transcription) → RNA → (translation) → Protein

Background (*Essential Cell Biology, p. 268)
• Non-coding regions: gene regulation
• Vicinity of the TSS (transcription start site): direct interactions with the Pol II complex
• Larger vicinity: indirect interactions (chromatin remodelling)

The Genetic Code
[Figure: genetic code table, indexed by the first, second and third letter of the codon]

tRNA: Responsible for Translation
[Figure adapted from Genetic Analysis V, p. 388]

Frame Shifts
• Code triplets ("codons") do not overlap
• 3 × 2 = 6 possible reading frames, depending on the strand and on the position where reading starts
• This matters not only to us when looking for genes but also to the cell, because of mutations:
  Original: THE FAT CAT ATE THE BIG RAT
  Delete C: THE FAT ATA TET HEB IGR AT

Gene Finding in Prokaryotes
• No nucleus
• Most DNA is coding (e.g. 70% in H. influenzae)
• Each gene is one continuous DNA sequence (no introns)
• Eukaryotic RNA polymerases, for comparison: Pol I (rRNA), Pol II (mRNA), Pol III (tRNA); prokaryotes use a single RNA polymerase

Detecting ORFs
ORF (open reading frame): a sequence of codons with no STOP codon
Simple idea: if no gene is encoded, 3 of the 64 possible codons are STOP codons, so in random sequence a STOP is expected about once every 64/3 ≈ 21 codons. A much longer stop-free stretch is unlikely to occur by chance.
Simple algorithm:
1. Scan, in all reading frames, until you find a stop codon.
2. Scan back to find a start codon.
3. If it is long enough, report this ORF as a putative gene.
Cons:
• Can't detect short genes
• High false-positive (FP) rate (E. coli has 6500 ORFs but only 1100 genes)

Coding vs. Non-coding Regions: Codon Frequencies
• Codon usage in coding regions is different
• Leucine, alanine and tryptophan are encoded by 6, 4 and 1 codons, respectively
• Expect to see a 6:4:1 ratio in random sequence
• In proteins they appear in a 6.9:6.5:1 ratio
• Another example: A or T appears as the last letter of a codon in 90% of cases in protein-coding regions

Nucleotide MM (Markov Model) for Gene Detection
2nd-Order MM
Idea: extend the model to capture codons
Results: poor
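The simple ORF-detection algorithm from the "Detecting ORF" slide (scan to a stop codon, scan back to a start codon, keep it if long enough) can be sketched in Python as below. This is only a minimal illustration under stated assumptions, not the lecture's own code: ATG is taken as the only start codon, "scan back" is interpreted as picking the most upstream ATG after the previous stop, and the names `find_orfs`, `orfs_in_frame` and the `min_codons` threshold are hypothetical. Coordinates on the "-" strand refer to the reverse-complemented sequence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"  # assumption: ATG is the only start codon considered
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq, frame, min_codons):
    """Yield (start, end) for ORFs in one reading frame.

    start is the index of the start codon, end is one past the stop codon.
    """
    codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
    last_stop = -1  # codon index of the previous stop in this frame
    for idx, codon in enumerate(codons):
        if codon not in STOP_CODONS:
            continue
        # Step 2: look between the previous stop and this stop
        # for the most upstream start codon.
        for j in range(last_stop + 1, idx):
            if codons[j] == START_CODON:
                # Step 3: report only if long enough (stop codon not counted).
                if idx - j >= min_codons:
                    yield frame + 3 * j, frame + 3 * (idx + 1)
                break
        last_stop = idx


def find_orfs(dna, min_codons=100):
    """Scan all 3 x 2 reading frames and return (strand, frame, start, end)."""
    dna = dna.upper()
    results = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            for start, end in orfs_in_frame(seq, frame, min_codons):
                results.append((strand, frame, start, end))
    return results
```

Raising `min_codons` lowers the false-positive rate at the price of missing short genes, which is exactly the trade-off listed under "Cons".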
Problem with the 2nd-order MM: codons overlap in this model

MM over Codons
Idea: transform the sequence into codons, then use a 1st-order MM
Why not use codon frequencies directly?

"Codon Preferences" program
• Uses a window of 25 codons around each point
• Score: log(P / (1 − P))

Using the Promoter's Signal
• We are still far from perfect...
• Idea: try to detect signals in the promoter regions, to help discriminate real genes among ORFs
• Prokaryotes: around −35 from the TSS: TTGACA; around −10 from the TSS: TATAAT ("TATA box" signal)
• No single promoter has the exact consensus
• Nearly all promoters have 2-3 of the three conserved positions in TAxyzT
• 80-90% have all 3
• In 50%, xyz = TAA

Summary So Far
• We have seen the problems of finding genes in a genome-wide scan, in prokaryotes!
• The bottom line is that the problem is not really solved, but most research in gene finding focuses on eukaryotes, where the main interest lies
• Next lecture: much more sophisticated models, to handle the much more complex situation in eukaryotes in general, and human in particular
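The windowed scoring idea from the "Codon Preferences" slide can be illustrated with a short Python sketch. It is a generic log-likelihood-ratio score, not necessarily the exact statistic of that program. The assumptions are: codon frequencies for coding regions are estimated from known coding sequences, the non-coding background is uniform (1/64 per codon), and coding and non-coding are given equal prior probability, in which case log(P / (1 − P)) for the posterior P of "coding" equals the sum of per-codon log ratios over the window. The function names and the `pseudocount` parameter are hypothetical.

```python
import math
from collections import Counter
from itertools import product

ALL_CODONS = ["".join(p) for p in product("ACGT", repeat=3)]


def codon_frequencies(coding_seqs, pseudocount=1.0):
    """Estimate codon frequencies from in-frame coding sequences."""
    counts = Counter()
    for seq in coding_seqs:
        for i in range(0, len(seq) - 2, 3):
            counts[seq[i:i + 3]] += 1
    # The pseudocount keeps unseen codons from getting probability zero.
    total = sum(counts[c] for c in ALL_CODONS) + pseudocount * len(ALL_CODONS)
    return {c: (counts[c] + pseudocount) / total for c in ALL_CODONS}


def window_scores(seq, coding_freq, frame=0, window=25):
    """Return one log-odds score per codon position in the given frame.

    Each score sums log(f_coding / f_background) over a window of up to
    `window` codons centred on that position (truncated at the ends).
    """
    background = 1.0 / 64  # assumption: uniform codon usage when non-coding
    codons = [seq[i:i + 3] for i in range(frame, len(seq) - 2, 3)]
    # Codons containing other symbols (e.g. N) contribute zero to the score.
    per_codon = [math.log(coding_freq.get(c, background) / background)
                 for c in codons]
    half = window // 2
    scores = []
    for k in range(len(per_codon)):
        lo, hi = max(0, k - half), min(len(per_codon), k + half + 1)
        scores.append(sum(per_codon[lo:hi]))
    return scores
```

Positive scores mark windows whose codon usage looks more like the training genes than like the uniform background. Running the function for each of the six reading frames and comparing the curves is one way to use it alongside an ORF scan.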
