Gene Finding

Biological Background

The Central Dogma
- DNA -> (transcription) -> RNA -> (translation) -> Protein
- [Figure: Essential Cell Biology, p. 268]

Non-coding regions: gene regulation
- Vicinity of the TSS: direct interactions with the Pol II complex
- Larger vicinity: indirect interactions (chromatin remodelling)

The Genetic Code
- [Figure: codon table indexed by first letter, second letter and third letter]

tRNA: Responsible for Translation
- [Figure adapted from Genetic Analysis V, p. 388]

Frame Shifts
- Code triplets ("codons") do not overlap
- There are 3 x 2 = 6 possible ways of reading a sequence, depending on the strand and on the relative position where reading starts
- This is not just our concern when looking for genes; it is also the cell's concern in terms of mutations:
  Original: THE FAT CAT ATE THE BIG RAT
  Delete C: THE FAT ATA TET HEB IGR AT

Prokaryotes Gene Finding
- No nucleus
- Most DNA is coding (e.g. 70% in H. influenzae)
- Each gene is one continuous DNA sequence (no introns)
- For comparison, in eukaryotes: Pol I - rRNA, Pol II - mRNA, Pol III - tRNA

Detecting ORF
- ORF (open reading frame): a sequence of codons with no STOP codon
- Simple idea: if no gene is encoded, a STOP codon is expected with frequency 3/64, i.e. about once every 21 codons, so long stretches without a STOP codon are suspicious
- Simple algorithm:
  1. Scan until you find a stop codon, in all reading frames.
  2. Scan back to find a start codon.
  3. If it is long enough, report this ORF as a putative gene.
- Cons:
  - Cannot detect short genes
  - High false-positive rate (E. coli has 6500 ORFs but only 1100 genes)

Coding vs. Non-coding Regions: Codon Frequencies
- Codon usage in coding regions is different
- Leucine, Alanine and Tryptophan are coded by 6, 4 and 1 different codons respectively
- In a random sequence we expect them to appear in a 6:4:1 ratio
- In proteins they appear in a 6.9:6.5:1 ratio
- Another example: A or T appears in 90% of the cases as the last letter of a codon in protein-coding regions

Nucleotide Markov Model (MM) for Gene Detection

2nd Order MM
- Idea: extend the model to capture codons
- Results: poor.
- Codons overlap in this model

MM over Codons
- Idea: transform the sequence into codons, then use a 1st-order MM
- Why not use codon frequencies directly?

"Codon Preferences" program
- Uses a window of 25 codons around each point
- Score: log( P / (1 - P) )

Using the Promoter's Signal
- We are still far from perfect.
- Idea: try to detect signals in the promoter regions, to help discriminate real genes among ORFs
- Prokaryotes:
  - ~ -35 from the TSS: TTGACA
  - ~ -10 from the TSS: TATAAT ("TATA box" signal)
- No single promoter has the exact consensus
- Nearly all promoters have 2-3 of the fixed positions in TAxyzT
- 80-90% have all 3
- In 50%, xyz = TAA

Summary So Far
- We have seen the problems in trying to find genes in a genome-wide scan, in prokaryotes.
- The bottom line is that the problem is not really solved, but most research in gene finding focuses on eukaryotes, where the main interest lies.
- Next lecture: much more sophisticated models, to handle the much more complex situation in eukaryotes in general, and in human in particular.
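
Example: a sketch of the simple ORF scan

The three-step algorithm from the "Detecting ORF" slide can be sketched roughly as below. This is only an illustration under stated assumptions, not the lecture's own code: the names `find_orfs`, `orfs_in_frame` and `min_codons` are made up for this example, ATG is assumed to be the only start codon, and "scan back" is implemented by taking the start codon furthest upstream of each stop (the longest candidate). The default threshold of 100 codons is an arbitrary choice.

```python
# Minimal ORF scan over all six reading frames (3 offsets x 2 strands).
# Assumptions (not from the slides): ATG is the only start codon, and the
# start furthest upstream of each stop is the one reported.

STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the opposite strand, read 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]


def orfs_in_frame(seq, frame, min_codons):
    """Return (start, end) pairs for ORFs in one frame of one strand.

    `end` is exclusive and includes the stop codon. The length check counts
    the codons from the start codon up to (not including) the stop.
    """
    orfs = []
    region_start = frame  # first codon after the previous stop in this frame
    for pos in range(frame, len(seq) - 2, 3):
        if seq[pos:pos + 3] in STOP_CODONS:
            # Step 2: look for a start codon between the previous stop and
            # this one; the first hit is the one furthest upstream.
            for start in range(region_start, pos, 3):
                if seq[start:start + 3] == START_CODON:
                    # Step 3: keep the ORF only if it is long enough.
                    if (pos - start) // 3 >= min_codons:
                        orfs.append((start, pos + 3))
                    break
            region_start = pos + 3
    return orfs


def find_orfs(seq, min_codons=100):
    """Return (strand, frame, start, end) for every putative gene.

    Coordinates on the "-" strand are relative to the reverse complement.
    """
    seq = seq.upper()
    results = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            for start, end in orfs_in_frame(s, frame, min_codons):
                results.append((strand, frame, start, end))
    return results


if __name__ == "__main__":
    toy = "CC" + "ATG" + "AAA" * 5 + "TAA" + "GG"
    print(find_orfs(toy, min_codons=5))
```

Lowering `min_codons` finds shorter genes but reports more spurious ORFs, which is the trade-off listed under "Cons" on the same slide.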