DARK MATTER IN THE GENOME
Shin-Han Shiu
Plant Biology / Genetics / EEBB

Cell, nucleus, and chromosomes

DNA
A G G C G T A G A G A G A T C C T T G A T T C C G C A A C T C T C A A G G A A C A A

DNA and Genome
How many A's, T's, G's, and C's are there in the human genome?
3,038,000,000 letters

A sizable book, say, the most recent Harry Potter book:
~1,516,000 characters in 758 pages*

The book of our life:
1,519,000 pages, or ~2,000 copies of the Deathly Hallows

How fast can you read?
Say, one book a day: it would take about 5.5 years.
No vacation, no social life, no going anywhere else.
And worst of all, it looks like this...

TTGGCTATCCTTTATATTTTAAGGGTTATTAGGATATTTTTTATTATGACTACATGGGATAAATGTTTAAAAAAAATAAAAAAAAACCTTTCTACGTTTGAGTATAAGACGTGGATAAAGCCTATCCATGTGGAGCAAAA ...
[the slide is filled with several thousand more bases of raw sequence]

Our research interest
Now, 1366 genomes are sequenced or being sequenced

Evolution of genome sizes (in Mb)
Thale cress (Arabidopsis thaliana): 150
Fruit fly (Drosophila melanogaster): 160
Pufferfish (Takifugu rubripes): 400
Human (Homo sapiens): 3,000
Onion (Allium cepa): 16,750
Tiger salamander (Ambystoma tigrinum): 32,000
Marbled lungfish (Protopterus aethiopicus): 132,000
http://www.rbgkew.org.uk/

What's in the genome
[Diagram of the genome. Labels: annotated genes (exon, UTR, intron), cis-regulatory elements, selfish elements, dead genes (pseudogenes), novel genes]

"Non-genic": repetitive elements
E.g. human genome

                                A      B      C
Exons take up?                  1%     24%    25%
Introns account for?            24%    1%     25%
Repetitive elements occupy?     35%    60%    45%
Unknown?                        40%    15%    5%

Venter et al. (2001) Science 291:1304

What is in the unknown regions?
Investigate with tiling arrays
[Figure: cDNA array compared with tiling array]
Gap size: 10 bp
Probe size: 25 bp
Number of features:
Arabidopsis, 135 Mb, 1 chip, ~$6 \times 10^6$ features
Human, 3 Gb, 7 chips, ~$4.2 \times 10^7$ features

"Non-genic": unannotated genes
Tiling array analysis of human Chr 21, 22
Kapranov et al., 2002. Science

Tiling array analysis of human transcriptome
Human Chr 21, 22
What do you think these expressed regions represent?
Kapranov et al., 2002. Science

Difficulties for coding gene prediction
Training data: you need to know something...
"Biased" toward the properties of the majority.
Real genes that are shorter tend to be much harder to predict.

Table 3. Accuracy of GISMO, Glimmer and CRITICA in predicting short genes (<300 bp)

Gene finder    Cor     Sn      Snfk (%)    Sp
GISMO          0.64    63.0    86.4        69.0
Glimmer        0.54    72.0    83.7        44.0
CRITICA        0.60    46.0    67.4        84.0

Snfk denotes the sensitivity in detecting function-known genes.
Krause et al., 2006. Nucleic Acids Res. 35:540

Novel coding sequence identification
Arabidopsis thaliana as an example
135 Mb, ~50% occupied by annotated genes.
Focus on coding sequences 90-300 bp long: 133,090 sORFs
What would you do next to eliminate ORFs that are likely false predictions?

Criterion 1: Codon usage bias
Some codons are used more frequently than others
http://www.cbs.dtu.dk/services/GenomeAtlas/

For example: codons for proline

Codon    NCDS    CDS
CCT      0.25    0.12
CCC      0.25    0.49
CCA      0.25    0.06
CCG      0.25    0.33

Suppose you have the following 2 sequences, both coding for poly-proline. Which one is more likely to be a real coding sequence?

Seq1: CCT CCA CCT    p(CDS | Seq1): $0.12 \times 0.06 \times 0.12 = 8.6 \times 10^{-4}$
Seq2: CCC CCG CCC    p(CDS | Seq2): $0.49 \times 0.33 \times 0.49 = 7.9 \times 10^{-2}$

Posterior probability of coding sequence
Compare known non-coding and coding sequences
Hanada et al., 2007. Genome Res.

Posterior probability of coding sequence
Scanning Arabidopsis genome
Hanada et al., 2007. Genome Res.
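The proline example can be worked through in a few lines of Python. This is only a minimal sketch of the idea, not the method of Hanada et al.: it uses the four codon frequencies given on the slide, treats codons as independent, and assumes equal prior odds of coding and non-coding (the `prior_cds=0.5` default). The function names are made up for illustration.

```python
from math import prod

# Proline codon frequencies copied from the slide.
CDS_FREQ = {"CCT": 0.12, "CCC": 0.49, "CCA": 0.06, "CCG": 0.33}
NCDS_FREQ = {"CCT": 0.25, "CCC": 0.25, "CCA": 0.25, "CCG": 0.25}


def codons(seq):
    """Split a sequence into codons, ignoring spaces and case."""
    seq = seq.replace(" ", "").upper()
    if len(seq) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


def likelihood(seq, freq):
    """Product of per-codon frequencies (codons treated as independent)."""
    return prod(freq[c] for c in codons(seq))


def posterior_cds(seq, prior_cds=0.5):
    """Posterior probability of coding via Bayes' rule.

    prior_cds is an assumption made for this sketch, not a value
    taken from the talk.
    """
    coding = likelihood(seq, CDS_FREQ) * prior_cds
    noncoding = likelihood(seq, NCDS_FREQ) * (1.0 - prior_cds)
    return coding / (coding + noncoding)


if __name__ == "__main__":
    for name, seq in [("Seq1", "CCT CCA CCT"), ("Seq2", "CCC CCG CCC")]:
        print(name, likelihood(seq, CDS_FREQ), posterior_cds(seq))
```

With equal priors, Seq1 comes out with a posterior of about 0.05 and Seq2 about 0.84, so Seq2 looks far more like a real coding sequence. A genome-wide scan would need frequencies for all 64 codons and a prior that reflects how rare true coding sORFs are.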
After applying the first criterion
7,442 coding sORFs

How good is the CDS-finding measure?
For the training data:
For 18 Arabidopsis small protein genes, all 18 are predicted as CDS.
For 84 yeast small protein genes, all 84 are predicted as CDS.

So what does this mean?
If a sequence is a true coding sequence, our approach can predict it with high accuracy.
So the sensitivity is very good.
Is this good enough?
What about specificity? Namely, how good are the criteria at excluding false positives?

Criterion 2: Expression
Which of the following distributions more likely depicts the expression levels of true CDS compared to those of false CDS?
[Figure: frequency vs. expression level (low to high). Tiling array: gap size 10 bp, probe size 25 bp]

Comparison of expression levels
Exon, intron, tRNA, rRNA, our predictions
A: Exon
B: Intron
C: Predicted novel CDS
D: tRNA
E: rRNA

Applying the second criterion
Predictions are significantly enriched in expressed sequences
2,996 transcribed sORFs

Criterion 3: Purifying selection
Compare known coding and non-coding sequences
$\omega = K_a / K_s$
$K_a$: non-synonymous substitution rate
$K_s$: synonymous substitution rate
$\omega < 1$: negative (purifying) selection
$\omega = 1$: selectively neutral
$\omega > 1$: positive selection

In the end,
We found that a large number (941) of small ORFs have the following three properties:
They have nucleotide composition similar to known coding sequences.
They are expressed.
They are subject to selection in the way a protein-coding sequence would be.

Take home message:
We don't know the functions of any of these.
The view that most of the "intergenic" region is junk DNA may be wrong.

Acknowledgement
Current and past lab members: Kousuke Hanada, Melissa Lehti-Shiu, Cheng Zou
TIGR: Chris Town, Hank Wu
University of Chicago: Wen-Hsiung Li, Justin O. Borevitz, Xu Zhang
Funding