Download Genome Dark Matter v.2

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
DARK MATTER IN THE GENOME
Shin-Han Shiu
Plant Biology / Genetics / EEBB
Cell, nucleus, and chromosomes
DNA
A
G
G
C
G
T
A
G
A
G
A
G
A
T
C
C
T
T
G
A
T
T
C
C
G
C
A
A
C
T
C
T
C
A
A
G
G
A
A
C
A
A
DNA and Genome

How many A's, T's, G's, and C's are there in the human
genome?
3,038,000,000 letters

A sizable book, say, the most recent Harry Potter book
~1,516,000 characters in 758 pages*

The book of our life
1,519,000 pages
~1,000 of the Deathly Hallows

How fast can you read?
Say, 1/day, take about 3 years
No vacation, no social life, no going anywhere else.
And the worse of all, it looks like this...
TTGGCTATCCTTTATATTTTAAGGGTTATTAGGATATTTTTTATTATGACTACATGGGATAAATGTTTAAAAAAAATAAAAAAAAACCTTTCTACGTTTGAGTATAAGACGTGGATAAAGCCTATCCATGTGGAGCAAAA
TAGTAACTTATTCACAGTTTACTGTAACAATGAATATTTCAAAAAACATATAAAATCTAAGTATGGAAATCTTATTTTATCAACAATCCAAGAGTGTCATGGTAATGATTTAATTATTGAATATTCTAATAAAAAATTCT
CTGGCGAAAAAATTACTGAGGTTATCACAGCTGGACCACAAGCTAATTTTTTTAGCACAACAAGTGTTGAGATAAAAGATGAATCAGAAGATACAAAAGTAGTACAAGAACCTAAAATATCAAAGAAGTCTAATAGTAAA
GACTTTTCTTCATCACAAGAGTTATTCGGTTTTGACGAAGCTATGCTAATTACAGCAAAAGAAGATGAGGAATACTCTTTTGGTTTACCGTTAAAAGAAAAATATGTTTTTGATAGTTTTGTTGTTGGAGATGCTAACAA
AATTGCTAGAGCAGCGGCTATGCAGGTATCGATAAATCCAGGTAAATTACATAACCCTTTATTCATTTATGGTGGTAGTGGTTTAGGTAAAACTCACTTAATGCAAGCAATAGGTAATCATGCAAGAGAAGTTAATCCTA
ATGCCAAAATTATTTATACAAATTCAGAACAATTTATTAAAGATTATGTAAATTCTATTCGTTTACAAGATCAAGATGAGTTTCAAAGAGTTTATAGATCTGCGGATATACTTTTGATTGATGATATTCAATTTATCGCT
GGTAAAGAGGGTACTGCTCAGGAGTTTTTCCATACTTTTAATGCATTGTATGAAAATGGTAAACAGATAATTCTAACTAGTGATAAGTATCCAAATGAAATAGAAGGGCTTGAAGAAAGACTAGTTTCGCGTTTTGGTTA
TGGTTTAACAGTTTCTGTTGATATGCCAGATTTAGAAACCAGAATTGCTATCTTGCTCAAAAAAGCTCATGATTTAGGTCAGAAATTACCTAACGAAACAGCAGCTTTTATTGCTGAGAATGTACGTACTAATGTCAGAG
AACTAGAAGGTGCTCTAAATAGGGTTCTTACTACCTCTAAATTTAATCATAAAGATCCTACTATCGAAGTAGCACAAGCTTGCTTAAGAGATGTTATAAAAATACAAGAAAAGAAAGTAAAAATAGATAATATCCAAAAG
GTTGTTGCTGATTTTTATAGAATCAGGGTAAAAGATTTAACTTCTAATCAAAGAAGTAGAAATATAGCTAGACCAAGACAGATAGCAATGAGTTTAGCACGTGAACTAACATCACATAGTTTGCCAGAAATAGGCAATGC
TTTTGGTGGTAGAGACCATACGACAGTTATGCATGCTGTCAAAGCTATAACTAAATTAAGACAAAGCAATACTTCAATATCGGATGATTATGAGTTGCTTTTAAATAAAATTTCTCGTTAAATAAAATTAGTAACTTTAT
CAAAGGGGTTTTAAAAAATGAATTTTGTACTAAATAGAGATGACTTACTAAAGCCTTTGCAATCTATGCTCTCAGTTGCAAATAGTAAGAGTACAATGCCTTTATTATCATGTATCTTATTTGATATTGATAATAATAAT
CTCAAAATTACGGCTTCGGATCTTGATACAGAGATATCATGCAATATAGCAGTTAGTTGTAACACAACTATTAAGTTAGCATTAAATGCTGACAAAATTTATAACATTGTCAGAAGCTTAAATGAAAATTCAATGATTGA
TTTTAGAATTAATGAAAATAAGGTAACTATTGTTTCTAATAATAGTACTTTTAACCTTATATCACTAAATGCTGACAACTATCCTCTTATTGATAGTAATATCAATGAGCAAGCAAGTTTTGATCTTTCTCAACAAGATT
TTCATCATATTATTTCAAAAGTAGATTTCTCAATGGCTAATGATGATACTCGATATTTCTTAAATGGGATGTTTTGGGAAATCAACGCAAATCTACTAAGAGCAGTATCTACAGATGGTCATAGAATGTCTATCACAGAG
GCTATAATTGATAGTAAAGTGTTAGATAGTGCTTCTCAGTCGATAATTCCAAAAAAAGCGATTTTAGAGCTTAAAAAGATAGTTGGCAAAACAGAAGAAAATATCAAAATTTGTCTTGGCAAAAATTATCTAAAAGCGAT
TTTTGGTAATTATGCTTTTATATCAAAGCTTATAGATGGTCGCTATCCTGATTACCAAAAAGTAATCCCTAAAAATAATACAAAACTATTAGCAGTTGATAAGCAGTTTTTCAAAAATTCATTATTAAGAACATCAATAC
TTGCTAATGATAAATATAAAGGTGTTCGTCTTAACATATCTCAAAATCAATTACTTCTATCAGCTAATAACCCTGATAATGAAAAGGCTGAAGATAAAATCGAAGTTCAATATAATGATCAACCAATGGAAATTTGTTTT
AATTACAAATATCTTTTGGATATTATAAATGTACTTAGTGAAGAAACTATGTCTATCTACCTTGATAATCCAAATATGAGTGCTTTAGTTAAAGATGAGAAAGATAATAGTTTGTTTATTATTATGCCAATGAAAATTTA
AGTAATAAGTAGTTTTAGGAAATAACTATTTTTATAAGCCTTTTGGAATGAATAATAAAGCAATAAAAAAAGGTATGCATAAAAACATTATATAGAAAGCTGGGATTAGATAATTTCCAGTAGTAGTAATTAATAAAGTC
ATAAGAAAGGCAACAGTACCTCCAAATAAAGAAACGCTTATATTAAAACTTATTCCAAAACCAGTATTTCTATTTCTAACAGGAAATAGCCCTGCCGTATTTGCAAATATAGGTCCTATTACAGCACCACTTATAATAGC
AAGAGAAAAAATAGCTATAGATACTAATTGATGATTTTTTATAATAATTTGGTATATTGGTAAAACAGCTATAAATAAGACTATACAAGAATACATCAGAACTTTTTTACCACCAATTCTATCAGCAATATATCCAAATA
TAATTGAAGAAAGCATTAATACTATAGTTAATCCGAGAGTATTTTGTTAAACACATAAGAGATAGCATCACAAAATTTTTCAAAACTATTATTCACTTTTCTAAATATTTTTTTAAAGTTAGCCCAAACCTTTTCAATAG
GATTTAAATCTGGAGAATACGGAGGTAGATATAATATTTGTACATCAAATTTATTGGCTATTTCAATCAGCTTAGAGGATTTATGGAAACTAGCATTATCCATTACTATAGTAGTTTTAGGTTTTAATGATGGGCATAAG
TGTTCCTCAAACCATTGATTAAAAATTTCAGTATTGGTATATCCACTGTACTCTAATGGAGCTATAATCTTTTTATCTGCATAATTATATCCAGCAACAATACTTCTTCTTTGTGTTTGATATGCTAAAACCTCACCATA
ACTAGGCTCACCAATTAGTGACCATCCTCTTAGGATAGAAAGCTTATTGTCACACCCCATCTCATCTATATAAAATAACAAGTTTTGAGCTATTTCTTTTAGTTTTTCTATATACTCCAACCTTTCATGTTCTTTTCTTT
GCTTATATTTTGGAGTCTTTTTTTAAAACTAAAACCAAGTCTATTAAGACAATCATAAAATGTACTTCTTGGAATATCAGGGGCTAATGCTTCTTTTATATCTAATGCACTTGCATCTGGATGATCTATCAAATACTGTT
CAATCAATGTTTTATCGGTAAAGCTAGCGACTCTGCCACAACCAACTCCTTGCTTTGAACTATAATCTCCGGTTCTTTTATAAAACTCTATCCATGAAACAACTGTACGCTTATCTATGTTAAAAAACTTACTCAGCTCG
AACTCCGTCATACCTTCTTCATATTTATTAATTACGATGTCTCTAAAATATTGGCTATATGATGGCATTTTTATTAGACATTATAACATTTCTACAAATATCTTTTTCTACAAATATCTTTCGGATTAACTATATAAGTA
GAGTCAACAACCATCCAAATCACCCAATTATCTATAATTTTCTGCTTGCTAAAAAAACGCATACCAATGATGCTACACTTGTAAAACCATCCATATATGGCGTTGTTGAATCAGTATAAAATATAAGTAGTTGCGAAACT
AGTAACCCAAATACTACAATGCTTACTAGAACTTTTAACCAACCAATGATTTTAAGTCTATGAACAACTATCTTTTTGTGACTAAAGTTGGGTTGCCAACTATACCAACCGTATCCAAAGCTAAATAAAAGAATCATCTG
CAATATAGCATCGGCATATAGTCCACTAACAGAAAATAAACCCGCACTCATGATCAAACCAACTATCTCCACAGGCCAACCAATGACATAAAGCCTTGCCAGCAAAAAGGTACACAAAAGATTAACAATCATTGTACAAA
AATCAAAAATATGCAGCATATTTATTTTACTAATCAAAGTATTATAAATATTATAATAACTTTGAAGTTGGCGTATTAAAGCCATAAACTTTAGTAGGTTAGTGTTTATACCAATATTTTGAGATGCTTTCTGCAAGCTA
ATAACATTTAGCTATCTAGCCTAAATAATTAATATACAAAACTTTCAAGCTTATTGAATTTTTCAACAGATACAGCGCGTTATAACAAATAAGTAATTGACTAAATTAAAAAGCAAGTATAATATCGATTGTGTTTATTA
CATAATATAAAACGAGGATAAAAAAAATATGAAATTAAGAAAAGTATTAATCGCGACATTATTAGGAGCTTCTGCTTTATCTTTAAGTAGTTGTTGGTTACTTGTTGGTGCAGCTGTTGGTGGTGGAACTGCTGCGTATA
TTTCTGGTGAGTATTCAATGAATATGAGTGGCAGTGTAAAAGATATTTACAATGCTACTTTAAAAGCTGTTCAAAGCAATGATGATTTTGTAATTACTAAAAAATCTATTACTTCTGTTGATGCAGTTGTTGATGGTAGT
ACTAAGGTAGACTCAACAAGTTTCTATGTTAAAATAGAAAAACTTACTGATAATGCTTCAAAAGTTACAATTAAGTTTGGTACTTTTGGTGACCAAGCAATGTCAGCAACATTAATGGATCAAATCCAAAAGAATCTTTA
ATTAAATAGGTAATTACTATAATGACTTTTCTAAAGAAAGCTTTTATTGCAACTATAGTTTCTATTTCAGCATTAGTTCTAAATAGTTGTATTGTTGCAGCAATAGCTGTTGGTGGTGGAACAGTTGCCTATATTGATGG
AAATTATTTTATGAATATAGAAGGCAACTATAAAGCTGTCTATAAAGCTACTCTTAAAGCTATTAATGATAATAATGACTTTGTTCTAGTATCAAAAGATCTTGATCAAACAAAGCAAAATGCCGACATTGAAGGTGCTA
CTAAAATTGATAGTACGAGTTTTAGTGTCAAAATTGAAAGACTGACAGATCAGGCTACTAAAGTGACAATCAAATTTGGTACTTTTGGCGATCAAGCAATGTCATCAACATTAATGGATCAGATCCAGGCAGCTGTACAT
AAAGCTTCTTAGAAATGTACAAAAAACTCTACTTAATTATATTATCCACAATAATCGCAATCTCTCTTAATAGTTGTGTTGTTGCCGCTGTTTTAATTGGTACAGCAGTTGTTGCTGGAGGTACAGTATATTACATCAAT
GGTAACTATATAATCGAAGTCCCTAAAGATATTAGAAGTGTATACAATGCTACAATCAAGACTATACAGATGGATAGTCAAAATAAACTAATAAGTCAAACCTATAATACTAAATCTGCTATAATTAAAGCTTTACAAAA
AGGTGAAAAAATTAGTATAGATTTAAGCAATATTGATAGTCGTTCAACAGAGATAAAAATTCGTATAGGTGTACTTGGCGATGAGAAAAAATCTGCTGATTTAGCAAACTCAATAACAAAAAATATCACCTAAGCAATAT
TTCTCGAACTTTGGTTAACTTTTTCTTTTTAAAAACTTTCAAAAATGTATAATTTGTGTTAGTTTGCAAACTACCCTTATATCCATAATGAGTAATAAGGTATTAGATACATATTATAAAAACAATCGACATATTTGGGT
GCTAGTACTATCTGGTGCTGTTATAGGCACAATGATTGGTCTTCTAGCAACAGCATTTCAGCTACTCCTAGACTTTATTTTTAAAATTAAGCTGGCTCTTTTTTCTTTCAGTGGTGGTAATCTTTTTATCGAAATCGCTA
TGTCAATATCATTAAGTATTGTGATGGTATTAATTTCGATTTTTATTGTTAAAAAATTTGCGAAAGAGGCTGGTGGTAGCGGTATCCAAGAGGTTGAGGGTGCTTTAAAAGGCTGCCGCAAAATACGTAAAAGAGTTATG
CCCGTGAAGTTTATAAGTGGACTTTTTTCGTTAGGCTCAGGTTTAAGTTTAGGTAAAGAGGGACCATCAATTCATATGGCTGCTGCATTAGCGCAGTTTTTTGTTGATAAATTTAAACTTACTACAAAATATGCTAATGC
GGTTATCTCTGCTGGGGCTGGAGCTGGACTAGCAGCTGCTTTTAATACCCCACTTTCTGGGATTATCTTTGTTATTGAAGAGATGAATAGAAAGTTTAGATTTAGTGTTTCGGCAATAAAGTGTGTGCTAGTAGCATGTA
TCATGAGTACAGTTATCTCTAGAGCTATTATGGGTAATCCTCCAGCAATACGCGTAGAAACTTTCAGCTCAGTACCACAAAATACTCTTTGGTTATTTATGGTATTAGGGATTATATTTGGTTATTTTGGTTTACTATTT
AACAAATCCTTAATCAAAGTGGCAAACTTTTTCTCAGAAGGCTCCAAGAAGAGGTATTGGACTTTAGTTATAATTGTTTGCATAATTTTTGGTATTGGTGTTGTTCTATCTCCAAATGCTGTTGGCGGTGGCTATATTGT
CATAGCAAATACTCTTGATTATAACTTATCAATCAAGATGCTTTTAGTGCTTTTTGTACTTCGTTTTGCTGGAGTTATTTTCTCATATGGCACCGGCGTTACTGGTGGGATATTCGCACCAATGATTGCGCTTGGTACTG
Our research interest
TTGGCTATCCTTTATATTTTAAGGGTTATTAGGATATTTTTTATTATGACTACATGGGATAAATGTTTAAAAAAAATAAAAAAAAACCTTTCTACGTTTGAGTATAAGACGTGGATAAAGCCTATCCATGTGGAGCAAAA
TAGTAACTTATTCACAGTTTACTGTAACAATGAATATTTCAAAAAACATATAAAATCTAAGTATGGAAATCTTATTTTATCAACAATCCAAGAGTGTCATGGTAATGATTTAATTATTGAATATTCTAATAAAAAATTCT
CTGGCGAAAAAATTACTGAGGTTATCACAGCTGGACCACAAGCTAATTTTTTTAGCACAACAAGTGTTGAGATAAAAGATGAATCAGAAGATACAAAAGTAGTACAAGAACCTAAAATATCAAAGAAGTCTAATAGTAAA
GACTTTTCTTCATCACAAGAGTTATTCGGTTTTGACGAAGCTATGCTAATTACAGCAAAAGAAGATGAGGAATACTCTTTTGGTTTACCGTTAAAAGAAAAATATGTTTTTGATAGTTTTGTTGTTGGAGATGCTAACAA
AATTGCTAGAGCAGCGGCTATGCAGGTATCGATAAATCCAGGTAAATTACATAACCCTTTATTCATTTATGGTGGTAGTGGTTTAGGTAAAACTCACTTAATGCAAGCAATAGGTAATCATGCAAGAGAAGTTAATCCTA
ATGCCAAAATTATTTATACAAATTCAGAACAATTTATTAAAGATTATGTAAATTCTATTCGTTTACAAGATCAAGATGAGTTTCAAAGAGTTTATAGATCTGCGGATATACTTTTGATTGATGATATTCAATTTATCGCT
GGTAAAGAGGGTACTGCTCAGGAGTTTTTCCATACTTTTAATGCATTGTATGAAAATGGTAAACAGATAATTCTAACTAGTGATAAGTATCCAAATGAAATAGAAGGGCTTGAAGAAAGACTAGTTTCGCGTTTTGGTTA
TGGTTTAACAGTTTCTGTTGATATGCCAGATTTAGAAACCAGAATTGCTATCTTGCTCAAAAAAGCTCATGATTTAGGTCAGAAATTACCTAACGAAACAGCAGCTTTTATTGCTGAGAATGTACGTACTAATGTCAGAG
AACTAGAAGGTGCTCTAAATAGGGTTCTTACTACCTCTAAATTTAATCATAAAGATCCTACTATCGAAGTAGCACAAGCTTGCTTAAGAGATGTTATAAAAATACAAGAAAAGAAAGTAAAAATAGATAATATCCAAAAG
GTTGTTGCTGATTTTTATAGAATCAGGGTAAAAGATTTAACTTCTAATCAAAGAAGTAGAAATATAGCTAGACCAAGACAGATAGCAATGAGTTTAGCACGTGAACTAACATCACATAGTTTGCCAGAAATAGGCAATGC
TTTTGGTGGTAGAGACCATACGACAGTTATGCATGCTGTCAAAGCTATAACTAAATTAAGACAAAGCAATACTTCAATATCGGATGATTATGAGTTGCTTTTAAATAAAATTTCTCGTTAAATAAAATTAGTAACTTTAT
CAAAGGGGTTTTAAAAAATGAATTTTGTACTAAATAGAGATGACTTACTAAAGCCTTTGCAATCTATGCTCTCAGTTGCAAATAGTAAGAGTACAATGCCTTTATTATCATGTATCTTATTTGATATTGATAATAATAAT
CTCAAAATTACGGCTTCGGATCTTGATACAGAGATATCATGCAATATAGCAGTTAGTTGTAACACAACTATTAAGTTAGCATTAAATGCTGACAAAATTTATAACATTGTCAGAAGCTTAAATGAAAATTCAATGATTGA
TTTTAGAATTAATGAAAATAAGGTAACTATTGTTTCTAATAATAGTACTTTTAACCTTATATCACTAAATGCTGACAACTATCCTCTTATTGATAGTAATATCAATGAGCAAGCAAGTTTTGATCTTTCTCAACAAGATT
TTCATCATATTATTTCAAAAGTAGATTTCTCAATGGCTAATGATGATACTCGATATTTCTTAAATGGGATGTTTTGGGAAATCAACGCAAATCTACTAAGAGCAGTATCTACAGATGGTCATAGAATGTCTATCACAGAG
GCTATAATTGATAGTAAAGTGTTAGATAGTGCTTCTCAGTCGATAATTCCAAAAAAAGCGATTTTAGAGCTTAAAAAGATAGTTGGCAAAACAGAAGAAAATATCAAAATTTGTCTTGGCAAAAATTATCTAAAAGCGAT
TTTTGGTAATTATGCTTTTATATCAAAGCTTATAGATGGTCGCTATCCTGATTACCAAAAAGTAATCCCTAAAAATAATACAAAACTATTAGCAGTTGATAAGCAGTTTTTCAAAAATTCATTATTAAGAACATCAATAC
TTGCTAATGATAAATATAAAGGTGTTCGTCTTAACATATCTCAAAATCAATTACTTCTATCAGCTAATAACCCTGATAATGAAAAGGCTGAAGATAAAATCGAAGTTCAATATAATGATCAACCAATGGAAATTTGTTTT
AATTACAAATATCTTTTGGATATTATAAATGTACTTAGTGAAGAAACTATGTCTATCTACCTTGATAATCCAAATATGAGTGCTTTAGTTAAAGATGAGAAAGATAATAGTTTGTTTATTATTATGCCAATGAAAATTTA
AGTAATAAGTAGTTTTAGGAAATAACTATTTTTATAAGCCTTTTGGAATGAATAATAAAGCAATAAAAAAAGGTATGCATAAAAACATTATATAGAAAGCTGGGATTAGATAATTTCCAGTAGTAGTAATTAATAAAGTC
ATAAGAAAGGCAACAGTACCTCCAAATAAAGAAACGCTTATATTAAAACTTATTCCAAAACCAGTATTTCTATTTCTAACAGGAAATAGCCCTGCCGTATTTGCAAATATAGGTCCTATTACAGCACCACTTATAATAGC
AAGAGAAAAAATAGCTATAGATACTAATTGATGATTTTTTATAATAATTTGGTATATTGGTAAAACAGCTATAAATAAGACTATACAAGAATACATCAGAACTTTTTTACCACCAATTCTATCAGCAATATATCCAAATA
TAATTGAAGAAAGCATTAATACTATAGTTAATCCGAGAGTATTTTGTTAAACACATAAGAGATAGCATCACAAAATTTTTCAAAACTATTATTCACTTTTCTAAATATTTTTTTAAAGTTAGCCCAAACCTTTTCAATAG
GATTTAAATCTGGAGAATACGGAGGTAGATATAATATTTGTACATCAAATTTATTGGCTATTTCAATCAGCTTAGAGGATTTATGGAAACTAGCATTATCCATTACTATAGTAGTTTTAGGTTTTAATGATGGGCATAAG
TGTTCCTCAAACCATTGATTAAAAATTTCAGTATTGGTATATCCACTGTACTCTAATGGAGCTATAATCTTTTTATCTGCATAATTATATCCAGCAACAATACTTCTTCTTTGTGTTTGATATGCTAAAACCTCACCATA
ACTAGGCTCACCAATTAGTGACCATCCTCTTAGGATAGAAAGCTTATTGTCACACCCCATCTCATCTATATAAAATAACAAGTTTTGAGCTATTTCTTTTAGTTTTTCTATATACTCCAACCTTTCATGTTCTTTTCTTT
GCTTATATTTTGGAGTCTTTTTTTAAAACTAAAACCAAGTCTATTAAGACAATCATAAAATGTACTTCTTGGAATATCAGGGGCTAATGCTTCTTTTATATCTAATGCACTTGCATCTGGATGATCTATCAAATACTGTT
CAATCAATGTTTTATCGGTAAAGCTAGCGACTCTGCCACAACCAACTCCTTGCTTTGAACTATAATCTCCGGTTCTTTTATAAAACTCTATCCATGAAACAACTGTACGCTTATCTATGTTAAAAAACTTACTCAGCTCG
AACTCCGTCATACCTTCTTCATATTTATTAATTACGATGTCTCTAAAATATTGGCTATATGATGGCATTTTTATTAGACATTATAACATTTCTACAAATATCTTTTTCTACAAATATCTTTCGGATTAACTATATAAGTA
GAGTCAACAACCATCCAAATCACCCAATTATCTATAATTTTCTGCTTGCTAAAAAAACGCATACCAATGATGCTACACTTGTAAAACCATCCATATATGGCGTTGTTGAATCAGTATAAAATATAAGTAGTTGCGAAACT
AGTAACCCAAATACTACAATGCTTACTAGAACTTTTAACCAACCAATGATTTTAAGTCTATGAACAACTATCTTTTTGTGACTAAAGTTGGGTTGCCAACTATACCAACCGTATCCAAAGCTAAATAAAAGAATCATCTG
CAATATAGCATCGGCATATAGTCCACTAACAGAAAATAAACCCGCACTCATGATCAAACCAACTATCTCCACAGGCCAACCAATGACATAAAGCCTTGCCAGCAAAAAGGTACACAAAAGATTAACAATCATTGTACAAA
AATCAAAAATATGCAGCATATTTATTTTACTAATCAAAGTATTATAAATATTATAATAACTTTGAAGTTGGCGTATTAAAGCCATAAACTTTAGTAGGTTAGTGTTTATACCAATATTTTGAGATGCTTTCTGCAAGCTA
ATAACATTTAGCTATCTAGCCTAAATAATTAATATACAAAACTTTCAAGCTTATTGAATTTTTCAACAGATACAGCGCGTTATAACAAATAAGTAATTGACTAAATTAAAAAGCAAGTATAATATCGATTGTGTTTATTA
CATAATATAAAACGAGGATAAAAAAAATATGAAATTAAGAAAAGTATTAATCGCGACATTATTAGGAGCTTCTGCTTTATCTTTAAGTAGTTGTTGGTTACTTGTTGGTGCAGCTGTTGGTGGTGGAACTGCTGCGTATA
TTTCTGGTGAGTATTCAATGAATATGAGTGGCAGTGTAAAAGATATTTACAATGCTACTTTAAAAGCTGTTCAAAGCAATGATGATTTTGTAATTACTAAAAAATCTATTACTTCTGTTGATGCAGTTGTTGATGGTAGT
ACTAAGGTAGACTCAACAAGTTTCTATGTTAAAATAGAAAAACTTACTGATAATGCTTCAAAAGTTACAATTAAGTTTGGTACTTTTGGTGACCAAGCAATGTCAGCAACATTAATGGATCAAATCCAAAAGAATCTTTA
ATTAAATAGGTAATTACTATAATGACTTTTCTAAAGAAAGCTTTTATTGCAACTATAGTTTCTATTTCAGCATTAGTTCTAAATAGTTGTATTGTTGCAGCAATAGCTGTTGGTGGTGGAACAGTTGCCTATATTGATGG
AAATTATTTTATGAATATAGAAGGCAACTATAAAGCTGTCTATAAAGCTACTCTTAAAGCTATTAATGATAATAATGACTTTGTTCTAGTATCAAAAGATCTTGATCAAACAAAGCAAAATGCCGACATTGAAGGTGCTA
CTAAAATTGATAGTACGAGTTTTAGTGTCAAAATTGAAAGACTGACAGATCAGGCTACTAAAGTGACAATCAAATTTGGTACTTTTGGCGATCAAGCAATGTCATCAACATTAATGGATCAGATCCAGGCAGCTGTACAT
AAAGCTTCTTAGAAATGTACAAAAAACTCTACTTAATTATATTATCCACAATAATCGCAATCTCTCTTAATAGTTGTGTTGTTGCCGCTGTTTTAATTGGTACAGCAGTTGTTGCTGGAGGTACAGTATATTACATCAAT
GGTAACTATATAATCGAAGTCCCTAAAGATATTAGAAGTGTATACAATGCTACAATCAAGACTATACAGATGGATAGTCAAAATAAACTAATAAGTCAAACCTATAATACTAAATCTGCTATAATTAAAGCTTTACAAAA
AGGTGAAAAAATTAGTATAGATTTAAGCAATATTGATAGTCGTTCAACAGAGATAAAAATTCGTATAGGTGTACTTGGCGATGAGAAAAAATCTGCTGATTTAGCAAACTCAATAACAAAAAATATCACCTAAGCAATAT
TTCTCGAACTTTGGTTAACTTTTTCTTTTTAAAAACTTTCAAAAATGTATAATTTGTGTTAGTTTGCAAACTACCCTTATATCCATAATGAGTAATAAGGTATTAGATACATATTATAAAAACAATCGACATATTTGGGT
GCTAGTACTATCTGGTGCTGTTATAGGCACAATGATTGGTCTTCTAGCAACAGCATTTCAGCTACTCCTAGACTTTATTTTTAAAATTAAGCTGGCTCTTTTTTCTTTCAGTGGTGGTAATCTTTTTATCGAAATCGCTA
TGTCAATATCATTAAGTATTGTGATGGTATTAATTTCGATTTTTATTGTTAAAAAATTTGCGAAAGAGGCTGGTGGTAGCGGTATCCAAGAGGTTGAGGGTGCTTTAAAAGGCTGCCGCAAAATACGTAAAAGAGTTATG
CCCGTGAAGTTTATAAGTGGACTTTTTTCGTTAGGCTCAGGTTTAAGTTTAGGTAAAGAGGGACCATCAATTCATATGGCTGCTGCATTAGCGCAGTTTTTTGTTGATAAATTTAAACTTACTACAAAATATGCTAATGC
GGTTATCTCTGCTGGGGCTGGAGCTGGACTAGCAGCTGCTTTTAATACCCCACTTTCTGGGATTATCTTTGTTATTGAAGAGATGAATAGAAAGTTTAGATTTAGTGTTTCGGCAATAAAGTGTGTGCTAGTAGCATGTA
TCATGAGTACAGTTATCTCTAGAGCTATTATGGGTAATCCTCCAGCAATACGCGTAGAAACTTTCAGCTCAGTACCACAAAATACTCTTTGGTTATTTATGGTATTAGGGATTATATTTGGTTATTTTGGTTTACTATTT
AACAAATCCTTAATCAAAGTGGCAAACTTTTTCTCAGAAGGCTCCAAGAAGAGGTATTGGACTTTAGTTATAATTGTTTGCATAATTTTTGGTATTGGTGTTGTTCTATCTCCAAATGCTGTTGGCGGTGGCTATATTGT
CATAGCAAATACTCTTGATTATAACTTATCAATCAAGATGCTTTTAGTGCTTTTTGTACTTCGTTTTGCTGGAGTTATTTTCTCATATGGCACCGGCGTTACTGGTGGGATATTCGCACCAATGATTGCGCTTGGTACTG
Now, 1366 genomes are sequenced or being
sequenced
Evolution of genome sizes

In Mb

Thale cress (Arabidopsis thaliana):
150

Fruit fly (Drosophila melanogaster):
160

Pufferfish (Takifugu rubripes):
400

Human (Homo sapiens):
3,000

Onion (Allium cepa):
16,750

Tiger salamander (Ambystoma tigrinum):
32,000

Marbled lungfish (Protopterus aethiopicus):
132,000
http://www.rbgkew.org.uk/
What's in the genome
Genome
Exon
UTR
Annotated genes
Intron
Cis-regulatory elements
Selfish elements
Dead genes (pseudogenes)
Novel genes
"Non-genic": repetitive elements

E.g. Human genome
 Exons take up?
 Introns account for?
 Repetitive elements occupy?
 Unknown?
A
B
C
1%
24%
25%
24%
1%
25%
35%
60%
45%
40%
15%
5%
Venter et al. (2001) Science 291:1304
What are in the unknown regions?
 Investigate with tiling array
cDNA array
Tiling array
Gap size: 10bp
Probe size: 25bp
 Number of features:
 Arabidopsis, 135Mb, 1 chip, ~6x106 features
 Human, 3Gb, 7 chips, ~4.2x107 features
"Non-genic": unannotated genes

Tiling array analysis of human Chr 21, 22
Kapranov et al., 2002. Science
Tiling array analysis of human transcriptome

Human Chr 21, 22

What do you think these expressed regions represent??
Kapranov et al., 2002. Science
Difficulties for coding gene prediction
 Training data
 You need to know something...
 “Biased” toward the properties of the majority.
 Real genes that are shorter tend to be much harder to
predict.
Table 3 Accuracy of GISMO, Glimmer and CRITICA in
predicting short genes (<300 bp)
Gene finder
Cor
Sn
Snfk (%)
Sp
GISMO
0.64
63.0
86.4
69.0
Glimmer
0.54
72.0
83.7
44.0
CRITICA
0.60
46.0
67.4
84.0
Snfk denotes the sensitivity in detecting function-known genes.
Krause et al., 2006. Nucleic Acid Res. 35:540
Novel coding sequence identification

Arabidopsis thaliana as an example
 135Mb, ~50% occupied by annotated
genes.
 Focus on coding sequences 90-300bp
long.

What would you do next to eliminate
ORFs that are likely false predictions?
133,090 sORFs
Criterion 1: Codon usage bias

Some codons are used more frequently than others
http://www.cbs.dtu.dk/services/GenomeAtlas/
Criterion 1: Codon usage bias


For example: codons for proline
NCDS
CDS
CCT
0.25
0.12
CCC
0.25
0.49
CCA
0.25
0.06
CCG
0.25
0.33
Suppose you have the following 2 sequences both code for
poly-leucine, which one is more likely to be real coding
sequence?
Seq1
CCT CCA CCT
p(CDS | Seq1)  0.12  0.06  0.12  8.6 104
Seq2
CCC CCG CCC
p(CDS | Seq 2)  0.49  0.33  0.49  7.9 102
Posterior probability of coding sequence

Compare known non-coding and coding sequences
Hanada et al., 2007. Genome Res.
Posterior probability of coding sequence

Scanning Arabidopsis genome
Hanada et al., 2007. Genome Res.
After applying the first criterion
7,442 coding sORFs
How good is the CDS finding measure

For the training data

For 18 Arabidopsis small protein genes
 All 18 are predicted as CDS.

For 84 yeast small protein genes
 All 84 are predicted as CDS.
So what does this mean?

If a sequence is a true coding sequence
 Our approach can predict them with high accuracy.
 So, the sensitivity is very good.

Is this good enough??

What about specificity?
 Namely, how good is the criteria in excluding false positives?
Criterion 2: Expression
 Which of the following distribution more likely
depicting the expression level distribution of true CDS
compared to that of false CDS?
Tiling array
Frequency
Gap size: 10bp
Probe size: 25bp
Expression level
Low
High
Comparison of expression levels

Exon, intron, tRNA, rRNA, our predictions
A: Exon
B: Intron
C: Prediceted novel CDS
D: tRNA
E: rRNA
Applying the second criterion

Prediction significantly enriched in expressed sequences
2,996 transcribed
sORFs
Criterion 3: Purifying selection

Compare known coding and non-coding sequences
Ka
w
Ks
K a : non - synonymous substituti on rate
K s : synonymous substituti on rate
w  1 : negative (purifying ) selection
w  1 : selectivel y neutral
w  1 : positive selection
Criterion 3: Purifying selection

Compare known coding and non-coding sequences
In the end,

We found a large number (941) small ORFs have the
following three properties:
 They have nucleotide composition similar to known coding
sequences.
 They are expressed.
 They are subjected to selection in a fashion a protein sequence
would be selected.

Take home message:
 We don't know the functions of all any of these.
 The view that most of the "intergenic" region is junk DNA may
be wrong.
Acknowledgement

Current and past lab members
Kousuke Hanada
Melissa Lehti-Shiu
Cheng Zou

TIGR
 Chris Town
 Hank Wu

University of Chicago
 Wen-Hsiung Li
 Justin O. Borevitz
 Xu Zhang

Funding
Related documents