Download functional - Stanford Computer Forum

GGTGCCAGGGAAAGGGCAGGAGGTGAGTGCTGGGAGGCAGCTGAGGTCAACTTCTTTTGAACTTCCACGTGGTATTTACTCAGAGCAATTGGTGCCAGAG
GCTCAGGGCCCTGGAGTATAAAGCAGAATGTCTGCTCTCTGTGCCCAGACGTGAGCAGGTGAGCAGCTGGGGCGAAAGACCTGTTGGAGGCTATGAATGC
AATCAAGGTGACAGACAACTGGTGCAATGATGGTAGTGGAAATGGAGGAGAGGGGATTGATTCAAGATGCATTTAGGACCAAGAATCGGGAGCTTGTGAA
CGTGTGTATGAGTACTGTAGACGGAGTGGGTGTGTCATCAGAGAAGATCTGAGCATTTGGGCTTGCTCTCCTCAGAGGCCCTGCGAGTGGAGTTCAGCTT
TTCCTCATGGGGCAAATCTCACTTTCGCTCCAGTTCCTGGGGCTCAGAGTCCCTGGCCCAGATGCCTCTTGCCATCTCATCTTCACCCTGCCTGGCTTCC
CTTGCTTGTTCCAGGATTGTTTCATAAAGAGGGATGTGGTTGGTCTTTAACCCTATGAATGCTGGCTGAGGATGCCTGCGGAACCTGTAGTGAAGCTTTC
AGGGGCTGCTCGGGTTCTGGCTGGTAGGTGAACACTGTCCATCTTGCCGGCTGGGACACAGTGACTCTGGGTAGTTGTGTAAGAGAGGGGCCCTTGGCAG
ACAAACAGGTTCTTCTCTGTTGGTGGGCCAGCCAGCAGGTCAGTGGGAAGGTTAAAGGTCATGGGGTTTGGGAGAACTGGGTGAGGAGTTCAGCCCCATC
CCCCGTAAAGCTCCTGGGAAGCACTTCTCTACTGGGGCAGCCCCTGATACCAGGGCACTCATTAACCCTCTGGGTGCCAGGGAAAGGGCAGGAGGTGAGT
GCTGGGAGGCAGCTGAGGTCAACTTCTTTTGAACTTCCACGTGGTATTTACTCAGAGCAATTGGTGCCAGAGGCTCAGGGCCCTGGAGTATAAAGCAGAA
TGTCTGCTCTCTGTGCCCAGACGTGAGCAGGTGAGCAGCTGGGGCTGTCTGCTCTCTGTGCCCAGACGTGAGCAGGTGAGCAGCTGGGGCTGTCTGCTCT
CTGTGCCCAGACGTGAGCAGGTGAGCAGCTGGGGCTGTCTGCTCTCTGTGCCCAGACGTGAGCAGGTGAGCAGCTGGGGCTGTCTGCTCTCTGTGCCCAG
GAAAGACCTGTTGGAGGCTATGAATGCAATCAAGGTGACAGACAACTGGTGCAATGATGGTAGTGGAAATGGAGGAGAGGGGATTGATTCAAGATGCATT
TAGGACCAAGAATCGGGAGCTTGTGAACGTGTGTATGAGTACTGTAGACGGAGTGGGTGTGTCATCAGAGAAGATCTGAGCATTTGGGCTTGCTCTCCTC
AGAGGCCCTGCGAGTGGAGTTCAGCTTTTCCTCATGGGGCAAATCTCACTTTCGCTCCAGTTCCTGGGGCTCAGAGTCCCTGGCCCAGATGCCTCTTGCC
ATCTCATCTTCACCCTGCCTGGCTTCCCTTGCTTGTTCCAGGATTGTTTCATAAAGAGGGATGTGGTTGGTCTTTAACCCTATGAATGCTGGCTGAGGAT
GCCTGCGGAACCTGTAGTGAAGCTTTCAGGGGCTGCTCGGGTTCTGGCTGGTAGGTGAACACTGTCCATCTTGCCGGCTGGGACACAGTGACTCTGGGTA
GTTGTGTAAGAGAGGGGCCCTTGGCAGACAAACAGGTTCTTCTCTGTTGGTGGGCCAGCCAGCAGGTCAGTGGGAAGGTTAAAGGTCATGGGGTTTGGGA
GAAACTGGGTGAGGAGTTCAGCCCCATCCCCCGTAAAGCTCCTGGGAAGCACTTCTCTACTGGGGCAGCCCCTGATACCAGGGCACTCATTAACCCTCTG
GGTGCCAGGGAAAGGGCAGGAGGTGAGTGCTGGGAGGCAGCTGAGGTCAACTTCTTTTGAACTTCCACGTGGTATTTACTCAGAGCAATTGGTGCCAGAG
GCTCAGGGCCCTGGAGTATAAAGCAGAATGTCTGCTCTCTGTGCCCAGACGTGAGCAGGTGAGCAGCTGGGGCGAAAGACCTGTTGGAGGCTATGAATGC
AATCAAGGTGACAGACAACTGGTGCAATGATGGTAGTGGAAATGGAGGAGAGGGGATTGATTCAAGATGCATTTAGGACCAAGAATCGGGAGCTTGTGAA
CGTGTGTATGAGTACTGTAGACGGAGTGGGTGTGTCATCAGAGAAGATCTGAGCATTTGGGCTTGCTCTCCTCAGAGGCCCTGCGAGTGGAGTTCAGCTT
TTCCTCATGGGGCAAATCTCACTTTCGCTCCAGTTCCTGGGGCTCAGAGTCCCTGGCCCAGATGCCTCTTGCCATCTCATCTTCACCCTGCCTGGCTTCC
CTTGCTTGTTCCAGGATTGTTTCATAAAGAGGGATGTGGTTGGTCTTTAACCCTATGAATGCTGGCTGAGGATGCCTGCGGAACCTGTAGTGAAGCTTTC
AGGGGCTGCTCGGGTTCTGGCTGGTAGGTGAACACTGTCCATCTTGCCGGCTGGGACACAGTGACTCTGGGTAGTTGTGTAAGAGAGGGGCCCTTGGCAG
ACAAACAGGTTCTTCTCTGTTGGTGGGCCAGCCAGCAGGTCAGTGGGAAGGTTAAAGGTCATGGGGTTTGGGAGAACTGGGTGAGGAGTTCAGCCCCATC
CCCCGTAAAGCTCCTGGGAAGCACTTCTCTACTGGGGCAGCCCCTGATACCAGGGCACTCATTAACCCTCTGGGTGCCAGGGAAAGGGCAGGAGGTGAGT
GCTGGGAGGCAGCTGAGGTCAACTTCTTTTGAACTTCCACGTGGTATTTACTCAGAGCAATTGGTGCCAGAGGCTCAGGGCCCTGGAGTATAAAGCAGAA
TGTCTGCTCTCTGTGCCCAGACGTGAGCAGGTGAGCAGCTGGGGCTGTCTGCTCTCTGTGCCCAGACGTGAGCAGGTGAGCAGCTGGGGCTGTCTGCTCT
CTGTGCCCAGACGTGAGCAGGTGAGCAGCTGGGGCTGTCTGCTCTCTGTGCCCAGACGTGAGCAGGTGAGCAGCTGGGGCTGTCTGCTCTCTGTGCCCAG
Dark Matter of the Human Genome
Gill Bejerano
Dept. of Developmental Biology
Dept. of Computer Science
Stanford University
http://bejerano.stanford.edu
This is “the Century of Biology”
We can now cast Biology in “our” terms
strings
time series
circuits
We can Harness It / We Can Study It
Bioengineering
Synthetic Biology
Embryonic Development: one cell → organism
Enter DNA ...
DNA: Functional and Non-Functional
DNA = linear molecule that carries instructions for making
living organisms ~ long string(s) over a small alphabet
Alphabet of four {A,C,G,T}
Strings of length 10^4-10^11
...ACGTACGACTGACTAGCATCGACTACGACTAGCAC...
[Figure labels along the string: “junk” DNA; genetic instructions (how to..., when to..., where to...); “junk” DNA]
One Cell, One Genome, One Replication
Every cell holds a copy of all its DNA = its genome.
The genome is replicated every cell division.
The human body is made of ~10^14 cells.
All originate from a single cell through repeated cell divisions.
[Figure: egg cell (genome = all DNA, a DNA string) → cell division → chicken; chicken (DNA) ≈ 10^14 copies of egg (DNA)]
Genes = How to make Proteins
[Figure labels: gene, DNA, cell; proteins are “the workhorses of every living cell”]
DNA Replication is Imperfect
Medium Scale: substrings are duplicated, deleted, inverted
Large Scale: whole DNA strings are duplicated, deleted
[Figure: a functional substring flanked by junk]
...ACGTACGACTGACTAGCATCGACTACGA...
substring duplication gives two functional copies:
...ACGTACGACTGACTAGCATCGACTACGA........TCTGACTAGCATCGACTACGA...
divergence turns them into functional’ and functional’’:
...ACGTACGACTGACTAGCATCGACTACGA........TCTGACTAGCATCGACTACGA...
So...More Genes...More Complexity!...Right?
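The duplication-then-divergence picture on this slide can be sketched in a few lines of Python. This is only an illustration: the function names, the substitution-only mutation model, the 10% rate and the choice of which span counts as "functional" are assumptions, not part of the talk.

```python
import random

ALPHABET = "ACGT"

def duplicate_substring(genome, start, end, insert_at):
    """Copy genome[start:end] and paste the copy at position insert_at."""
    return genome[:insert_at] + genome[start:end] + genome[insert_at:]

def diverge(seq, rate, rng):
    """Substitute each letter independently with probability `rate`.

    Assumed toy model: substitutions only, no insertions or deletions.
    """
    out = []
    for letter in seq:
        if rng.random() < rate:
            out.append(rng.choice([c for c in ALPHABET if c != letter]))
        else:
            out.append(letter)
    return "".join(out)

rng = random.Random(0)                    # fixed seed so runs are repeatable
genome = "ACGTACGACTGACTAGCATCGACTACGA"   # string shown on the slide
start, end = 8, 28                        # assumed span of the functional substring

# Medium-scale event: the functional substring is copied to the end.
duplicated = duplicate_substring(genome, start, end, len(genome))

# Small-scale events: each copy then accumulates its own substitutions.
copy_a = diverge(duplicated[start:end], 0.1, rng)
copy_b = diverge(duplicated[len(genome):], 0.1, rng)

print(duplicated)
print(copy_a)
print(copy_b)
```

Once the two copies drift independently, one of them is free to take on a modified role while the other keeps the original one.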
1. Gene number does not correlate with Complexity
Gene families are important.
Many are surprisingly old.
But 10^14 cells vs. 10^3 cells
[Figure: # genes for fly, worm, human, weed, fish, rice]
pre-genomic era: “100,000 genes to the human genome”
DNA Replication is Imperfect (contd)
Small Scale: single letters are substituted, erased, added
[Figure: ...ACGTACGACTGACTAGCATCGACTACGA... passed from chicken to egg to chicken, with small edits (TT, CAT)]
junk: “anything goes”
functional: many changes are not tolerated
thus, sequence conservation over generations implies function!
Sequence Conservation implies Function
Comparative Genomics of Distantly related species:
human: ...CTTTGCGA-TGAGTAGCATCTACTATTT...
mouse: ...ACGTGGGACTGACTA-CATCGACTACGA...
(both descend from the mammalian ancestor; the conserved stretch is labeled "functional region!")
(but which function/s?...)
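To put a number on the human/mouse comparison shown on this slide, one can count identical columns in the two aligned rows. The sketch below uses the two strings from the slide; the helper name `aligned_identity` and the decision to count gap columns as mismatches are assumptions made for illustration.

```python
def aligned_identity(a, b):
    """Return (identical columns, total columns) for two aligned rows.

    Columns where either row has a gap '-' are counted as mismatches
    (an assumed convention; other conventions skip gap columns).
    """
    if len(a) != len(b):
        raise ValueError("aligned rows must have equal length")
    matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return matches, len(a)

human = "CTTTGCGA-TGAGTAGCATCTACTATTT"
mouse = "ACGTGGGACTGACTA-CATCGACTACGA"

matches, columns = aligned_identity(human, mouse)
print(f"{matches} of {columns} columns identical")
```

The identical columns are not spread evenly: they concentrate in the middle of the alignment, which is the stretch the slide marks as functional.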
2. Human Genome full of Conserved Non-Coding Elements
Human Genome: 3*10^9 letters
1.5% known function
>50% junk
compare to other species
>5% human genome functional
3x more functional DNA than known!
~10^6 substrings do not code for protein
What do they do then?
[Science 2004 Breakthrough of the Year, 5th runner up]
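The "3x more functional DNA than known" claim follows from the percentages on this slide. A quick check, treating the slide's ">5%" as exactly 5% (an assumption that makes the result a lower bound); the variable names are illustrative:

```python
GENOME_LETTERS = 3 * 10**9       # human genome size from the slide

# Integer arithmetic avoids floating-point noise in the letter counts.
known_function = GENOME_LETTERS * 15 // 1000   # 1.5% with known function
conserved = GENOME_LETTERS * 5 // 100          # 5% inferred functional (lower bound)

ratio = conserved / known_function
unexplained = conserved - known_function

print(f"known function:  {known_function:,} letters")
print(f"conserved:       {conserved:,} letters")
print(f"ratio:           {ratio:.1f}x")
print(f"unexplained:     {unexplained:,} letters")
```

At 5% the ratio comes out slightly above 3, so the slide's "3x" is a rounded lower bound.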
Gene regulation = when/where to make protein
[Figure labels on the DNA: gene (how to); control region (when & where), effective region ~10^3 letters; recognition site ~10 letters/protein]
Vertebrate Gene Regulation
gene (how to)
control region (when & where)
effective region ~10^6 letters!!!
DNA
(~10^3 letters)
3. Most Non-Coding Elements are likely cis-regulatory
“IRX1 is a member of the Iroquois homeobox gene family.
Members of this family appear to play multiple roles
during pattern formation of vertebrate embryos.”
9Mb
4. Regulatory regions drive morphological diversity
Gene numbers do not correlate with
organism complexity.
Many gene families are surprisingly old.
“Regulatory sequence evolution must be
the major contribution to the evolution of
form.” [Carroll, Wilson memorial lecture, PLoS Biol, 2005]
[Figure: # genes for fly, worm, human, weed, fish, rice]
The Writing on the Wall…
gene deserts
regulatory jungles
25,000
1,000,000
A Computational Question
[Figure: related elements (75%id over 200bp) next to related genes; the same element in human, mouse and rat (96%id over 200bp; 95%id over 200bp)]
Classical Biological approach: experiment to understand these regions
Computational approach: how many similar regions or better are there?
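The computational question ("how many similar regions or better are there?") can be phrased as a sliding-window scan over a pairwise alignment. The sketch below is a minimal version under stated assumptions: the function name, the integer-percent threshold and the fact that overlapping windows are counted separately are illustrative choices, not the method used in the talk.

```python
def count_conserved_windows(a, b, window=200, min_identity_pct=95):
    """Count windows of `window` aligned columns with identity >= min_identity_pct.

    a and b are two rows of an alignment (equal length, '-' for gaps).
    Overlapping windows are counted separately; a real analysis would
    usually merge them into regions.
    """
    if len(a) != len(b):
        raise ValueError("aligned rows must have equal length")
    match = [x == y and x != "-" for x, y in zip(a, b)]
    if len(match) < window:
        return 0

    def passes(matches_in_window):
        # Integer comparison, so no floating-point rounding at the threshold.
        return matches_in_window * 100 >= min_identity_pct * window

    current = sum(match[:window])
    count = int(passes(current))
    for i in range(window, len(match)):
        # Slide the window one column right: add the new column, drop the old.
        current += match[i] - match[i - window]
        count += int(passes(current))
    return count

# Toy usage with a short window so the result is easy to verify by hand.
row_1 = "ACGTACGTAC"
row_2 = "ACGTTCGTAC"   # one mismatch at column 4
print(count_conserved_windows(row_1, row_2, window=4, min_identity_pct=75))
print(count_conserved_windows(row_1, row_2, window=4, min_identity_pct=100))
```

With the defaults (`window=200`, `min_identity_pct=95`) the scan asks the slide's question directly: how many 200bp windows are at least as conserved as the example element.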
Ultraconserved Elements
Hundreds of long substrings identical between human-birds
→ they must have rejected many different changes.
But... all functions we understand in our genome are
encoded using redundant codes.
[Bejerano et al., Science 2004]
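Finding "long identical substrings" in an alignment amounts to collecting maximal runs of gap-free identical columns. The following is a sketch, not the published pipeline: the function name is hypothetical, and the default `min_length=200` is an assumption borrowed from the 200bp windows mentioned earlier in the talk.

```python
def identical_runs(a, b, min_length=200):
    """Return (start, end) spans, end exclusive, of maximal runs where two
    aligned rows are identical and gap-free for at least min_length columns."""
    if len(a) != len(b):
        raise ValueError("aligned rows must have equal length")
    runs = []
    start = None
    for i, (x, y) in enumerate(zip(a, b)):
        same = x == y and x != "-"
        if same and start is None:
            start = i                      # a new run begins here
        elif not same and start is not None:
            if i - start >= min_length:    # run just ended; keep it if long enough
                runs.append((start, i))
            start = None
    if start is not None and len(a) - start >= min_length:
        runs.append((start, len(a)))       # run that reaches the end of the rows
    return runs

# Toy usage on the short human/mouse alignment shown earlier in the talk,
# with a small threshold because the example is only 28 columns long.
human = "CTTTGCGA-TGAGTAGCATCTACTATTT"
mouse = "ACGTGGGACTGACTA-CATCGACTACGA"
print(identical_runs(human, mouse, min_length=4))
```

Extending the same scan to more than two species only requires that every row agree in a column before it counts as identical.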
E.g. Protein Coding Genes:
DNA – 10^8 letters over alphabet of 4.
Protein – 10^2 letters over alphabet of 20.
Coding: 3 DNA letters → 1 Protein letter.
[Bejerano et al., Science 2004]
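The redundancy of the protein code is easy to see by building the codon table: 4^3 = 64 DNA triplets map onto 20 protein letters plus a stop signal. The sketch below uses the standard genetic code written in its compact one-string form; the helper names `CODON_TABLE` and `translate` are illustrative.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in the order
# TTT, TTC, TTA, TTG, TCT, ... ('*' marks a stop codon).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna):
    """Translate DNA three letters at a time; trailing letters are ignored."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))

# How many different codons encode each protein letter?
degeneracy = Counter(CODON_TABLE.values())
print(sorted(degeneracy.items(), key=lambda item: -item[1]))

# Two different DNA strings that encode the same protein.
print(translate("ATGGCTCTT"), translate("ATGGCACTG"))
```

Because most protein letters have several codons, a coding region can change at many positions without changing its protein, which is why perfect identity over hundreds of letters is not what a protein-coding constraint alone would predict.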
Computational Hypotheses
Based on public domain genome wide data:
ultraconserved elements: one subset codes protein, a larger subset does not
generate testable hypotheses for function from existing knowledge (2004)
[Pennacchio et al., Nature, 2006]
Validating Regulatory Elements
transgenic: Conserved Element + Minimal Promoter + Reporter Gene
where is the reporter gene expressed?
wild type:
where is the wild type gene expressed?
ultraconserved elements
Origins of Ultraconserved Elements?
Genomic Distribution of Ultraconserved Elements
•exonic
•non
•possibly
Uniquely Abundant in Coelacanth
Up to 80%id between Coelacanth instances
and some human instances, incl. uc.338.
100 diverged copies in a Gigabase
60 highly similar copies in a Megabase
Repeats / Mobile Elements ("selfish DNA")
Human Genome: 3*10^9 letters
1.5% known function
>50% junk
The LF SINE (for Lobefin Fish / “Living Fossil”)
not similar to any known repeat
out
Reconstruction
back
target site
duplications
>360My Old and Going Strong
Up to 80%id between Coelacanth SINE
and some human instances, incl. uc.338.
Cis-reg & Ultra elements from Mobile Elements
Co-option event, probably due to favorable genomic context
All other copies are destined to decay over time at a neutral rate
[Yass is a small town in New South Wales, Australia.]
[Bejerano et al., Nature 2006]
Exapted Into Which Cellular Roles?
No evidence for Transcription (Tx) as small RNAs,
no orientation preference in introns, not in antisense Tx.
Human instances cluster together,
found <1Mb from 35 TFs (P<3*10^-6).
Instance 500kb Downstream of ISL1
1Mb
ISL1 is a neuro-developmental gene, also expressed in testis.
Three previously known enhancers are conserved across vertebrates.
Repeat made Regulatory Region
in situ
transgenic: Conserved Element + Minimal Promoter + Reporter Gene
Co-option into Different Roles
[Figure: repeat co-opted into protein coding or gene regulating roles]
The Co-Optionome
quantify co-option
transposition event
functional elements
LF-SINE, DeuSINE, MER121, …
[Lowe, Bejerano & Haussler, Submitted]
From junk DNA to pathway recruitments?
[Davidson & Erwin, 2006]
[Britten & Davidson, 1971]
Bejerano Lab: Marry Development & Genomics
Origins & Evolution
Functions & Encoding
Contribution to Human Disease
Kudos
UC Santa Cruz: David Haussler,
Sofie Salama, Jim Kent, Craig Lowe,
Sol Katzman, Andy Kern, Everyone…
Berkeley Labs (LBNL): Eddy Rubin, Nadav Ahituv
Stanford faculty, and Bejerano Lab:
Gill Bejerano
[email protected]
http://bejerano.stanford.edu
Cory McLean (CS)
Sarah Aerni (BMI)
Phil Lacroute (EE)
Shoa Clarke (MSTP)
Abe Bassan (DEVBIO)
Marina Sirota (BMI)