Computing non-coding
cis-regulatory DNAs
Michele Markstein
IEEE CSB 2003
Stanford University
August 11, 2003
OUTLINE
(first-half)
1. Brief Review of Central Dogma (DNA -> RNA -> Protein)
base-pairing, gene architecture, transcription, translation
2. Landscape of the Human Genome
3. Cis-regulation
Enhancers, Insulators, Chromatin Boundaries
BASE PAIRING
DNA serves as a template for DNA and RNA
The Building Block of DNA is the NUCLEOTIDE
[Figure: a nucleotide (phosphate P, sugar S, BASE) and two anti-parallel strands with 5’ and 3’ ends marked, joined by A-T and C-G base pairs; one strand is labeled Template Strand]
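The pairing rules on this slide (A with T, C with G, and U in place of T in RNA) can be sketched in a few lines of Python. The function names are illustrative, and every sequence is assumed to be written 5’ to 3’.

```python
# Watson-Crick pairing rules for DNA; all sequences are written 5'->3'.
DNA_PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(dna):
    """Return the strand that base-pairs with `dna`, also read 5'->3'."""
    return "".join(DNA_PAIR[base] for base in reversed(dna))


def transcribe(template):
    """Return the mRNA (5'->3') copied from a DNA template strand (5'->3').

    The mRNA pairs with the template, so it is the template's reverse
    complement with U substituted for T.
    """
    return reverse_complement(template).replace("T", "U")


print(reverse_complement("ATGC"))
print(transcribe("TTTCAT"))
```

Because the two strands are anti-parallel, the partner strand is the reverse complement rather than the plain complement.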
Gene Architecture and the Central Dogma
[Figure: DNA (TATA, exon 1, intron 1, exon 2, intron 2, exon 3) -> Transcription -> mRNA -> splicing -> Mature mRNA (AUG ... UAA) -> Translation -> protein -> protein folding]
Nucleus: introns stay in the nucleus, exons exit the nucleus
Cytoplasm: translation and protein folding
Another View of Exon/Intron Structure
[Figure labels marked on the sequence below: Exon 1, Intron 1, Exon 2, Intron 2, Exon 3]
GGGTGTTTCCAAAAATACTCGGGTGTTTCCAAAAATACTCGAGTGGTCTCGTAGGTAGTGA
GTCAAATGGCGCCATACATAATGATTGTTGAGTTCTTGTGTCTTTGGTCCAGTGTCTCGGC
TGTTAATTGCGTCTGTTACGATGCAATTACTAGCTTGTTAGGATTCAGTATTATTTGGAGT
CCAAAGGAAAAGGTCACAATAATGGCGAAGCGGCTGATTTCGTTAAAAATTTTTACCCTTC
ATTTCTTATACCCGTCACGCTTCCACCCATACAAATTTTAGGCGTACAAAAAATGACCAGA
GAACTGCAGCCCGCATACAAAAAATGACCTGCGGCAGATCGTTGACTGTGCGTCCACTCAC
CCATACGGCTCTTGCGCAGCAGGCCTCGGGTGGTTTTTTTACTAGTAAATTGCCCCGCCCC
CCAACGGTTACGATGCAATTACTAGCTTGTTAGGATTCAGTATTATTTGGAAGCCAAAGGA
AAAGGTCACAATAATGGCAGAAGCGGCTGATTTCGTTAAAAATAAAATTAACAATGGAACA
TACTCAGTTGCCAATAAACATAAAGGAAAAAGTGTTATTTGGTGCATTTTATGTGACATTT
TAAAGGAAGATGAAACTGTTCTGACGGATGGCTGCAGCCCGCATACAAAAAATGACCTGCG
GCCGATCGTTGACTGTGCGTCCACTCACCCATACGGCTCTTGCGCAGCAGGCCTCTTGCGC
GTCAGGCCTCGTACATAATGATTGTTGAGTTCTTGTGTCTTTGGTCCAGTGTCTCGGCTGT
TAATTGCCCTTTGTACGATGCAATTACTAGCTTGTTAGGATTCAGTATTATTTGGAAGCCA
AAGGAAAAGGTCCCAAAACACAATAATGGCGAAGCGGCTGATTTCGTTAAAAATTCCCTAC
CCTTCATTTCTTATACCCGTCACGCTTCCACCCATACAAATTTTAGGCGTACAAAAAATGA
CCACAATAATGGCAGAAGCGGCTGATTTCGTTAAAAATAAAATTAACAATGGAACATACTC
AGTTGCCAATAAACCAGAGAACTGCAGCCCGCAGGTGGTTTTTTTACTCGTAAATTGCCCC
ACGATGCAGTTACTAGCTTGTTAGGATTCAGTATTATTTGGAAGCCAAAGGAAAAGGTCAC
AATAATGGCAGAAGCGGCTGATTAGGTTAAAAATAAAATTAACAATGGAACATACTCAGTT
GCCAATAAACATAAAGG
[Figure labels: E1, E2, E3]
Snap-shot of RNA transcription
Puzzle: how do you translate a 4-letter alphabet (nucleotides) into a 20-letter alphabet (amino acids)?
The Triplet Code
64 combinations
Each triplet is called
a Codon
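A short sketch of the counting argument behind the 64: words of length 1 or 2 over a 4-letter alphabet give too few combinations for 20 amino acids, while length 3 gives more than enough. The helper name is illustrative.

```python
from itertools import product

NUCLEOTIDES = "UCAG"  # the 4-letter RNA alphabet


def combinations_for_length(word_length, alphabet_size=4):
    """Number of distinct words of a given length over the alphabet."""
    return alphabet_size ** word_length


# Enumerate every ordered triplet of bases: these are the codons.
codons = ["".join(triplet) for triplet in product(NUCLEOTIDES, repeat=3)]

for length in (1, 2, 3):
    print(length, combinations_for_length(length))
print(len(codons))
```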
The “Genetic Code”: codons -> amino acids
[Figure: a generic tRNA carrying an amino-acid, with its anti-codon; tRNAs for Pro and Gly (anti-codons GGU and CCU) pairing with codons 1, 2, 3 of the mRNA GGA CCA UUU]
The Ribosome sets the reading frame
tRNAs: Met (anti-codon UAC), His (anti-codon GUA)
mRNA: GGACCAUUUCAUGCAUCAUGGGAAAGC
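A minimal sketch of reading-frame selection, using the mRNA from this slide (GGACCAUUUCAUGCAUCAUGGGAAAGC). It assumes, as a simplification, that translation begins at the first AUG and proceeds in triplets until a stop codon or the end of the sequence. The codon table is the standard genetic code in the usual U, C, A, G ordering.

```python
BASES = "UCAG"
# Standard genetic code, one letter per codon, in UCAG order; '*' marks stop.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(mrna):
    """Translate from the first AUG; stop at a stop codon or end of sequence."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon, so no reading frame is set
    protein = []
    for pos in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[pos:pos + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("GGACCAUUUCAUGCAUCAUGGGAAAGC"))
```

The output begins with M (Met) followed by H (His), matching the two tRNAs drawn on the slide. Shifting the start by one or two bases would yield a different protein.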
Anatomy of mRNA
[Figure: mRNA with 5’ UTR, AUG, UAA, 3’ UTR; the region between AUG and UAA undergoes translation into Protein]
UTR = untranslated region
mRNA is composed of EXONS
Not all of the mRNA necessarily serves as template for protein synthesis (hence 5’ and 3’ UTRs)
Therefore not all EXONS, or parts of EXONS, necessarily serve as template for protein synthesis
The human genome is estimated to have 25,000-30,000 genes
The estimate of 100,000 genes was a “back of the envelope” guess by a Harvard professor in the mid-80’s
gene = 30,000 bp
genome = 3 billion bp
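One plausible reading of the two numbers above is that the 100,000 figure comes from dividing genome size by a typical gene size. This is an assumption about how the guess was made, not something the slide states.

```python
GENOME_SIZE_BP = 3_000_000_000  # genome = 3 billion bp (from the slide)
GENE_SIZE_BP = 30_000           # gene = 30,000 bp (from the slide)


def max_genes(genome_bp, gene_bp):
    """Upper bound on gene count if genes tiled the genome end to end."""
    return genome_bp // gene_bp


print(max_genes(GENOME_SIZE_BP, GENE_SIZE_BP))
```

The bound assumes genes fill the whole genome with no gaps, which helps explain why it overshoots the 25,000-30,000 estimate.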
Table from Lander ES, et al. Initial sequencing and analysis of the human genome.
Nature. 2001 Feb 15;409(6822):860-921. Erratum in: Nature 2001 Aug 2;412(6846):565. Nature
2001 Jun 7;411(6838):720.
PMID: 11237011 [PubMed - indexed for MEDLINE]
Copied from NCBI
Genome size does not correlate with complexity
Organism   Genome size (bp)   Genes
YEAST      0.012 X 10^9       ~5,500
HUMAN      3 X 10^9           ~30,000
AMOEBA     600 X 10^9         ?
1-2 % of the human genome encodes proteins
[Pie chart of the human genome: REPEATS 50%; GENES 25% (exons, introns); remaining slices of 15% and 10%, labeled "?" (cis-regulation?) and H]
H = largely unsequenced heterochromatin
The human genome is AT- rich
G + C content = 41%
CG di-nucleotides expected at frequency of 0.2 X 0.2 = 0.04
BUT observed only 1/5 as frequently as expected
Why? CG is often methylated, and spontaneous deamination converts the C to T
CpG islands
associated with the beginning of genes
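A minimal sketch of the expected-versus-observed CG calculation. With G + C = 41%, G and C are each about 0.2 of bases, so CG is expected at about 0.2 X 0.2. The observed/expected ratio below uses the common form (count of CG) X N / (count of C X count of G). The function names are illustrative, and no threshold for calling a CpG island is implied.

```python
def gc_content(seq):
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def expected_cpg_frequency(gc_fraction):
    """Expected CG dinucleotide frequency if G and C are equally common."""
    return (gc_fraction / 2) ** 2


def cpg_observed_over_expected(seq):
    """Ratio of observed CG count to the count expected from base composition."""
    seq = seq.upper()
    expected = seq.count("C") * seq.count("G") / len(seq)
    if expected == 0:
        return 0.0
    return seq.count("CG") / expected


print(round(expected_cpg_frequency(0.41), 3))
print(cpg_observed_over_expected("CGCGCGCG"))
```

A genome-wide ratio near 0.2 corresponds to the "1/5 as frequently as expected" on the slide. Stretches where the ratio is much higher are candidates for CpG islands.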
From: Lander ES, et al. Initial sequencing and analysis of the human genome.
Nature. 2001 Feb 15;409(6822):860-921. Erratum in: Nature 2001 Aug 2;412(6846):565. Nature
2001 Jun 7;411(6838):720.
2 Major Classes of Repeats:
1. Transposons: 45% of our genome
2. Simple Repeats: 3% of our genome
(A)n or (CA)n or (CGG)n where n=1 to 11 generally
microsatellites—exhibit great variation
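Simple repeats such as (A)n, (CA)n, or (CGG)n can be located with a back-referencing regular expression. This is only a sketch. The unit length limit (1 to 3 bases) and the minimum of 4 tandem copies are illustrative assumptions, not values taken from the slide.

```python
import re


def find_simple_repeats(seq, max_unit=3, min_repeats=4):
    """Return (start, unit, copies) for each tandem repeat found.

    The lazy quantifier tries the shortest unit first, so AAAAAA is
    reported as unit 'A' rather than 'AA' or 'AAA'.
    """
    pattern = re.compile(
        r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_repeats - 1)
    )
    return [
        (m.start(), m.group(1), len(m.group(0)) // len(m.group(1)))
        for m in pattern.finditer(seq.upper())
    ]


print(find_simple_repeats("GTCACACACACAGTAAAAAAC"))
```

Comparing the copy number at the same locus across individuals is one way to see the variation that microsatellites exhibit.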
Junk or “rich paleontological record” ?
1 in 600 mutations in humans is due to transposons
10% of mutations in mouse due to transposons
Why?
4 TYPES OF TRANSPOSONS
LINES = long interspersed repeats (L1 still active)
SINES = short interspersed repeats (ALU sequences)
Diagram from Lander ES, et al. Initial sequencing and analysis of the human genome.
Nature. 2001 Feb 15;409(6822):860-921. Erratum in: Nature 2001 Aug 2;412(6846):565. Nature 2001 Jun 7;411(6838):720.
PMID: 11237011 [PubMed - indexed for MEDLINE]
LINES = long interspersed repeats (L1 still active)
spreads by “copy & paste”
[Figure: steps 1 and 2 of the LINE copy cycle, showing DNA and mRNA in the cell nucleus and mRNA in the cell cytoplasm]
Full-length LINE = 6 kb
Encodes 2 ORFs
1. Reverse transcriptase
2. Endonuclease
About 60-100 LINEs are still mobile
One new L1 jump in every 10-250 people born
SINES — do not encode proteins
They take advantage of LINE’s machinery to move
Retrovirus-like transposons
Like LINEs, except that they make the double-stranded DNA in the cytoplasm. They encode 2 proteins: reverse transcriptase and integrase. HIV and other retroviruses have 2 extra genes: coat protein and envelope protein
DNA Transposons
A dying breed. They require virgin genomes to survive because they don’t have the advantage
of “cis-preference”.
CREATIVE or DESTRUCTIVE FORCE?
3’ transduction: LINEs have a tendency to transcribe DNA beyond their 3’ end and thereby move host DNA
[Figure: a 1.7 kb ORF encoding a novel protein, flanked by MER85 elements at the 5’ and 3’ ends]
Closest sequence is the insect piggyBAC transposon
Expressed in fetal brain and cancer cells
Maintained for 40-50 Myr
Other candidates: intronless genes
Most LINES found in AT-rich, gene-poor regions:
they integrate at TTTT/A
Alus accumulate in GC-rich gene-rich regions!
Why?
Increased loss at AT regions?
Selective benefit to retaining
Alus near genes?
May be used in the stress
response to mediate QUICK
responses; e.g. they have been
shown to promote translation
Graph from Lander ES, et al. Initial sequencing and analysis of the human genome.
Nature. 2001 Feb 15;409(6822):860-921. Erratum in: Nature 2001 Aug
2;412(6846):565. Nature 2001 Jun 7;411(6838):720.
PMID: 11237011 [PubMed - indexed for MEDLINE]
Alu sequences evenly spread out across most
chromosomes (exception is Chr.19)
Graph from Lander ES, et al. Initial sequencing and analysis of the human genome.
Nature. 2001 Feb 15;409(6822):860-921. Erratum in: Nature 2001 Aug 2;412(6846):565. Nature
2001 Jun 7;411(6838):720.
PMID: 11237011 [PubMed - indexed for MEDLINE]
Gene Regulation
Odorant receptor (neurons)
Drosomycin: anti-microbial peptide (liver, secreted into blood)
Genomic Equivalence
All cells have the same DNA but they
express only a subset of available genes
Berkeley Drosophila Genome Browser at www.fruitfly.org
Gary Felsenfeld & Mark Groudine, Nature, vol. 421, 23 January 2003 (www.nature.com/nature)
also in Alberts’ textbook Molecular Biology of the Cell
simplified anatomy of a gene
Slide from Mike Levine
Changes in regulatory DNA cause
changes in morphology
Slide from Mike Levine
in vivo assay for enhancer activity
Slide from Mike Levine
Regulatory DNA is modular
Slide Courtesy of
Mike Levine
Enhancers can also be intronic
THE EXPERIMENT:
A 263 bp cluster of Dorsal binding sites in the intron of a gene called “sog” was cloned and fused to a lacZ reporter. This fusion construct was injected into the fly germline to make transgenic flies.
The figure shows the results of an in situ hybridization, which reveals mRNA localization in fly embryos. The embryo on the left shows sog mRNA in blue. The embryo on the right shows lacZ mRNA in blue. Both patterns are about the same, indicating that the Dorsal cluster is sufficient to drive the sog pattern of expression.
Markstein et al., Genome-wide analysis of clustered Dorsal binding sites identifies putative target genes
in the Drosophila embryo. Proc Natl Acad Sci U S A. 2002 Jan 22;99(2):763-8. Epub 2001 Dec 18.
Gene Regulation: Trafficking Problem
Promoter competition
Tethering Element
Insulator
Butler and Kadonaga, Genes and Development 2002
Gene Regulation: Trafficking Problem
Promoter
competition
Human: over half of transcription start sites are associated with CpG islands
Ohler U, Liao GC, Niemann H, Rubin GM. Computational analysis of core promoters in the Drosophila genome. Genome Biology 3, RESEARCH0087. Epub 2002 Dec 20.
http://genomebiology.com/2002/3/12/research/0087.1
Promoter-proximal tethering elements regulate enhancer-promoter specificity in the Drosophila Antennapedia complex
Vincent C. Calhoun, Angelike Stathopoulos, and Michael Levine
PNAS July 9, 2002 vol. 99 no. 14 9243–9247
Microarray Experiment
involves RNA-DNA base pairing
on spotted DNA chips
Learn all about microarrays at Pat Brown’s Homepage http://cmgm.stanford.edu/pbrown/
Spellman PT, Rubin GM. Evidence
for large domains of similarly expressed genes in the Drosophila genome. J Biol. 2002;1(1):5. Epub 2002 Jun 18.
Genes are organized into
co-expression domains
on average about 10 genes per
100,000 bp (in flies)
We don’t know what
determines the boundaries or
if they are functional
Weitzman JB. Transcriptional territories in the genome.
J Biol. 2002;1(1):2. Epub 2002 Jun 25.
OUTLINE
(second-half)
1. Identifying regulatory regions by phylogenetic comparisons in yeast
2. Phylogenetic comparisons in mouse-human
3. Ab initio predictions of enhancers in flies
PHYLOGENETIC APPROACH IN YEAST
Kellis M, Patterson N, Endrizzi M, Birren B, Lander ES.
Sequencing and comparison of yeast species to identify genes and regulatory elements.
Nature. 2003 May 15;423(6937):241-54.
Kellis et al. 2003
PHYLOGENETIC APPROACH IN MAMMALS
Identification of a
coordinate regulator of interleukins 4, 13, and 5 by cross-species sequence comparisons.
Loots GG, Locksley RM, Blankespoor CM, Wang ZE, Miller W, Rubin EM, Frazer KA.
Science. 2000 Apr 7;288(5463):136-40.
Ab initio Method of predicting enhancers
Scan the Genome for Clusters of
Binding Sites
Cis-Analyst
http://rana.lbl.gov/cis-analyst/
Fly Enhancer
http://flyenhancer.org
Cluster Buster
http://sullivan.bu.edu/cluster-buster/
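The idea of scanning for clusters of binding sites can be sketched as a two-step search: locate individual sites, then group sites that fall within a window. This is not the algorithm of Cis-Analyst, Fly Enhancer, or Cluster Buster. The window size, minimum site count, and mismatch tolerance are illustrative assumptions, and only the forward strand is scanned.

```python
def find_sites(seq, motif, max_mismatches=1):
    """Start positions where `motif` matches `seq` within the mismatch limit."""
    k = len(motif)
    return [
        i
        for i in range(len(seq) - k + 1)
        if sum(a != b for a, b in zip(seq[i:i + k], motif)) <= max_mismatches
    ]


def find_clusters(seq, motif, window=400, min_sites=3, max_mismatches=1):
    """Return (start, end, n_sites) for groups of sites spanning <= window bp."""
    sites = find_sites(seq, motif, max_mismatches)
    k = len(motif)
    clusters = []
    i = 0
    while i < len(sites):
        j = i
        # Extend the group while the next site still ends inside the window.
        while j + 1 < len(sites) and sites[j + 1] + k - sites[i] <= window:
            j += 1
        n_sites = j - i + 1
        if n_sites >= min_sites:
            clusters.append((sites[i], sites[j] + k, n_sites))
            i = j + 1  # continue after this cluster
        else:
            i += 1
    return clusters


motif = "GGGAATTCCC"
spacer = "A" * 20
genome = "A" * 50 + motif + spacer + motif + spacer + motif + "A" * 500 + motif
print(find_clusters(genome, motif, window=400, min_sites=3, max_mismatches=0))
```

In the toy sequence, three sites lie close together and form one cluster, while the fourth, isolated site is ignored.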
Defining TF binding sites
SELEX = systematic evolution of ligands by exponential enrichment
1. Mix your TF with a pool of all possible 25-mers
2. Isolate 25-mers that bind your TF
3. Cut 25-mers out of gel and sequence
[Figure labels: TF + all 25-mers; lanes - and +; bound 25-mers; free 25-mers]
Selex Results for Dorsal
GGGAATTCCC
GGGAATTCCC
GGGTTATCCC
GGGAATTCCA
Analyze about 30
independently obtained sequences
consensus?
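One simple way to approach the consensus question is to count bases column by column and take the most common base at each position. The sketch below uses only the four sequences shown on this slide, not the full set of about 30, and assumes they are already aligned and of equal length.

```python
from collections import Counter

# The four Dorsal SELEX sequences listed on the slide.
SELEX_SITES = ["GGGAATTCCC", "GGGAATTCCC", "GGGTTATCCC", "GGGAATTCCA"]


def position_counts(sites):
    """Count each base at each aligned position (a position frequency matrix)."""
    return [Counter(column) for column in zip(*sites)]


def consensus(sites):
    """Most common base at each position of the aligned sites."""
    return "".join(c.most_common(1)[0][0] for c in position_counts(sites))


print(consensus(SELEX_SITES))
for position, counts in enumerate(position_counts(SELEX_SITES), start=1):
    print(position, dict(counts))
```

The per-position counts preserve information that a single consensus string discards, such as which positions tolerate variation.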
Berman BP, Nibu Y, Pfeiffer BD, Tomancak P, Celniker SE, Levine M, Rubin GM, Eisen MB.
Exploiting transcription factor binding site clustering to identify cis-regulatory
modules involved in pattern formation in the Drosophila genome.
Proc Natl Acad Sci U S A. 2002 Jan 22;99(2):757-62.
Berman et al., 2003
Markstein M., unpublished data 2003
REFERENCES
Lander ES, et al. Initial sequencing and analysis of the human genome.
Nature. 2001 Feb 15;409(6822):860-921. Erratum in: Nature 2001 Aug 2;412(6846):565. Nature 2001 Jun
7;411(6838):720.
PMID: 11237011 [PubMed - indexed for MEDLINE]
Felsenfeld G, Groudine M.
Controlling the double helix.
Nature. 2003 Jan 23;421(6921):448-53. Review.
PMID: 12540921 [PubMed - indexed for MEDLINE]
Spellman PT, Rubin GM.
Evidence for large domains of similarly expressed genes in the Drosophila genome.
J Biol. 2002;1(1):5. Epub 2002 Jun 18.
PMID: 12144710 [PubMed - as supplied by publisher]
Kellis M, Patterson N, Endrizzi M, Birren B, Lander ES.
Sequencing and comparison of yeast species to identify genes and regulatory elements.
Nature. 2003 May 15;423(6937):241-54.
PMID: 12748633 [PubMed - indexed for MEDLINE]
Berman BP, Nibu Y, Pfeiffer BD, Tomancak P, Celniker SE, Levine M, Rubin GM, Eisen MB.
Exploiting transcription factor binding site clustering to identify cis-regulatory modules
involved in pattern formation in the Drosophila genome.
Proc Natl Acad Sci U S A. 2002 Jan 22;99(2):757-62.
PMID: 11805330 [PubMed - indexed for MEDLINE]
Levine M, Tjian R.
Transcription regulation and animal diversity.
Nature. 2003 Jul 10;424(6945):147-51. Review.
A Final Look at the Central Dogma
?
Promoter/enhancer prediction
and enhancer trafficking
This figure (minus the arrow and question mark) is from Alberts’ Molecular Biology of the Cell, 4th edition