Download Regulatory regions and regulatory elements - ENS-phys

Regulatory Sequence Analysis
Regulatory regions and regulatory elements
Jacques van Helden
The non-coding genome

Organism                  Year  Size (Mb)   Genes  Genes/Mb  Coding (%)  Non-coding (%)  Repetitive (%)  Transcribed (%)
Mycoplasma genitalium     1995        0.6     481       802          90              10
Haemophilus influenzae    1995        1.8   1 717       954          86              14
Escherichia coli          1997        4.6   4 289       932          87              13
Saccharomyces cerevisiae  1996         12   6 286       524          72              28
Arabidopsis thaliana      2001        120  27 000       225          30              70
Caenorhabditis elegans    1998         97  19 000       196          27              73
Drosophila melanogaster   2000        165  16 000        97          15              85
Homo sapiens              2001      3 200  31 000        10           3              97              46               28
Transcriptional activation
[Figure labels: transcriptional activator (DNA-binding domain, activation domain), enhancer, RNA polymerase, initiation]
Transcriptional repression
Prevent RNA polymerase from accessing DNA
Competition for factor binding site
Factor titration
Prevent transcription factor from interacting with RNA polymerase (bind with activation domain)
Transcription factor-DNA interfaces
[Figure: panels A, B, C, D]
Phosphate utilization in yeast
[Figure: regulatory network of phosphate utilization. Elements shown:
  genes: PHO2, PHO3, PHO4, PHO5, PHO11, PHO12, PHO8, PHO81, PHO84, PHO86, PHO87, PHO88, PHO89
  proteins: Pho2p, Pho4p, Pho4p-Phosphate, Pho81p, Pho80p, Pho85p, Pho85-Pho80 complex
  enzymes and transporters: acid phosphatase (3.1.3.2, secreted), alkaline phosphatase (3.1.3.1), Pi transporter
  reaction: orthophosphoric monoester + H2O → alcohol + Pi
  other activity: 2.7.1.-
  link types: expression, codes for, up-regulates, inhibits, catalyzes, transports, facilitates, translocates, is secreted, is transferred
  compartments: nucleus, cytoplasm, extracellular space]
Pho4p binding sites
gene   start  end   sequence
PHO5   -260   -242  ..GCACTCACACGTGGGACTA
PHO5   -260   -245  ..GCACTCACACGTGGGA
PHO5   -262   -239  TGGCACTCACACGTGGGACTAGCA
PHO8   -540   -522  ...TCGGGCCACGTGCAGCGAT
PHO8   -736   -718  ..ttacccgCACGCTTaatat
PHO81  -350   -332  ...TTATGGCACGTGCGAATAA
PHO84  -421   -403  ..TTTCCAGCACGTGGGGCGG
PHO84  -442   -425  ...TAGTTCCACGTGGACGTG
PHO84  -879   -874  .aaaagtgtCACGTGataaaaat
PHO84  -267   -250  ..taatacgCACGTTTTTaa
PHO84  -592   -575  ....TTACGCACGTTGGTGCTG
PHO5   -368   -349  ...AATTAGCACGTTTTCGCATA
PHO5   (?)    (?)   ..AAATTAGCACGTTTCGC
PHO5   -370   -347  .TAAATTAGCACGTTTTCGCATAGA
IUPAC ambiguous nucleotide code
Code  Nucleotide(s)  Mnemonic
A     A              Adenine
C     C              Cytosine
G     G              Guanine
T     T              Thymine
R     A or G         puRine
Y     C or T         pYrimidine
W     A or T         Weak hydrogen bonding
S     G or C         Strong hydrogen bonding
M     A or C         aMino group at common position
K     G or T         Keto group at common position
H     A, C or T      not G
B     G, C or T      not A
V     G, A or C      not T
D     G, A or T      not C
N     G, A, C or T   aNy
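
The IUPAC code above can be turned into a simple pattern-matching tool. The following is a minimal sketch, not the method used in the course: it converts an ambiguous consensus into a regular expression and reports matches on both strands. The function names and the example consensus CACGTK (covering the CACGTG and CACGTT cores seen in the Pho4p sites) are assumptions chosen for illustration.

```python
import re

# IUPAC ambiguity codes, as in the table above.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "W": "AT", "S": "CG",
    "M": "AC", "K": "GT", "H": "ACT", "B": "CGT",
    "V": "ACG", "D": "AGT", "N": "ACGT",
}

# Complement of each code (e.g. R = A/G pairs with Y = T/C).
COMPLEMENT = str.maketrans("ACGTRYWSMKHBVDN", "TGCAYRWSKMDVBHN")


def reverse_complement(pattern):
    """Reverse complement of an IUPAC pattern."""
    return pattern.upper().translate(COMPLEMENT)[::-1]


def iupac_to_regex(pattern):
    """Translate an IUPAC pattern into a regular expression string."""
    parts = []
    for code in pattern.upper():
        letters = IUPAC[code]
        parts.append(letters if len(letters) == 1 else "[" + letters + "]")
    return "".join(parts)


def find_matches(pattern, sequence):
    """Return sorted (start, strand) pairs, 0-based, for both strands."""
    sequence = sequence.upper()
    hits = set()
    for strand, pat in (("+", pattern), ("-", reverse_complement(pattern))):
        # A lookahead is used so that overlapping occurrences are all reported.
        regex = re.compile("(?=(" + iupac_to_regex(pat) + "))")
        for match in regex.finditer(sequence):
            hits.add((match.start(), strand))
    return sorted(hits)


if __name__ == "__main__":
    # First PHO5 site from the Pho4p table, without the alignment dots.
    print(find_matches("CACGTK", "GCACTCACACGTGGGACTA"))
```

Because CACGTG is its own reverse complement, a site with this core is reported on both strands at the same position.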
Pho4p binding specificity - matrix descriptions
[Figure: position-specific count matrices (rows A, C, G, T) for three site collections: Pho4p, Pho4p.cacgtg and Pho4p.cacgtt]
Sequence logo
[Figure: sequence logos for Rap1, Rpn4, Gcn4, HSE, Mig1 and Cbf1]
Shannon uncertainty

Hs(j): uncertainty of a column of a PSSM
Hg: uncertainty of the background (e.g. a genome)

  $H_s(j) = -\sum_{i=1}^{A} f_{ij} \log_2(f_{ij})$
  $H_g = -\sum_{i=1}^{A} p_i \log_2(p_i)$

where f_ij is the frequency of residue i in column j, p_i is the background probability of residue i, and A is the size of the alphabet.

Properties of the uncertainty (for a 4 letter alphabet)
• min(H) = 0
  No uncertainty at all: the nucleotide is completely specified (e.g. p={1,0,0,0})
• H = 1
  Uncertainty between two letters (e.g. p={0.5,0,0,0.5})
• max(H) = 2
  Complete uncertainty: two bits of information are required to specify the choice among the four alternatives (e.g. p={0.25,0.25,0.25,0.25})

Schneider (1986) defines an information content Rseq based on Shannon's uncertainty.

  $R_{seq}(j) = H_g - H_s(j)$
  $R_{seq} = \sum_{j=1}^{w} R_{seq}(j)$

where w is the number of columns of the matrix.

Source: Schneider (1986)
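
As an illustration of the formulas above, here is a minimal sketch that computes Hs(j), Hg and Rseq from the counts of a matrix. The function names and the example counts are made up for the illustration, and an equiprobable background is assumed by default.

```python
import math


def uncertainty(probs):
    """Shannon uncertainty in bits; terms with p = 0 contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def column_information(counts, background=(0.25, 0.25, 0.25, 0.25)):
    """Rseq(j) = Hg - Hs(j) for one column given as counts (A, C, G, T)."""
    total = sum(counts)
    freqs = [c / total for c in counts]
    return uncertainty(background) - uncertainty(freqs)


def matrix_information(columns, background=(0.25, 0.25, 0.25, 0.25)):
    """Rseq = sum of Rseq(j) over the w columns of the matrix."""
    return sum(column_information(col, background) for col in columns)


if __name__ == "__main__":
    # Hypothetical 3-column matrix: one perfectly conserved column,
    # one uninformative column, one column split between two residues.
    columns = [(16, 0, 0, 0), (4, 4, 4, 4), (8, 0, 0, 8)]
    for j, col in enumerate(columns, start=1):
        print(j, column_information(col))
    print("Rseq =", matrix_information(columns))
```

With an equiprobable background the three example columns contribute 2, 0 and 1 bits, matching the three boundary cases listed above.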
Schneider logos

Schneider (1990) proposes a graphical representation based on his previous entropy (H) for representing the importance of each residue at each position of an alignment. He provides a new formula for Rseq.

  $H_s(j) = -\sum_{i=1}^{A} f_{ij} \log_2(f_{ij})$
  $R_{seq}(j) = 2 - H_s(j) + e(n)$
  $h_{ij} = f_{ij} R_{seq}(j)$

  Hs(j)    uncertainty of column j
  Rseq(j)  information content of column j
  e(n)     correction for small samples (pseudo-weight)

Remarks
This information content does not include any correction for the prior residue probabilities (pi)
This information content is expressed in bits.
Boundaries
• min(Rseq)=0  equiprobable residues
• max(Rseq)=2  perfect conservation of 1 residue, all the others are forbidden

http://www.lecb.ncifcrf.gov/~toms/icons/tata.gif
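
The letter heights of a logo column follow directly from the formulas above. The sketch below is an illustration under stated assumptions: the function name is hypothetical, e(n) is passed in as a plain number with the sign convention of the formula on this slide, and its default of 0 (no small-sample correction) is an assumption of the example.

```python
import math


def column_heights(counts, e_n=0.0):
    """Heights h_ij = f_ij * Rseq(j) for one alignment column.

    counts: dict mapping each residue (A, C, G, T) to its count in the column.
    e_n:    small-sample correction term, used as written in
            Rseq(j) = 2 - Hs(j) + e(n); 0 means no correction (assumption).
    """
    total = sum(counts.values())
    freqs = {base: c / total for base, c in counts.items()}
    hs = -sum(f * math.log2(f) for f in freqs.values() if f > 0)
    rseq = 2 - hs + e_n
    return {base: f * rseq for base, f in freqs.items()}


if __name__ == "__main__":
    # Hypothetical columns: perfectly conserved, two-residue, equiprobable.
    print(column_heights({"A": 16, "C": 0, "G": 0, "T": 0}))
    print(column_heights({"A": 8, "C": 0, "G": 0, "T": 8}))
    print(column_heights({"A": 4, "C": 4, "G": 4, "T": 4}))
```

The heights of a column add up to Rseq(j), so a perfectly conserved column is 2 bits tall and an equiprobable column has no height.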
References - Sequence logos

Schneider, T.D., G.D. Stormo, L. Gold, and A. Ehrenfeucht. 1986. Information content of binding sites on nucleotide sequences. J Mol Biol 188: 415-431.
Schneider, T.D. and R.M. Stephens. 1990. Sequence logos: a new way to display consensus sequences. Nucleic Acids Res 18: 6097-6100.
Tom Schneider's publications online
• http://www.lecb.ncifcrf.gov/~toms/paper/index.html
Methionine Biosynthesis in S.cerevisiae
[Figure: pathway diagram. Steps shown, from substrate to product:]
L-Aspartate
  2.7.2.4   Aspartate kinase (HOM3); ATP → ADP
L-aspartyl-4-P
  1.2.1.11  Aspartate semialdehyde dehydrogenase (HOM2); NADPH → NADP+; Pi
L-aspartic semialdehyde
  1.1.1.3   Homoserine dehydrogenase (HOM6); NADPH → NADP+
L-Homoserine
  2.3.1.31  Homoserine O-acetyltransferase (MET2); Acetyl-CoA → CoA
O-acetyl-homoserine
  4.2.99.10 O-acetylhomoserine (thiol)-lyase (MET17); Sulfide
Homocysteine
  2.1.1.14  Methionine synthase (vit B12-independent) (MET6); 5-methyltetrahydropteroyltri-L-glutamate → 5-tetrahydropteroyltri-L-glutamate
L-Methionine
  2.5.1.6   S-adenosyl-methionine synthetase I (SAM1) and II (SAM2); H2O, ATP → Pi, PPi
S-Adenosyl-L-Methionine
Connected pathways: Aspartate biosynthesis, Threonine biosynthesis, Sulfur assimilation, Cysteine biosynthesis
Regulators shown: Met31p (MET31), Met32p (MET32), MET28, Cbf1p/Met4p/Met28p complex (CBF1, MET4), Gcn4p (GCN4), Met30p (MET30)
Met4p binding sites
gene   start  end   sequence
MET3   -367   -349  GAAAAGTCACGTGTAATTT
MET3   -384   -366  AAAAGGTCACGTGACCAGA
MET14  -235   -217  CTAATTTCACGTGATCAAT
MET16  -185   -167  ATCATTTCACGTGGCTAGT
ECM17  -311   -293  ATTTCATCACGTGCGTATT
ECM17  -339   -321  .TTTGTCCACGTGATATTTC
MET10  -255   -237  .CCACACCACGTGAGCTTAT
MET10  -237   -219  .TAGAAGCACGTGACCACAA
MET2   -360   -342  GTATTTTCACGTGATGCGC
MET2   -554   -536  TAATAATCACGTGATATTT
MET17  -306   -288  .AAATGGCACGTGAAGCTGT
MET17  -332   -314  TTGAGGTCACATGATCGCA
MET6   -540   -522  GCCACATCACGTGCACATT
MET6   -502   -484  AATATTTCACGTGACTTAC
SAM2   -329   -311  .TCTACCCACGTGACTATAA
SAM2   -381   -363  .TCTTCACATGTGATTCATC
[Figure: position-specific count matrix, rows A, C, G, T]
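
A count matrix can be derived from a set of aligned sites by counting residues column by column. The sketch below does this for the Met4p sites listed above. It assumes that the leading dots are alignment padding, so that string positions correspond to alignment columns; the function name is hypothetical, and the result is not claimed to reproduce the matrix shown on the slide.

```python
# Met4p sites as listed above; '.' is treated as alignment padding (assumption).
SITES = [
    "GAAAAGTCACGTGTAATTT",
    "AAAAGGTCACGTGACCAGA",
    "CTAATTTCACGTGATCAAT",
    "ATCATTTCACGTGGCTAGT",
    "ATTTCATCACGTGCGTATT",
    ".TTTGTCCACGTGATATTTC",
    ".CCACACCACGTGAGCTTAT",
    ".TAGAAGCACGTGACCACAA",
    "GTATTTTCACGTGATGCGC",
    "TAATAATCACGTGATATTT",
    ".AAATGGCACGTGAAGCTGT",
    "TTGAGGTCACATGATCGCA",
    "GCCACATCACGTGCACATT",
    "AATATTTCACGTGACTTAC",
    ".TCTACCCACGTGACTATAA",
    ".TCTTCACATGTGATTCATC",
]


def count_matrix(sites, alphabet="ACGT"):
    """Count residues per alignment column; non-alphabet characters are skipped."""
    width = max(len(site) for site in sites)
    matrix = {base: [0] * width for base in alphabet}
    for site in sites:
        for j, residue in enumerate(site.upper()):
            if residue in matrix:
                matrix[residue][j] += 1
    return matrix


if __name__ == "__main__":
    matrix = count_matrix(SITES)
    for base in "ACGT":
        print(base, " ".join("%2d" % n for n in matrix[base]))
```

Under this reading of the alignment, the conserved core occupies columns 7 to 12 (0-based), with a single deviation from CACGTG in each of columns 9 and 10.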
Met31p binding sites
gene   start  end   sequence
MET14  -202   -182  CCTCAAAAAATGTGGCAATGG
MET2   -313   -293  TGCAAAAAATTGTGGATGCAC
MET17  -227   -207  TCATGAAAACTGTGTAACATA
MET6   -313   -293  GTCGCAAAACTGTGGTAGTCA
SAM2   -306   -286  GCTTGAAAACTGTGGCGTTTT
SAM1   -283   -263  ACAGGAAAACTGTGGTGGCGC
MET19  -173   -153  ATAAGCAAACTGTGGGTTCAT
MUP3   -188   -168  CGGAAAAAACTGTGGCGTCGC
MET8   -184   -164  GGAAAAAAAATGTGAAAATCG
MET1   -232   -212  CATAATAAACTGTGAACGGAC
MET3   -259   -239  ACAAAGCCACAGTTTTACAAC
MET28  -159   -139  CTAACACCACAGTTTTGGGCG
MET8   -434   -414  TCTTGTCCGCAGTTTTATCTG
MET30  -168   -148  GGGAAGCCACAGTTTGCGCGG
MET6   -405   -385  CTATCGAACTCGTTTAGTCGC
[Figure: position-specific count matrix, rows A, C, G, T]
Characteristics of yeast regulatory sites

Located upstream of the regulated gene
Short DNA sequences (5-30 bp)
  Highly conserved core (5-8 bp), with partly conserved flanking nucleotides
  Pair of very short oligonucleotides (3 nt) separated by a non-conserved segment (0-20 bp)
Strand-insensitive
Within 800 bp from the start codon
Efficiency does not depend on
  strand
  position
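
The second site architecture listed above, a pair of trinucleotides separated by a non-conserved spacer, can be searched with a regular expression. This is a minimal sketch: the function name, the example dyad (CGG ... CCG) and the test sequence are hypothetical and chosen only for illustration. Only the shortest admissible spacing is reported for each start position.

```python
import re


def find_dyads(sequence, left, right, min_spacing=0, max_spacing=20):
    """Return (start, spacing) for each occurrence of left-N{spacing}-right.

    Positions are 0-based. The spacer is matched lazily, so only the
    shortest spacing is reported for a given start position.
    """
    pattern = re.compile(
        "(?=(%s)([ACGT]{%d,%d}?)(%s))"
        % (re.escape(left), min_spacing, max_spacing, re.escape(right))
    )
    return [(m.start(), len(m.group(2))) for m in pattern.finditer(sequence.upper())]


if __name__ == "__main__":
    # Hypothetical sequence with CGG and CCG separated by 5 nucleotides.
    print(find_dyads("AACGGTTTTTCCGAA", "CGG", "CCG"))
```

Since the sites are described as strand-insensitive, a complete search would also scan the reverse complement of the sequence, which this sketch leaves out.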
Pattern matching vs pattern discovery
[Figure: decision diagram]
Set of DNA sequences → Pattern known ?
  Yes → Pattern matching → Matching positions
  No → Pattern discovery → Putative regulatory patterns
Questions and approaches
1. If we know the consensus for a given transcription factor, can we predict its binding sites in a DNA sequence?
   Pattern matching against a sequence
2. Can we scan a sequence for matches with the consensus of all the currently known transcription factors?
   Matching a library of patterns against a sequence
3. Starting from a set of co-expressed genes, can we predict cis-acting elements involved in their transcriptional regulation?
   Pattern discovery within a sequence set
4. Can we detect regulatory signals by searching conserved elements in non-coding sequences of orthologous genes?
   Phylogenetic footprinting
5. Can we predict gene regulation on the basis of the presence of regulatory motifs in their regulatory regions?
   Gene classification on the basis of pattern scores
Typical situations : pattern discovery

Selected sequence set
  e.g. family of 20 co-regulated genes, obtained from DNA chip experiment
  → identify putative regulatory sites
Genome-scale pattern discovery
  e.g. all upstream sequences
  → identify transcription initiation signals
  e.g. all downstream sequences
  → identify 3' maturation signals
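
One simple way to approach pattern discovery in a selected sequence set is to count all oligonucleotides of a given length and compare their frequency with a background set. The sketch below illustrates this idea only; the slides do not specify an algorithm here, so the function names, the pseudo-count and the toy sequences are all assumptions. It ranks words by an observed/expected ratio, which is not a significance test.

```python
from collections import Counter


def count_kmers(sequences, k=6):
    """Count all k-letter words (one strand) made of A, C, G, T only."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("ACGT"):
                counts[kmer] += 1
    return counts


def over_represented(selected, background, k=6, pseudo=1.0):
    """Rank words of the selected set by observed/expected occurrences.

    Expected occurrences come from word frequencies in the background set,
    smoothed with a pseudo-count (hypothetical choice for this sketch).
    """
    observed = count_kmers(selected, k)
    bg = count_kmers(background, k)
    n_obs = sum(observed.values())
    n_bg = sum(bg.values())
    scored = []
    for kmer, n in observed.items():
        expected = n_obs * (bg[kmer] + pseudo) / (n_bg + pseudo * 4 ** k)
        scored.append((n / expected, kmer, n))
    return sorted(scored, reverse=True)


if __name__ == "__main__":
    # Toy data, invented for the example.
    selected = ["TTCACGTGAA", "GGCACGTGCC", "ATCACGTGTA"]
    background = ["ATATATATATATAT", "GCGCGCGCGCGC", "AATTGGCCAATTGGCC"]
    for ratio, kmer, n in over_represented(selected, background)[:3]:
        print(kmer, n, round(ratio, 1))
```

In the toy data the hexamer shared by the three selected sequences comes out first; with real upstream sequences the background would typically be the full set of upstream regions of the genome.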
Typical situations : pattern matching

Selected genes, selected patterns
  e.g. 10 genes known to be regulated by a factor
  → search matching positions
Selected genes, library of patterns
  → infer putative action of any previously known transcription factor
All genes, selected patterns
  → classify all the genes of a genome according to putative regulatory properties
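
Classifying genes by pattern scores requires a score per gene. One possible sketch, under stated assumptions, converts a count matrix into log-odds weights and keeps the best-scoring position of each upstream sequence. The counts, the pseudo-count, the equiprobable background, the function names and the two toy sequences are all hypothetical, and only one strand is scanned.

```python
import math

# Illustrative counts for a CACGTG-like core (hypothetical values).
COUNTS = {
    "A": [0, 16, 0, 1, 0, 0],
    "C": [16, 0, 15, 0, 0, 0],
    "G": [0, 0, 0, 15, 0, 16],
    "T": [0, 0, 1, 0, 16, 0],
}


def weight_matrix(counts, background=0.25, pseudo=1.0):
    """Log2-odds weights per column, with a pseudo-count spread over residues."""
    width = len(counts["A"])
    weights = []
    for j in range(width):
        total = sum(counts[base][j] for base in "ACGT") + pseudo
        weights.append({
            base: math.log2(((counts[base][j] + pseudo * background) / total) / background)
            for base in "ACGT"
        })
    return weights


def best_hit(sequence, weights):
    """Return (score, position) of the best-scoring window (A, C, G, T only)."""
    seq = sequence.upper()
    width = len(weights)
    hits = [
        (sum(weights[j][seq[i + j]] for j in range(width)), i)
        for i in range(len(seq) - width + 1)
    ]
    return max(hits)


if __name__ == "__main__":
    weights = weight_matrix(COUNTS)
    for name, upstream in (("gene_a", "TTTCACGTGTTT"), ("gene_b", "TTTTTTTTTTTT")):
        print(name, best_hit(upstream, weights))
```

Genes can then be ranked or thresholded on their best score; the choice of threshold is left open here.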
Differences between species
organism          location                              distance range   position effect   strand
yeast             upstream                              -800 to -1 bp    often irrelevant  insensitive
coli              upstream, overlapping initiation      -400 to +50 bp   often essential   sensitive or symmetric
higher organisms  upstream, downstream, within introns  over 100s of Kb  often irrelevant  insensitive
spaced pair of 3nt
~5-8 conserved bp
rare
frequent
most common ~5-8 conserved bp
core
repeated sites
occasional
composite
elements
frequent