Regulatory Sequences
(Basics)
Alexander Kel
Senior Vice President of Genome
Informatics,
BIOBASE GmbH,
Halchtersche Strasse 33
D-38304 Wolfenbuettel
Germany
www.biobase.de
Pathway builder
Array analyser
TRANSPATH
- mechanistic
- semantic
S/MARt DB
Patho DB
TRANSFAC
Match
Patch
Catch
CMFinder
TRANSCompel
Cytomer
TRANSGenome
TRANSPLORER
BIOBASE customers*
TRANSFAC
Syngenta
Celera
Monsanto
Pfizer
Merck Sharp & Dohme
Amgen
Takeda
Novartis
GlaxoSmithKline
TRANSPATH
Vertex
More than 200 academic
labs including:
Harvard
Stanford
Tokyo University
Riken Labs
Max Planck
More than 7000 registered
users on our portal
gene-regulation.com
Both
Aventis
Eli Lilly
Schering Plough
Hoffmann La Roche
Akzo Nobel
* not complete
Same blocks - different structures
LEGO system
Concepts of gene regulation
DNA
amplification, methylation,
chromatin structure
transcription
RNA
information carrier 1
carrier organization
transformation
splicing, degradation
translation
protein
modification, degradation
information carrier 2
Gene structure
TRANSFAC
Regulatory
Elements
Gene
Contig
3‘
5‘
Transcription
primary
transcript
Splicing
Splice
Variants
mRNA
altern.
exon
5’-UTR
CDS
3’-UTR
General schema of the modular hierarchical
structure of transcription regulatory regions of
eukaryotic genes.
TSS
enhancer 2
enhancer 1
box A‘
promoter
box C box B
composite
element
box A‘‘
box G box F
box D‘
box E
box D
box A
TATA
box
initiator
Inr
trans
cis
…
Human genes
Sequences and positions of AP-1 binding sites
glutathione P-transferase
enhancer at -2500
hemoglobin,
epsilon
TGACTTT
-80 bp
TGACATC
Akt-2
IFN-
-100 bp
TGTCACC
-89 bp
Apo AII
TGACTCA
-792 bp
TGAGTCA
Melanotransferrin
-2013 bp
Collagenase
TGAGTCA
-72 bp
proto-oncogene
c-myc
porphobilinogen
deaminase
TGATTTA
-335 bp
TGACTCA
-162 bp
GM-CSF
TGACTCA
enhancer at -3500
What is a transcription factor?
A transcription factor is a protein that regulates transcription
after nuclear translocation
by specific interaction with DNA
or by stoichiometric interaction with a protein that can be assembled
into a sequence-specific DNA-protein complex.
Transcription factors
Sequence-specific DNA
binding
Non-DNA
binding
HAT
Layer III
Co-activator
Layer II
Layer I
DNA
adapter
TF1
TF2
TF3
TF4
Structure of transcription factors
USF-1, dimer
Structure of transcription factors
oligomerization
domain
Ligand-binding
domain
Activation
domain
Protein-protein
interaction
domain
DNA binding
domain
N
Gene
1.
Scavenger
receptor,
Homo sapiens
Schema and positions of a CE
TRANSCompel
accession number
C00080
Ets
AP-1
Enhancer –4500/-4100
2.
-53
:
GM-CSF,
Mus musculus
-40
:
3.
4.
Collagenase,
Homo sapiens
-89
:
-82
:
C00081
Ets
AP-1
-72
:
Ets
-66
:
C00083
AP-1
IgH ,
Mus musculus
C00133
Ets
AP-1
Enhancer at 3’ flank
5.
6.
7.
8.
9.
10.
Interleukin 2,
Homo sapiens
Interleukin 2,
Homo sapiens
Interleukin 2,
Mus musculus
-283
:
NFAT
IRF-1, Mus
musculus
AP-1
-167
:
C00109
-142
:
NF-κB
AP-1
-167
:
IgH,
Homo sapiens
Serum
amyloid A1,
Rattus
norvegicus
-268
:
C00165
-142
:
AP-1
Oct-2
Ets
CBF
C00158
C00173
-117
:
-73
:
C/EBP
-123
:
-113
:
STAT-1
C00101
NF-κB
-49
:
-40
:
NF-κB
C00192
Ternary complex NFATp - AP1 - DNA
Composite elements
Minimal functional units where both protein-DNA and protein-protein
interactions contribute to a highly specific pattern of gene expression
and provide cross-coupling of different signal transduction pathways.
F2
F1
Low level
of transcription
Low level
of transcription
F1
F2
Synergistic activation of
transcription
F1
F2
Integration of signals. Cross-coupling of signal transduction pathways
Membrane receptor
Ca2+-dependent channel
Src
Ras
SH2
Ras
SH3
Phosphorylation
Ca2+
Ca2+
GTP
GDP
PLC
Adaptors
PI3-K
Ca2+
cytoplasm
IP3
Calcineurin
PKB/Akt
P
NFATp
ERK
JNK
NFATp
ERK
NFATp
Nucleus
c-Fos
P38MAPK
JNK
c-Fos
IL-2
P
P
c-Jun
c-Fos c-Jun
Composite element
P38MAPK
c-Jun
ATF-2
c-Jun ATF-2
ATF-2
Mechanisms of functioning of synergistic composite elements
1)
F1
S1
F2
F1
F2
S2
S1
S2
F2
F1
F2
S2
S1
S2
2)
F1
S1
Cooperative binding to DNA
and ternary complex formation
A new protein surface for
DNA recognition could be
formed
3)
F1 F2
S1
S2
Simultaneous interaction
of activation domains with the
components of the basal complex
Mechanisms of functioning of synergistic composite elements
4)
F1
F2
S1
S2
Forming a new protein
surface for interaction with
the basal complex
5)
F1
F1
s1
F2
F2
s2
Relief of autoinhibition
as a result of protein-protein interactions
Mechanisms of functioning of synergistic composite elements
6)
DNA bending by one of the
transcription factors
F1
S1
F2
S2
7)
DNA wrapping around
a nucleosome allows
transcription factors to interact
F1
F2
8)
HAT complex
F1
F2
S1
S2
Recruitment of a HAT complex
by one of the transcription factors
Mechanisms of functioning of antagonistic composite elements
1)
HAT complex
Mutually exclusive binding of
factor F1(activator)
and F2 (repressor)
HDAC complex
Mechanisms of functioning of antagonistic composite elements
2)
HAT complex
Binding of F2 (repressor)
results in the conformational
changes of F1 (activator)
HDAC complex
Mouse IL-4 promoter
AP-1
HMG Y
STAT 6
AP-1
NF-Y
AP-1
HMG Y
AP-1
c-MAF
TATA
NFAT
NFAT
NFAT
NFAT
CE
-249
-180
-150
-114
-88
ST
NFAT
CE
-60
-28
+1
AP-1
AP-1
AP-1
CBF
AP-1
NF-κB
c-Rel/p65
NF-κB
p50/p65
GM-CSF
Homo sapiens
CBF
AP-1
TATTT
NFAT
NFAT
CE
NFAT
CE
T-cell specific inducible enhancer at –3500 bp
NFAT
CE
NFAT
HMG Y(I)
-114
-88
CD28 response element
-54
CE
Promoter
ST
+1
Enhanceosome
Recruitment of CIITA to MHC-II promoters. A prototypical MHC-II promoter (HLA-DRA) is represented schematically with the W, X, X2,
and Y sequences conserved in all MHC-II, Ii, and HLA-DM promoters. RFX, X2BP, NF-Y, and an as yet undefined W-binding protein bind
cooperatively to these sequences and assemble into a stable higher order nucleoprotein complex referred to here as the MHC-II
enhanceosome. CIITA is tethered to the enhanceosome via multiple weak protein-protein interactions with the W, X, X2, and Y-binding
factors. The octamer site found in the HLA-DRA promoter (O), and its cognate activators (Oct and OBF-1) are not required for recruitment of
CIITA. CIITA is proposed to activate transcription (arrow) via its amino-terminal activation domains (AD), which contact the RNA
polymerase II basal transcription machinery.
Masternak K et al., Genes Dev 2000 May 1;14(9):1156-66
Closed nucleosomes
Site-specific TF
Acetylase
Acetylase
PCAF
Co-activator
p300/CBP
Acetylation
Acetylation
TFIID
TFIIA
TFIIB
TFIIF
TFIIE
RNA pol II
TFIIH
S/MARs
Scaffold/matrix attached regions (S/MARs)
are regions of the DNA strand that are found
at the base of chromatin loops. They anchor
the DNA to the proteinaceous nuclear
matrix.
Each loop is considered to be a functional
domain.
S/MARs
genes
residual DNA
S/MARs may act as border elements and
thus protect gene expression from position
effects.
S/MARs
open chromatin
promoter
enhancer
gene
compact chromatin
(transcribed
region)
SAR
LCR
(regulated)
SAR
SAR
LCR
SAR
nuclear scaffold
J. Bode / E. Wingender 1993
Databases on gene regulation
BKL: collected information is displayed in a
‘one page per protein’ format = Protein Reports
• Clear identification of where you are (which species and which protein).
• Tabular presentation of controlled-vocabulary terms.
• Annotations linked to PubMed references.
• Clear paths of navigation between protein reports, within a species and between species.
• Links to ‘public domain’ databases.
Databases containing gene regulation information (N. Database: URL)
1. EMBL Nucleotide sequence database: http://www.ebi.ac.uk/embl.html
2. GenBank: http://www.ncbi.nlm.nih.gov/Web/Genbank/index.html
3. SWISS-PROT: http://www.expasy.ch
4. PIR (Protein Information Resource): http://www-nbrf.georgetown.edu/pir
5. PDB: http://www.pdb.bnl.gov/
6. EPD (Eukaryotic promoter database): http://www.epd.isb-sib.ch
7. TRANSFAC: http://transfac.gbf.de/TRANSFAC
8. TRRD: http://www.bionet.nsc.ru/trrd/
9. COMPEL: http://compel.bionet.nsc.ru/
10. TFD (Transcription factor database): http://www.ifti.org/
11. RegulonDB: http://www.cifn.unam.mx/Computational_Biology/regulondb
12. SCPD (The Promoter Database of Saccharomyces cerevisiae): http://cgsigma.cshl.org/jian/
13. Muscle-Specific Regulation of Transcription (A Catalogue of Regulatory Elements): http://agave.humgen.upenn.edu/MTIR/HomePage.html
14. EpoDB (Database of genes that relate to vertebrate red blood cells): http://agave.hum-gen.upenn.edu/epodb/
15. GENET: http://www.iephb.ru/~spirov/genet00.html
16. PlantCARE: http://sphinx.rug.ac.be:8080/PlantCARE/
17. PLACE: http://www.dna.affrc.go.jp/htdocs/PLACE/
18. DBTSS: http://dbtss.hgc.jp/
EMBL data library
Feature gene
Definition
region of biological interest identified as a gene and for which a name has been
assigned;
Optional Qualifiers /allele="text"
/citation=[number]
/db_xref="<database>:<identifier>"
/evidence=<evidence_value>
/function="text"
/label=
/map="text"
/note="text"
/product="text"
/pseudo
/phenotype="text"
/standard_name="text"
/usedin=accnum:feature_label
Comments the gene feature describes the interval of DNA that
corresponds to a genetic trait or phenotype; the feature is,
by definition, not strictly bound to its positions at the
ends; it is meant to represent a region where the gene is
located.
EMBL data library
Feature promoter
Definition
region on a DNA molecule involved in RNA polymerase binding to initiate
transcription;
Optional Qualifiers /citation=[number]
/db_xref="<database>:<identifier>"
/evidence=<evidence_value>
/function="text"
/gene="text"
/label=feature_label
/map="text"
/note="text"
/phenotype="text"
/pseudo
/standard_name="text"
/usedin=accnum:feature_label
Molecule Scope DNA
or look for: (start of) mRNA, or precursor_RNA, or prim_transcript, or exon /number=1, ...
EMBL data library
Feature misc_feature
Definition
region of biological interest which cannot be described by any other feature
key; a new or rare feature;
Optional Qualifiers /citation=[number]
/db_xref="<database>:<identifier>"
/evidence=<evidence_value>
/function="text"
/gene="text"
/label=feature_label
/map="text"
/note="text"
/number=unquoted
/phenotype="text"
/product="text"
/pseudo
/standard_name="text"
/usedin=accnum:feature_label
Comments this key should not be used when the need is merely to mark a
region in order to comment on it or to use it in another
feature's location; use the '-' pseudo-key instead.
e.g.:
FT   misc_feature    4538
FT                   /note="transcription initiation site"
FT                   /gene="CDC6"
EMBL data library
Feature enhancer
Definition a cis-acting sequence that increases the utilization of (some) eukaryotic
promoters, and can function in either orientation and in any location (upstream
or downstream) relative to the promoter;
Optional Qualifiers /citation=[number]
/db_xref="<database>:<identifier>"
/evidence=<evidence_value>
/label=feature_label
/gene="text"
/map="text"
/note="text"
/standard_name="text"
/usedin=accnum:feature_label
Organism Scope eukaryotes and eukaryotic viruses
EMBL data library
Feature protein_bind
Definition non-covalent protein binding site on nucleic acid;
Mandatory Qualifiers /bound_moiety="text"
Optional Qualifiers /citation=[number]
/db_xref="<database>:<identifier>"
/evidence=<evidence_value>
/function="text"
/gene="text"
/label=feature_label
/map="text"
/note="text"
/standard_name="text"
/usedin=accnum:feature_label
Comments note that RBS is used for ribosome binding sites.
EMBL data library
Qualifier bound_moiety
Definition moiety bound
Value Format "text"
Example /bound_moiety="repressor"
Qualifier usedin
Definition indicates that the feature is used in a compound feature in another entry
Value Format
Accession-number:feature-name or
Database_name::Acc_number:feature_label
Example /usedin=X10087:proteinx
Comment
database_name is an abbreviation for the name of the database in
which the entry for the accession number can be found.
EMBL data library
FH   Key             Location/Qualifiers
FH
FT   source          1..4734
FT                   /db_xref="taxon:9606"
FT                   /sequenced_mol="DNA"
FT                   /organism="Homo sapiens"
FT   protein_bind    4495..4502
FT                   /bound_moiety="E2F"
FT   protein_bind    4529..4537
FT                   /bound_moiety="E2F"
FT   misc_feature    4538
FT                   /note="transcription initiation site"
FT                   /gene="CDC6"
Here: experimentally confirmed sites,
though no /evidence qualifier is given
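The feature-table layout above (a key and a location on one FT line, qualifiers on the following FT lines) is simple enough to read programmatically. The following is a minimal sketch, not an official EMBL parser: the function name `parse_features` is hypothetical, and it assumes single-line qualifier values and the fixed five-character `FT` prefix shown in the CDC6 example.

```python
import re

# CDC6 promoter excerpt as shown on the slide (qualifier values on one line each).
EMBL_FT = """\
FT   source          1..4734
FT                   /db_xref="taxon:9606"
FT                   /organism="Homo sapiens"
FT   protein_bind    4495..4502
FT                   /bound_moiety="E2F"
FT   protein_bind    4529..4537
FT                   /bound_moiety="E2F"
FT   misc_feature    4538
FT                   /note="transcription initiation site"
FT                   /gene="CDC6"
"""

def parse_features(text):
    """Return a list of {'key', 'location', 'qualifiers'} dicts (hypothetical helper)."""
    features = []
    for line in text.splitlines():
        if not line.startswith("FT"):
            continue
        body = line[5:]                      # drop the 'FT' tag and its padding
        if body and not body[0].isspace():
            # A non-blank first column starts a new feature: key, then location.
            key, location = body.split(None, 1)
            features.append({"key": key, "location": location.strip(), "qualifiers": {}})
        else:
            # Continuation line: /name="value" or a bare flag such as /pseudo.
            m = re.match(r"\s*/(\w+)(?:=(.*))?$", body)
            if m and features:
                name, value = m.group(1), m.group(2)
                features[-1]["qualifiers"][name] = value.strip('"') if value else True
    return features

features = parse_features(EMBL_FT)
binding_sites = [
    (f["location"], f["qualifiers"]["bound_moiety"])
    for f in features
    if f["key"] == "protein_bind"
]
print(binding_sites)
```

Filtering on the `protein_bind` key and reading `/bound_moiety` is how one would collect annotated binding sites; checking for a missing `evidence` qualifier would flag cases like the one noted on the slide.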
EMBL data library
FH   Key             Location/Qualifiers
FH
FT   source          1..3204
FT                   /db_xref="taxon:9606"
FT                   /sequenced_mol="DNA"
FT                   /organism="Homo sapiens"
FT   promoter        1..3201
FT                   /note="melanocortin-1 receptor"
FT                   /gene="MC1R"
FT   misc_signal     570..575
FT                   /note="E-BOX"
...
FT   TATA_signal     922..941
FT   protein_bind    1343..1350
FT                   /evidence=EXPERIMENTAL
FT                   /bound_moiety="AP-1"
...
FT   TATA_signal     1553..1559...
FT   misc_binding    1957..1964
FT                   /evidence=EXPERIMENTAL
FT                   /bound_moiety="AP-2"
FT   misc_binding    2060..2067
FT                   /evidence=EXPERIMENTAL
FT                   /bound_moiety="AP-2"
FT   misc_binding    2069..2074
FT                   /evidence=EXPERIMENTAL
FT                   /bound_moiety="SP-1"
FT   misc_binding    2603..2608
FT                   /evidence=EXPERIMENTAL
FT                   /bound_moiety="SP-1"
Here:
misc_signal "E-BOX" and TATA_signal are identified by homology and positional reasoning,
AP-1 and AP-2 binding sites are suggested by homology,
Sp1 sites are confirmed by gel shift analysis
EMBL data library
Feature TATA_signal
Definition TATA box; Goldberg-Hogness box; a conserved AT-rich septamer found about
25 bp before the start point of each eukaryotic RNA polymerase II transcript
unit which may be involved in positioning the enzyme for correct initiation;
consensus=TATA(A or T)A(A or T) [1,2];
Optional Qualifiers /citation=[number]
/db_xref="<database>:<identifier>"
/evidence=<evidence_value>
/gene="text"
/label=feature_label
/map="text"
/note="text"
/usedin=accnum:feature_label
Organism Scope eukaryotes and eukaryotic viruses
Molecule Scope DNA
References [1] Efstratiadis, A. et al. Cell 21, 653-668 (1980)
[2] Corden, J., et al. "Promoter sequences of eukaryotic protein-encoding
genes" Science 209, 1406-1414 (1980)
EMBL data library
Feature CAAT_signal
Definition CAAT box; part of a conserved sequence located about 75 bp up-stream of the
start point of eukaryotic transcription units which may be involved in RNA
polymerase binding; consensus=GG(C or T)CAATCT [1,2].
Optional Qualifiers /citation=[number]
/db_xref="<database>:<identifier>"
/evidence=<evidence_value>
/gene="text"
/label=feature_label
/map="text"
/note="text"
/usedin=accnum:feature_label
Organism Scope eukaryotes and eukaryotic viruses
Molecule Scope DNA
References [1] Efstratiadis, A. et al. Cell 21, 653-668 (1980)
[2] Nevins, J.R. "The pathway of eukaryotic mRNA formation" Ann Rev
Biochem 52, 441-466 (1983)
Feature GC_signal
Definition GC box; a conserved GC-rich region located upstream of the start point of
eukaryotic transcription units which may occur in multiple copies or in either
orientation; consensus=GGGCGG;
Optional Qualifiers /citation=[number]
/db_xref="<database>:<identifier>"
/evidence=<evidence_value>
/gene="text"
/label=feature_label
/map="text"
/note="text"
/usedin=accnum:feature_label
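The consensus strings quoted in the TATA_signal, CAAT_signal and GC_signal definitions translate directly into regular expressions. The sketch below is illustrative only: the function name `find_signals` and the toy sequence are assumptions, only the forward strand is scanned (the GC box definition allows either orientation), and a consensus hit by itself does not establish a functional element.

```python
import re

# Consensus strings taken from the feature definitions; "(A or T)" becomes [AT].
SIGNALS = {
    "TATA_signal": "TATA[AT]A[AT]",
    "CAAT_signal": "GG[CT]CAATCT",
    "GC_signal": "GGGCGG",
}

def find_signals(seq):
    """Return (feature key, 1-based start, matched text), forward strand only."""
    seq = seq.upper()
    hits = []
    for name, pattern in SIGNALS.items():
        # A lookahead wrapper reports overlapping occurrences as well.
        for m in re.finditer("(?=(" + pattern + "))", seq):
            hits.append((name, m.start() + 1, m.group(1)))
    return sorted(hits, key=lambda h: h[1])

# Hypothetical toy sequence, constructed to contain one instance of each signal.
toy = "ccGGCCAATCTggGGGCGGaaTATAAAAgg"
hits = find_signals(toy)
for hit in hits:
    print(hit)
```

In a real promoter one would additionally check the expected distances from the transcription start (about 25 bp for the TATA box and about 75 bp for the CAAT box, as stated in the definitions).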
EMBL data library
Feature misc_signal
Definition any region containing a signal controlling or altering gene function or
expression that cannot be described by other signal keys (promoter,
CAAT_signal, TATA_signal, -35_signal, -10_signal, GC_signal, RBS,
polyA_signal, enhancer, attenuator, terminator, and rep_origin).
Optional Qualifiers /citation=[number]
/db_xref="<database>:<identifier>"
/evidence=<evidence_value>
/function="text"
/gene="text"
/label=feature_label
/map="text"
/note="text"
/phenotype="text"
/standard_name="text"
/usedin=accnum:feature_label
EMBL data library
ID   MMIGHALP   standard; DNA; MUS; 17956 BP.
XX
AC   X96607;
...
FT   enhancer        4537..6107
FT                   /note="locus control region"
FT                   /note="alpha"
FT                   /gene="IgH"

ID   SSLCREG1   standard; DNA; MAM; 1190 BP.
XX
AC   X86793;
XX
SV   X86793.1
XX
DT   10-MAY-1995 (Rel. 43, Created)
DT   30-MAY-1995 (Rel. 43, Last updated, Version 3)
XX
DE   S.scrofa locus control region (1190 bp)
XX
KW   locus control region. ...
FT   source          1..1190
FT                   /chromosome="9"
FT                   /db_xref="taxon:9823"
FT                   /organism="Sus scrofa"
FT                   /clone_lib="clonetech"
FT                   /map="p2.4"
FT   [key not shown] 5..1190
FT                   /note="locus control region (HSI)"
Eukaryotic Promoter Database (EPD)
Praz et al., Nucleic Acids Res. 30, 322-324
http://www.epd.isb-sib.ch
Eukaryotic Promoter Database (EPD)
All EPD: 4809
Vertebrate promoters: 2540
Arthropod promoters: 2000
Plant promoters: 198
Viral: 129
Nematode promoters: 26
Praz et al., Nucleic Acids Res. 30, 322-324 (2002)
http://www.epd.isb-sib.ch
Eukaryotic Promoter Database (EPD)
ID   HS_MYC_1   standard; single; VRT.
XX
AC   EP11146;
XX
DT   ??-APR-1987 (Rel. 11, created)
DT   10-OCT-2001 (Rel. 69, Last annotation update).
XX
DE   c-myc (cellular homologue of myelocytomatosis virus 29 oncogene),
DE   promoter 1, MYC gene.
OS   Homo sapiens (human).
XX
HG   Homology group 52; Mammalian c-myc proto-oncogene, promoter 1.
AP   Alternative promoter #1 of 2; exon 1; site 1.
NP   none.
XX
DR   EPD; EP11148; HS_MYC_2; alternative promoter; [+162; +].
DR   EPDEX; HS_MYC.
DR   EMBL; X00364.2; HSMYCC; [-2327, 8669]. [ EMBL; GenBank; DDBJ ]
...
DR   SWISS-PROT; P01106; MYC_HUMAN.
DR   TRANSFAC; R01157; HS$CMYC_01; [-49,-27]; by position.
...
DR   MIM; 190080.
Eukaryotic Promoter Database (EPD)
...
DR   MIM; 190080.
XX
RN   [1]
RX   MEDLINE; 84026482.
RA   Battey J., Moulding C., Taub R., Murphy W., Stewart T., Potter H.,
RA   Lenoir G., Leder P.;
RT   "The human c-myc oncogene: structural consequences of
RT   translocation into the IgH locus in Burkitt lymphoma";
RL   Cell 34:779-787(1983)....
XX
ME   Nuclease protection [2].
ME   Nuclease protection; transfected or transformed cells [3].
ME   Primer extension [2].
XX
SE   aatctccgcccaccggccctttataatgcgagggtctggacggctgaggACCCCCGAGCT
XX
TX   6. Vertebrate promoters
TX   6.1. Chromosomal genes
TX   6.1.5. Hormones, growth factors, regulatory proteins
TX   6.1.5.16. Various cellular protooncogenes
XX
KW   Proto-oncogene, Nuclear protein, DNA-binding, Glycoprotein,
KW   Transcription regulation.
XX
FP   Hs c-myc  P1 :+S  EM:X00364.2  1+  2328; 11146.052 010*1
XX
DO   Experimental evidence: 3,3#,6
DO   Expression/Regulation: +mitogen;+IL-2
RF   Cell34:779  PNAS80:6307  MCB7:1393  MCB7:2988
//
RegulonDB
Salgado et al., Nucleic Acids Res. 29, 72-74 (2001)
http://www.cifn.unam.mx/Computational_Genomics/regulondb/
SCPD
Zhu & Zhang, Bioinformatics 15, 607-611 (1999)
http://cgsigma.cshl.org/jian/
PlantCARE
Rombauts et al., Nucleic Acids Res. 27, 295-296 (1999)
http://sphinx.rug.ac.be:8080/PlantCARE/cgi/index.html
Schematic representation of "Oligo-capping" method
TRRD
Kolchanov et al., Nucleic Acids Res. 30, 312-317 (2002)
http://wwwmgs.bionet.nsc.ru/mgs/gnw/trrd/
TRANSFAC®
a database on gene transcription regulation
contains
SITE
GENE
binds to and
regulates
is used to
construct
encodes
for
FACTOR
is an attribute
of
MATRIX
interacts
TRANSFAC structure
CLASS
SPECIES
FEATURES
interacting
factor
SYNONYMS
FACTOR
MATRIX
CELL
gene
METHOD
expression
SITE
regulatory region
SEQUENCE
FUNCTIONAL ELEMENT
GENE
coding region
Manual annotation of the databases: input client
TRANSFAC: FACTOR table, protein sequence
TRANSFAC: FACTOR table, protein domains
TRANSFAC: FACTOR table, structural and functional features
TRANSFAC: FACTOR table, links to other databases
TRANSFAC: classification of transcription factors
TRANSFAC: CLASS table
TRANSFAC 8.1 (2004-03-31): number of factor entries for different species
[Bar chart. Categories: human, mouse, rat, other vertebrates, fruit fly, plants, fungi, other. Y-axis: number of entries, 0 to 1400.]
TRANSFAC 8.1 (2004-03-31): distribution of experimentally known TFBS in 5' regions of genes.
[Histogram. X-axis: position within the 5' region. Y-axis: number of sites, 0 to 800.]
TRANSFAC: FACTOR table, protein-DNA and protein-protein
interactions
TRANSFAC: MATRIX table
TRANSFAC® : accompanying tools
Patch™: pattern search
Match™: PWM-based search
[Figure: the same set of aligned binding sites, shown for both tools]
gATTGGCGCGAAGtttt
aCAGGGCGCCAAAcgcg
aTTTCGCGCCAAActtg
aTTTCGCGCCAAActtg
aTTTCGCGCCAAActtg
GGCTGCGGCCAAAtctc
ATCTCCCGCCAGGtcag
aGTTCGCGGGCAAatgc
cTTCGGCGCGCGGtgtt
tTTTCGCGCCAAAgtca
tTTTGCCGCGAAAagac
q1
q2
Selection of DNA binding sites by
regulatory proteins
Statistical-mechanical theory
O.G. Berg and P.H von Hippel
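[Figure labels: Match, Mutational drift, Mismatch]
Discrimination energies $\varepsilon_{lB}$ for positions $l = 1, 2, \dots, s$ of a site:
Base   l=1   l=2   ...
A      0.5   0.9
T      0.0   0.1
G      1.2   0.0
C      0.1   0.8
$\varepsilon_{l0} = 0$ for the cognate base at position $l$; $\varepsilon_{lB}$ for any other base $B$.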
Assumptions:
1) The binding affinity of the protein to DNA lies in some useful range.
2) The number of sequences is large.
3) All possible sequences are equiprobable.
4) $\varepsilon_{lB}$ expresses the decrease in binding energy when the cognate base pair at position $l$ is replaced by $B$.
5) Individual base-pair contributions are independent and therefore additive.
A loss of binding affinity at one position may be compensated by a gain at another position.
Sites have binding affinities in a limited range around a required level $E$.
In such a set of sites, the local contributions from all positions must sum to $E$.
What is the frequency $f_{lB}$ with which a certain base pair $B$ appears at a certain position $l$ in a site?
The same question is asked in statistical mechanics: given $s$ independent particles in a system and a given total energy $E$, what is the probability that particle $l$ will have the energy $\varepsilon_{lB}$?
1
2
f lB ( E ) 
n

1
4 ql
e
  lB
- is determined by the density of potential sites, i.e. by the number
of possible sequence combinations that have the required
descrimination energy E
obs
 lB  ln( f l obs
f
0
lB )
For any sequence $X = B_1 B_2 \dots B_s$ of length $s$, the actual
discrimination energy is:
$$ E(X) = \sum_{l=1}^{s} \varepsilon_{lB_l} = \frac{1}{\lambda} \sum_{l=1}^{s} \ln\left( f_{l0}^{obs} / f_{lB_l}^{obs} \right) $$
Small-sample effect
$$ f_{lB} = \frac{n_{lB} + 1}{N + 4}, \qquad \lambda \varepsilon_{lB} = \ln\left( \frac{n_{l0} + 1}{n_{lB} + 1} \right) $$
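The small-sample estimate can be turned into a short calculation: count the bases at each position of a set of aligned sites, add a pseudocount of 1, and sum the per-position terms to obtain the discrimination energy of a candidate sequence in units of $1/\lambda$. This is a minimal sketch under stated assumptions: the function names are hypothetical, the cognate base at each position is taken to be the most frequent one, and the five toy sites are chosen only for illustration.

```python
import math

BASES = "ACGT"

def count_matrix(sites):
    """n_lB: number of aligned sites carrying base B at position l."""
    length = len(sites[0])
    return [{b: sum(s[l] == b for s in sites) for b in BASES} for l in range(length)]

def energy_matrix(sites):
    """lambda * eps_lB = ln((n_l0 + 1) / (n_lB + 1)), with n_l0 the largest count
    at position l (assumption: the most frequent base is the cognate base)."""
    matrix = []
    for counts in count_matrix(sites):
        n0 = max(counts.values())
        matrix.append({b: math.log((n0 + 1) / (n + 1)) for b, n in counts.items()})
    return matrix

def discrimination_energy(seq, matrix):
    """lambda * E(X): sum of the per-position terms, assuming additivity."""
    return sum(matrix[l][b] for l, b in enumerate(seq))

# Hypothetical training set of five aligned 7-bp sites.
sites = ["TGACTCA", "TGAGTCA", "TGACTCA", "TGAGTCA", "TGATTTA"]
matrix = energy_matrix(sites)

print(discrimination_energy("TGACTCA", matrix))  # sequence built from cognate bases
print(discrimination_energy("AGACTCA", matrix))  # one unseen base at position 1
```

A base never observed at a position (count 0) still receives a finite energy, ln(6) for five sites, which is the practical purpose of the pseudocount.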
Problems:
1. Small sets of sites
2. Homology between sites
3. Specific function of nucleotides in certain positions
4. Correlations between positions (not additive effect)
TFBS identification
$$ q = \frac{\sum_{i=1}^{L} I(i)\, f_{i,b_i} - \sum_{i=1}^{L} I(i)\, f_i^{\min}}{\sum_{i=1}^{L} I(i)\, f_i^{\max} - \sum_{i=1}^{L} I(i)\, f_i^{\min}} $$
with: $b_i$, the nucleotide found in the i-th position of the test sequence,
$f_{i,b_i}$, frequency of nucleotide $b_i$ in the i-th position of the aligned training sequences,
$f_i^{\min}$, minimum frequency in position i,
$f_i^{\max}$, maximum frequency in position i,
and
$$ I(i) = \sum_{B \in \{A,T,G,C\}} f_{i,B} \ln(4 f_{i,B}), \qquad i = 1, 2, \dots, L $$
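As an illustration of the score $q$ and the information vector $I(i)$, the sketch below evaluates a candidate sequence against a frequency matrix. It assumes that the score is normalised between the minimum and maximum achievable weighted sums, so that the best possible sequence scores 1 and the worst scores 0; the function names and the four-position matrix are hypothetical.

```python
import math

def information_vector(freqs):
    """I(i) = sum over B of f_iB * ln(4 * f_iB), taking 0 * ln(0) as 0."""
    return [sum(f * math.log(4 * f) for f in col.values() if f > 0) for col in freqs]

def match_score(seq, freqs):
    """Weighted sum for seq, rescaled between the minimum and maximum sums.
    Undefined (division by zero) if every column is uniform."""
    info = information_vector(freqs)
    current = sum(w * col[b] for w, col, b in zip(info, freqs, seq))
    lowest = sum(w * min(col.values()) for w, col in zip(info, freqs))
    highest = sum(w * max(col.values()) for w, col in zip(info, freqs))
    return (current - lowest) / (highest - lowest)

# Hypothetical 4-position frequency matrix (each column sums to 1).
freqs = [
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]

print(match_score("ACGT", freqs))  # most frequent base at every position
print(match_score("AAGT", freqs))  # one mismatch at the weakly conserved position 2
print(match_score("CAAT", freqs))  # least frequent base at every informative position
```

The uniform fourth column has $I(i) = 0$ and therefore does not influence the score, which is the purpose of weighting by the information vector.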
Calculating the Ci-values
$$ C_i(i) = \frac{100}{\ln 5} \left( \sum_{B \in \{A,C,G,T,\mathrm{gap}\}} P(i,B) \ln P(i,B) + \ln 5 \right) $$
Counts, frequencies P and resulting Ci-values for an alignment of five sequences:
Pos   A  C  G  T  gap   P(A)  P(C)  P(G)  P(T)  P(gap)   sum P(B)*lnP(B)+ln(5)   Ci
1     1  1  1  1  1     0.2   0.2   0.2   0.2   0.2      0.00                    0
2     2  2  1  0  0     0.4   0.4   0.2   0     0        0.55                    34
3     0  0  0  5  0     0     0     0     1     0        1.61                    100
4     0  0  5  0  0     0     0     1     0     0        1.61                    100
5     4  1  0  0  0     0.8   0.2   0     0     0        1.11                    69
6     0  3  2  0  0     0     0.6   0.4   0     0        0.94                    58
7     0  0  0  5  0     0     0     0     1     0        1.61                    100
8     1  4  0  0  0     0.2   0.8   0     0     0        1.11                    69
9     5  0  0  0  0     1     0     0     0     0        1.61                    100
10    2  0  3  0  0     0.4   0     0.6   0     0        0.94                    58
11    2  2  1  0  0     0.4   0.4   0.2   0     0        0.55                    34
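The Ci-values of the worked example can be recomputed from the count table with a few lines of code. This is a sketch for checking the arithmetic, assuming that empty cells contribute nothing to the sum (0 · ln 0 = 0) and that the slide rounds to the nearest integer; the names `COUNTS` and `ci_value` are illustrative.

```python
import math

# Counts per position for (A, C, G, T, gap), copied from the slide's table.
COUNTS = [
    (1, 1, 1, 1, 1),
    (2, 2, 1, 0, 0),
    (0, 0, 0, 5, 0),
    (0, 0, 5, 0, 0),
    (4, 1, 0, 0, 0),
    (0, 3, 2, 0, 0),
    (0, 0, 0, 5, 0),
    (1, 4, 0, 0, 0),
    (5, 0, 0, 0, 0),
    (2, 0, 3, 0, 0),
    (2, 2, 1, 0, 0),
]

def ci_value(counts):
    """Ci = 100 / ln 5 * (sum over B of P(B) ln P(B) + ln 5)."""
    total = sum(counts)
    plogp = sum((n / total) * math.log(n / total) for n in counts if n > 0)
    return 100.0 / math.log(5) * (plogp + math.log(5))

ci = [round(ci_value(c)) for c in COUNTS]
print(ci)
```

A fully conserved position scores 100 and a position where all five symbols are equally frequent scores 0, so Ci acts as a per-position conservation weight.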
Scoring of the match
To make it fast
Preselection with the core:
Position   1   2   3    4    5   6   7    8   9    10  11
A          1   2   0    0    4   0   0    1   5    2   2
C          1   2   0    0    1   3   0    4   0    0   2
G          1   1   0    5    0   2   0    0   0    3   1
T          1   0   5    0    0   0   5    0   0    0   0
-          1   0   0    0    0   0   0    0   0    0   0
Ci         0   34  100  100  69  58  100  69  100  58  34
core               T    G    A   C   T
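The core marked on this slide (T G A C T at positions 3 to 7) coincides with the window of five consecutive positions whose Ci-values sum to the maximum. The sketch below reproduces that selection; treating the core as "the five most conserved consecutive positions" is an assumption about how the core is defined, and the function names are hypothetical.

```python
# Ci-values and nucleotide counts (A, C, G, T) per position, from the slide.
CI = [0, 34, 100, 100, 69, 58, 100, 69, 100, 58, 34]
COUNTS = [
    (1, 1, 1, 1), (2, 2, 1, 0), (0, 0, 0, 5), (0, 0, 5, 0),
    (4, 1, 0, 0), (0, 3, 2, 0), (0, 0, 0, 5), (1, 4, 0, 0),
    (5, 0, 0, 0), (2, 0, 3, 0), (2, 2, 1, 0),
]

def core_start(ci, width=5):
    """0-based start of the first window of `width` positions with maximal Ci sum."""
    sums = [sum(ci[i:i + width]) for i in range(len(ci) - width + 1)]
    return sums.index(max(sums))

def consensus(counts, start, width=5):
    """Most frequent nucleotide at each position of the window."""
    return "".join("ACGT"[col.index(max(col))] for col in counts[start:start + width])

start = core_start(CI)
core = consensus(COUNTS, start)
print(start + 1, start + 5, core)  # 1-based first and last core position, consensus
```

Scoring only these five positions first, and evaluating the full matrix only where the core score passes its cut-off, is what makes the preselection fast.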
TRANSFAC: MatchTM tool
TRANSFAC: MatchTM output
Selection of optimal cut-offs
[Plot: underprediction error, overprediction error and error sum (y-axis, 0 to 100) against the cut-off value (x-axis, 0.75 to 1). Marked cut-offs: minFN, minSUM, minFP.]
Example of a search using cut-offs to minimize false negative matches
In this example we searched the Homo sapiens angiotensinogen gene (5' region and
exon 1) for all binding sites listed in the features of its GenBank entry. We used
the cut-offs that minimize false negative matches, as these are recommended to
reduce the probability that Match misses a potential binding site. Corresponding
hits could be found in the Match output for every entry of the feature table that
concerns a binding site.
[Table panels: Feature table of GenBank entry; Corresponding hits found by Match. Columns: Matrix-Identifier, Position, Core Similarity, Matrix Similarity, Sequence, Factor Name.]
TRANSPLORER (TRANScription exPLORER) is a software package for the analysis of transcription regulatory sequences.
Currently, the TRANSPLORER site prediction tool uses collections of position weight matrices (PWMs). It can use several matrix
sources: the largest and most up-to-date library of matrices, derived from the TRANSFAC® Professional database; other matrix
libraries; and any user-developed matrix libraries. It can therefore search for a great variety of
transcription factor binding sites. A search can use all matrices from the libraries or any subset of them.
Search for most probable binding sites regulating gene expression
Search for binding sites coinciding with SNPs
TRANSCompel®
a database on composite regulatory elements
Key topics
•pairs of closely situated binding sites for TFs;
•cooperative functioning of transcription factors;
•direct protein-protein interactions;
•combinatorial regulation of gene transcription.
individual entry
Description of an evidence
(experiment, cell type, two
individual interactions)
Link to the
TRANSFAC GENE
table
Link to the
EMBL
Link to the
TRANSFAC
FACTOR table
TRANSCompel®
combinatorial regulation, more than 360 CEs
N
1.
Gene
IgH , Mus
musculus
2.
IL-2, Homo
sapiens
Scheme of CE
Ets
-283
:
-268
:
NFAT
3.
4.
-167
:
IL-2, Homo
sapiens
-167
:
IgH ,
Homo sapiens
6.
Serum amyloid
A1, Rattus norv.
7.
IRF-1, Mus
musculus
AP-1
-142
:
AP-1
5.
AP-1
-142
:
NF-κB
IL-2, Mus
musculus
AP-1
Ets
Oct-2
CBF
-117
:
-73
:
NF-κB
C/EBP
-123
:
STAT-1
-113
:
-49
:
-40
:
NF-κB
TRANSCompel®
functional classification of the composite elements
inducible/inducible
- Ca2+ and PKC response
- IFN-gamma and TNF-alpha response
NFAT / AP1
NF-kappaB / IRF
inducible/constitutive
- cholesterol level response
- acute-phase response
SREBP / Sp1
STAT-3 / Sp1
inducible/tissue-restricted
- TGF-beta response in B-cells
SMAD / AML
tissue-restricted/tissue-restricted
- pancreas islet beta-cells (insulin-producing)
HNF3 / BETA2
- pituitary gonadotropes
Ptx1 / SF-1
tissue-restricted/ubiquitous
- macrophages
PU.1 / Sp1
Inducible/inducible
19 CEs ETS / AP-1, providing cross-coupling of Ras/Raf- and PKC-dependent signalling pathways;
15 CEs NFATp / AP-1, providing cross-coupling of Ca2+- and PKC-dependent signalling pathways;
14 CEs NF-κB / C/EBP: NF-κB is inducible by IL-1 and TNF-α; C/EBP is inducible by IL-6.
[Table: number of CEs by functional class of the two factors F1 and F2. Classes: tissue-specific, inducible, cell-cycle dependent, developmental-stage dependent, ubiquitous constitutive. Cell values in extraction order: 32, 44, 119, 1, 2, 39, 2, 3, 60, 2, 12.]
Inducible/constitutive
9 CEs ETS / Sp1: ETS factors are inducible through the Ras/Raf-dependent signalling pathway;
5 CEs Smad / TEF3: Smads are inducible by TGF-β signalling.
[Table: number of CEs by functional class of the two factors F1 and F2. Classes: tissue-specific, inducible, cell-cycle dependent, developmental-stage dependent, ubiquitous constitutive.]
Inducible/tissue-restricted
CEs Pit-1 / AP-1: Pit-1 is a pituitary-restricted transcription factor, whereas AP-1
and Ets are ubiquitous inducible factors;
[Table: number of CEs by functional class of the two factors F1 and F2. Classes: tissue-specific, inducible, cell-cycle dependent, developmental-stage dependent, ubiquitous constitutive.]
TRANSCompel®
antagonistic type of CEs
SRF mediates the rapid, transient
induction of the c-fos proto-oncogene
by serum growth factors.
human c-fos
SRF
acaggaTGTCCATATTAGGacatctgcg
YY-1
YY1 diminishes both basal and
serum-induced expression of c-fos.
Antagonistic composite elements
COMPEL: C00006
Chicken embryonic -globin gene
Sp1
Sp1 cooperatively with NF-Y activates
transcription
in primitive erythroid cells
NF-Y
GGTGGGcctccggagtgaccaatgagtgTGGACAGATGCCA
NF-1
NF-1 represses transcription
in adult cells
COMPEL: C00009
Human c-fos proto-oncogene
SRF mediates the rapid, transient induction of the c-fos proto-oncogene by serum growth factors.
SRF
acaggaTGTCCATATTAGGacatctgcg
YY-1
YY1 diminishes both basal and serum-induced expression of c-fos.
COMPEL: C00054
Rat serum amyloid A1 gene
C/EBP
NF-κB
C/EBP and NF-κB synergistically
activate transcription in liver cells
during acute phase response
TGGTAGTCTTGCACAGGAAATGACATggtGGGACTTTCCCcaggg
YY-1
YY1 represses inducible transcription of this
gene.
Catch®
pattern-based search for potential
composite elements in DNA sequences
• All CE‘s are used as individual searching patterns;
• Several parameters are available restricting the search:
mismatches in the site 1 and site 2,
distance between two sites,
composite score
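The search parameters listed above (mismatches allowed in site 1 and site 2, and the distance between the two sites) can be illustrated with a small pattern-pair scanner. This is only a sketch of the idea, not the Catch implementation: all function names, the two patterns and the toy sequence are assumptions, only one strand and one site order are scanned, and the composite score is left out.

```python
def mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def find_sites(seq, pattern, max_mismatch):
    """0-based starts of all windows within max_mismatch of the pattern."""
    k = len(pattern)
    return [
        i for i in range(len(seq) - k + 1)
        if mismatches(seq[i:i + k], pattern) <= max_mismatch
    ]

def find_composite(seq, site1, site2, max_mm1=0, max_mm2=0, min_gap=0, max_gap=20):
    """Pairs (start1, start2, gap), 1-based, with site2 downstream of site1."""
    seq = seq.upper()
    hits = []
    for i in find_sites(seq, site1, max_mm1):
        for j in find_sites(seq, site2, max_mm2):
            gap = j - (i + len(site1))  # bases between the end of site 1 and site 2
            if min_gap <= gap <= max_gap:
                hits.append((i + 1, j + 1, gap))
    return hits

# Hypothetical example: an Ets-like core followed by an AP-1-like site.
toy = "ccAGGAAGTgcTGAGTCAtt"
pairs = find_composite(toy, "GGAA", "TGACTCA", max_mm1=0, max_mm2=1, max_gap=10)
print(pairs)
```

Tightening `max_gap` or the mismatch limits reduces the number of reported pairs, which is the trade-off the search parameters control.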
TRANSCompel®
CEs of similar structure can be used to construct models
1. matrix rule
Set of CEs
CCACCCATTTCCTC
ACAGGAATgacctggtgcCTCGCCC
TTCCTCctgtgccttag...ctgtttttctaaCCGCCC
M1
M2
qM1 > n1
qM2 > n2
GAAGGGCGGGGAcagtt...aagcaaaaAAAGGGAACTGA
AAAGGGAACTGAgtggctgcgaaAGGGTGGGG
GGAAgcaaccagCCCACCA
CCGGAAGCaaccagCCCACC
aaAAGGAAGTGGGCGTGGTttaaag
2. distance rule
rules
3. orientation rule
ACTTCCTC...GGCTCCTCCTCC
Search for the potential CEs
in 100 000 bp
Application of CE models for promoter analysis
[Bar chart. CE types: Myb/Aml, NF-kB/Sp1, Ets/Sp1, E2F/Sp1. Sequence sets: promoters -350/+50, exons_3d, h_chr_15 (whole), h_chr_15_Alu, h_chr_15_L1, h_chr_15_L2. Y-axis: 0 to 180.]
Four CE types are over-represented in promoters in
comparison with several biological sequences tested.
Gene expression profiling
GENE ONTOLOGY
TM
TRANSGENOME
TRANSGENOME provides the hierarchical structure of the most important elements of a genome in coding regions as well as in regulatory
regions. This structure provides the possibility to have a unique reference sequence and to store the location of all gene regulatory and
structural elements.
TF binding
sites
Composite
elements
Regulatory
regions
Genome Reference
Sequence
Gene
Repeats
S/MARs
Transcripts
Splicing variants
Polypeptides
TRANSFAC derived
start of transcription
(by relative site positions)
site
RefSeq derived
potential starts of
transcription (first
exons)
Gene
pre-mRNAs
(from RefSeq)
DBTSS derived
start of transcription
EPD derived
start of transcription
spliced mRNAs
CDS
5’UTR
3’UTR
Bronchial tree and Intrapulmonary Airways
Human body
Lung
Bronchial tree
Main bronchus
Lobar bronchus
Segmental bronchus
Bronchus
Bronchiolus
Terminal
bronchiolus
Alveolar sac
Pulmonary
alveolus
Alveolar
pore
Alveolar
epithelium
Pneumocytes
Cytomer/Content
Respiratory
bronchiolus
Alveolar
duct
Alveolar
septa
CYTOMER structure
Species
ID
Name
CP
Cell
ID
Name
Description
Organ
ID
Name
Parent
HUB
ID
Cytomer_no
Organ_ID
Cell_ID
System_ID
Period_ID
Species_ID
System
ID
Name
Period
ID
T1
T2
Stage
Stage2Period
Stage_ID
Period_ID
ID
T1
T2
description
ID
TFacc
Cytomer_no
Cacc
CP
Transfac
Factor
ID
Acc
CN
ID
TFacc
Cytomer_no
Cacc
CN
CYTOMER®
A database on gene expression sources
UniGene
EST
TRANSGENOME
Gene expression
group 1
Gene expression
group 2
Gene expression
group 3
The gene expression space
Expression space E
x
Factors controlling
transcription:
TRANSFAC
Expression space Eg of gene g
(spatial axis:
systems,
organs,
cells)
c
(conditional)
t
(temporal,
developmental axis)
Conditional
determinants:
TRANSPATH
Spatio-temporal
coordinates:
CYTOMER
Gene expression profiling
Expression matrix:
- rows representing genes
- columns representing samples (various tissues, developmental stages, ...)

        E1     E2     ..   ES
g1      x1,1   x2,1   ..   xS,1
g2      x1,2   x2,2   ..   xS,2
:       :      :           :
gh      x1,h   x2,h   ..   xS,h

Each row is the expression pattern of one gene (e.g. g1).
Each column is the expression profile of one state (e.g. E1, in organ O at stage t).
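The two readings of the expression matrix, one row as the expression pattern of a gene and one column as the expression profile of a state, can be sketched as follows. The gene names, state names and values are hypothetical placeholders, not data from the presentation.

```python
# Hypothetical expression matrix: rows are genes, columns are states (samples).
genes = ["g1", "g2", "g3"]
states = ["E1", "E2"]
matrix = [
    [5.0, 0.5],  # g1
    [1.2, 1.1],  # g2
    [0.0, 7.3],  # g3
]

def expression_pattern(gene):
    """Row of the matrix: expression of one gene across all states."""
    return dict(zip(states, matrix[genes.index(gene)]))

def expression_profile(state):
    """Column of the matrix: expression of all genes in one state."""
    j = states.index(state)
    return {g: row[j] for g, row in zip(genes, matrix)}

print(expression_pattern("g1"))
print(expression_profile("E1"))
```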
Gene expression profiling
Expression state
Gene
Gene expression profiling