Computational Survey of Putative Bidirectional Promoters

Mauli Prasad
Primary Advisor: Dr. Qunfeng Dong
Secondary Advisor: Dr. Haixu Tang
School of Informatics, Indiana University, Bloomington.
© 2002 by Bruce Alberts, Alexander Johnson, Julian
Lewis, Martin Raff, Keith Roberts, and Peter Walter.
http://stemcells.nih.gov/StaticResources/info/scireport/images/figurea6.jpg
[Figure: orientations of adjacent gene pairs (GENE 1, GENE 2) on the 5’-3’ and 3’-5’ strands: Head to Head, Head to Tail, Tail to Tail]

For a promoter to be called BIDIRECTIONAL it should satisfy two conditions [1]:
1. Adjacent genes should be in the Head to Head orientation
2. Their transcription start sites should be not more than 1000bp apart
• Promoters of many co-expressed bidirectional gene pairs are capable of initiating transcription in both directions [human genome]. Trinklein et al. (2004) [2]
• Compared to Tail to Tail, the Head to Head gene arrangement is more conserved [vertebrates]. Yang et al. (2008) [3]
• Co-expression of adjacent gene pairs [yeast]. S. Kruglyak and H. Tang (2000) [4]
• Orientation affects co-expression of neighboring genes [Arabidopsis thaliana]. Williams et al. (2004) [5]

Search for and analysis of possible bidirectional promoters in Arabidopsis thaliana
Arabidopsis thaliana (wall cress/mouse-ear cress)
• Model organism for plants
• Herbaceous dicot (Brassicaceae family)
• Plants of economic importance: cabbage, broccoli, turnips, mustard, rapeseed
Mining adjacent gene pairs
Microarray data could suggest co-regulation.
If the gene pairs are co-expressed:
1. What is the most prevalent intergenic distance?
2. Are there any common motifs?
3. Identification of transcription factors.
4. Head to Head vs. the rest.
5. Distance conservation and orientation patterns in Brassica rapa.
Dataset - Gene annotation in GFF format from The Arabidopsis Information Resource
ftp://ftp.arabidopsis.org/home/tair/Genes/TAIR8_genome_release

1. Classify adjacent gene pairs as Head to Head, Head to Tail, or Tail to Tail.
2. Keep all pairs within 500bp.
3. Remove pseudogenes, transposons, and RNAs.
4. Remove duplicates (bl2seq, E-value cutoff 1e-5).

Orientation     Gene pairs
Head to Head    1369
Head to Tail    3807
Tail to Tail    2674
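The orientation classification and the 500bp filter can be sketched as follows. This is a minimal illustration, not the pipeline actually used: the `Gene` record, the toy coordinates, and the choice to measure the gap between gene bodies are all assumptions, and the genes are assumed to lie on one chromosome.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Gene:
    gene_id: str
    start: int   # 1-based inclusive coordinates, as in GFF
    end: int
    strand: str  # "+" or "-"

def orientation(left, right):
    """Classify a pair given in left-to-right chromosomal order."""
    if left.strand == "-" and right.strand == "+":
        return "H_H"  # divergent: the two 5' ends face each other
    if left.strand == "+" and right.strand == "-":
        return "T_T"  # convergent: the two 3' ends face each other
    return "H_T"      # same strand

def adjacent_pairs(genes, max_gap=500):
    """Return (left_id, right_id, orientation, gap) for neighbors within max_gap bp."""
    ordered = sorted(genes, key=lambda g: g.start)
    pairs = []
    for left, right in zip(ordered, ordered[1:]):
        gap = right.start - left.end - 1  # bases strictly between the two genes
        if 0 <= gap <= max_gap:
            pairs.append((left.gene_id, right.gene_id, orientation(left, right), gap))
    return pairs

# Hypothetical genes, for illustration only
genes = [
    Gene("g1", 100, 900, "-"),
    Gene("g2", 1100, 2000, "+"),
    Gene("g3", 2300, 3000, "-"),
    Gene("g4", 3200, 4000, "-"),
    Gene("g5", 6000, 7000, "+"),
]
pairs = adjacent_pairs(genes)
for pair in pairs:
    print(pair)
```

Overlapping genes (negative gap) are skipped here; whether the original analysis kept them is not stated in the slides.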
Dataset - Pre-processed expression data for 22810 probe sets on the Affymetrix
Arabidopsis ATH1 (25K) array across 1436 hybridization experiments.
ftp://ftp.arabidopsis.org/home/tair/Microarrays/analyzed_data/affy_data_1436_10132005.zip
1. Start with 1436 Affymetrix Arabidopsis 25K arrays obtained from NASCArrays and AtGenExpress.
2. Normalize the data using the robust multi-array average (RMA) method.
3. Match probes to the gene pairs obtained.
4. For each pair, calculate the correlation coefficient.
5. Plot the percentage of gene pairs against the correlation coefficient.
6. Based on an appropriate cut-off for the correlation coefficient, select the highly co-expressed gene pairs.
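The per-pair correlation and cut-off steps might look like the sketch below. The slides say only "correlation coefficient", so the use of Pearson's r is an assumption, as is reading the 60% threshold as r >= 0.6; the gene names and expression values are made up.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Hypothetical normalized expression values across four arrays
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.1, 3.9, 6.2, 8.0],
    "geneC": [4.0, 1.0, 3.0, 2.0],
}
gene_pairs = [("geneA", "geneB"), ("geneA", "geneC")]

CUTOFF = 0.6  # assumed reading of the ">=60%" threshold
highly_coexpressed = [
    (a, b) for a, b in gene_pairs
    if pearson(expression[a], expression[b]) >= CUTOFF
]
print(highly_coexpressed)
```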
                               H_H    H_T    T_T    Non-Adjacent
No. of pairs matching probes   842    2367   1642   624
Cor. pairs                     55     80     43     45
% Cor. pairs                   6.5    3.3    2.6    0.9
         H_H    [H_T]+[T_T]
>=60%    55     122
<=60%    787    3887

         H_T    [H_H]+[T_T]
>=60%    79     98
<=60%    2288   2386

         T_T    [H_H]+[H_T]
>=60%    43     134
<=60%    1599   3075
• We want to test whether the highly co-expressed gene pairs are significantly associated with the H_H orientation (potentially containing a bidirectional promoter).
• Fisher's exact test is used to examine the significance of the association between two variables in a 2 x 2 contingency table.
• Here the sample is divided into H_H and non-H_H (the 1st variable) vs. highly co-expressed gene pairs and the remaining gene pairs (the 2nd variable).
Fisher's Exact P-values
H_H vs. [H_T]+[T_T]: 6.20E-06
H_T vs. [H_H]+[T_T]: 0.2836
T_T vs. [H_H]+[H_T]: 0.005855
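For readers who want to recompute the test from the contingency tables, here is a minimal pure-Python sketch of a two-sided Fisher's exact test. It sums the hypergeometric probabilities of all tables that are no more likely than the observed one, which is one common two-sided convention; the slides do not say which software or convention produced the reported P-values, so small differences are possible.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2 x 2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Add up every table that is no more likely than the observed one
    p_value = sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-7)
    )
    return min(1.0, p_value)

# Cell order: (>=60% in group, >=60% in rest, <=60% in group, <=60% in rest)
tables = {
    "H_H": (55, 122, 787, 3887),
    "H_T": (79, 98, 2288, 2386),
    "T_T": (43, 134, 1599, 3075),
}
p_values = {name: fisher_exact_two_sided(*cells) for name, cells in tables.items()}
for name, p in p_values.items():
    print(name, p)
```

Exact integer binomial coefficients are used so that the large margins (several thousand pairs) do not cause overflow or rounding problems before the final division.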
Category   Intergenic distance (bp)
1          0-50
2          50-100
3          100-150
4          150-200
5          200-250
6          250-300
7          300-350
8          350-400
9          400-450
10         450-500
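Assigning a distance to one of the ten categories is a simple integer division. The sketch below assumes half-open bins (a distance of exactly 50bp goes to category 2) with 500bp placed in category 10; the slides do not say how boundary values were handled.

```python
from collections import Counter

BIN_WIDTH = 50
N_BINS = 10

def distance_category(distance_bp):
    """Map an intergenic distance (0-500 bp) to a category number 1-10."""
    if not 0 <= distance_bp <= BIN_WIDTH * N_BINS:
        raise ValueError("distance outside the 0-500 bp range")
    # Integer division gives the 0-based bin; 500 bp is folded into the last bin
    return min(distance_bp // BIN_WIDTH + 1, N_BINS)

# Hypothetical distances, for illustration only
distances = [0, 49, 50, 125, 130, 499, 500]
histogram = Counter(distance_category(d) for d in distances)
print(sorted(histogram.items()))
```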
• Test whether the intergenic distance distribution in highly co-expressed gene pairs varies significantly from that of gene pairs having low co-expression.
• A leave-one-out technique was used to see which of the distance categories contributed most.
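The leave-one-out idea can be illustrated as below. The slides do not name the test applied to the distance distributions, so this sketch substitutes a Pearson chi-square statistic as a stand-in, and the counts are hypothetical. The category whose removal makes the difference between the two groups shrink the most is the one contributing most.

```python
def chi_square_statistic(high, low):
    """Pearson chi-square statistic for a 2 x k table of category counts."""
    total_high, total_low = sum(high), sum(low)
    total = total_high + total_low
    stat = 0.0
    for h, l in zip(high, low):
        column = h + l
        for observed, row_total in ((h, total_high), (l, total_low)):
            expected = row_total * column / total
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat

def leave_one_out(high, low):
    """Recompute the statistic with each category (1-based) removed in turn."""
    results = {}
    for skip in range(len(high)):
        reduced_high = high[:skip] + high[skip + 1:]
        reduced_low = low[:skip] + low[skip + 1:]
        results[skip + 1] = chi_square_statistic(reduced_high, reduced_low)
    return results

# Hypothetical counts per distance category (1-10); category 3 is over-represented
high_counts = [5, 5, 25, 5, 5, 5, 5, 5, 5, 5]
low_counts = [50] * 10

full_stat = chi_square_statistic(high_counts, low_counts)
loo = leave_one_out(high_counts, low_counts)
most_influential = min(loo, key=loo.get)
print(full_stat, most_influential)
```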
[Figure: distribution of gene pairs across distance categories 1-10 for the >=60 and <=60 correlation groups, three panels]
P-values by orientation, with each distance category left out in turn

Left out    H_H          H_T       T_T
All         0.0009866    0.5259    0.3219
0-50        0.0006511    0.5844    0.2452
50-100      0.0006366    0.4407    0.2894
100-150     0.3647       0.6722    0.491
150-200     0.00038      0.5111    0.2689
200-250     0.001641     0.4542    0.2722
250-300     0.002395     0.4269    0.623
300-350     0.0004404    0.4951    0.2666
350-400     0.0011       0.4406    0.3198
400-450     0.0007382    0.5026    0.2819
450-500     0.002538     0.6561    0.22958
[Figure: three pie charts of molecular function categories (Unknown molecular function, Enzyme activity, Binding, Transporter, Transcription factor, Structural molecule, Other molecular function). Slice percentages as extracted, without category assignment:
Chart 1 (Total = 126): 6%, 0%, 2%, 8%, 25%, 38%, 21%
Chart 2 (Total = 174): 3%, 2%, 3%, 3%, 25%, 34%, 30%
Chart 3 (Total = 118): 3%, 4%, 3%, 2%, 16%, 39%, 33%]
Intergenic regions of highly co-expressed Head to Head pairs were provided to MEME with the following parameters:
• Any number of repetitions of the motif was allowed
• E-value cutoff 0.1
Motif 1 (E = 8.6e-037): Regulates gene expression during initiation of axillary bud outgrowth in Arabidopsis.
Motif 2 (E = 3.6e-007): Ascorbate oxidase gene (AO) promoter; found in silencer region; AOBP (AGTA repeat binding protein) has a DOF domain required for repression of expression of the AO gene.
Motif 3 (E = 1.9e-002): Light responsive element (LRE) found in the parsley (P.c.) CHS-1 (chalcone synthase-1) gene promoter.
Orientation     No. of pairs with Motif 1    Pairs without Motif 1    Total number
Head to Head    20                           35                       55
Head to Tail    12                           67                       79
Tail to Tail    3                            40                       43
                 H_H    [H_T]+[T_T]
#enriched        20     15
#not enriched    35     107
Fisher's Exact P-value: 0.0004077

                 H_T    [H_H]+[T_T]
#enriched        12     23
#not enriched    67     75
Fisher's Exact P-value: 0.1883

                 T_T    [H_T]+[H_H]
#enriched        3      32
#not enriched    40     102
Fisher's Exact P-value: 0.0277
The position-specific probability matrix from MEME was provided to TESS along with the intergenic regions of highly correlating gene pairs in all orientations.
AT1G09760-AT1G09770: [protein binding, response to cold]-[DNA binding, transcription factor activity, regulation of transcription-defense response signaling pathway]
AT5G23080-AT5G23090: [RNA binding, RNA processing]-[intracellular, transcription factor activity, regulation of transcription]
AT5G64670-AT5G64680: [ribosome, structural constituent of ribosome, translation, ribosome biogenesis and assembly]-[ribosome, structural constituent of ribosome, translation, ribosome biogenesis and assembly]
AT2G40650-AT2G40660: [binding, RNA processing]-[tRNA binding, tRNA aminoacylation for protein translation]
AT3G46030-AT3G46040: [nucleus, DNA binding, nucleosome assembly, nucleosome]-[structural constituent of ribosome, translation, cytosolic small ribosomal subunit]
AT1G23280-AT1G23290: [MAK16 protein-related]-[Encodes a ribosomal protein L27A, a constituent of the large subunit of the ribosomal complex]
AT1G76400-AT1G76405: [endoplasmic reticulum, oligosaccharyl transferase activity, protein amino acid glycosylation]-[similar to chloroplast channel forming outer membrane protein [Pisum sativum] (GB:CAB58442.1)]
AT5G05670-AT5G05680: [endoplasmic reticulum, signal recognition particle binding]-[nuclear pore complex protein-related]
AT2G20480-AT2G20490: [similar to Os09g0446000 [Oryza sativa (japonica cultivar-group)] (GB:NP_001063306.1)]-[Cajal body, nucleolus, RNA binding, polar nucleus fusion]
AT3G56990-AT3G57000: [EDA7 (embryo sac development arrest 7)]-[nucleolar essential protein-related]
AT4G18360-AT4G18370: [glycolate oxidase activity, electron transport, metabolic process]-[chloroplast thylakoid lumen, serine-type peptidase activity, trypsin activity, proteolysis, photosystem II repair]
AT4G33500-AT4G33510: [protein serine/threonine phosphatase activity]-[chloroplast, 3-deoxy-7-phosphoheptulonate synthase activity, aromatic amino acid family biosynthetic process, chorismate biosynthetic process]
AT4G35440-AT4G35450: [membrane, voltage-gated chloride channel activity, chloride transport]-[protein folding, defense response to bacterium, incompatible interaction, protein targeting to chloroplast, integral to chloroplast outer membrane]
AT2G35490-AT2G35500: [chloroplast thylakoid membrane, structural molecule activity]-[shikimate kinase-related]
AT2G37310-AT2G37320: [pentatricopeptide (PPR) repeat-containing protein]-[pentatricopeptide (PPR) repeat-containing protein]
AT1G13030-AT1G13040: [unknown, sphere organelles protein-related; similar to hypothetical protein [Brassica rapa] (GB:ABQ50545.1); contains domain PTHR15197 (PTHR15197)]-[pentatricopeptide (PPR) repeat-containing protein]
AT1G14270-AT1G14280: [prenyl-dependent CAAX protease activity]-[Encodes phytochrome kinase substrate 2. PKS proteins are critical for hypocotyl phototropism.]
AT3G16990-AT3G17000: [TENA/THI-4 family protein; Identical to Seed maturation protein]-[ubiquitin-protein ligase activity]
AT1G04070-AT1G04080: [P-P-bond-hydrolysis-driven protein transmembrane transporter activity, protein targeting to mitochondrion]-[regulation of timing of transition from vegetative to reproductive phase]
AT1G27390-AT1G27400: [P-P-bond-hydrolysis-driven protein transmembrane transporter activity, protein targeting to mitochondrion]-[ribosome, structural constituent of ribosome, translation]
Dataset
• Finished BACs of Brassica rapa in FASTA format
  ftp://149.155.100.41/pub/brassica/KBr_finished.fasta
• Protein sequences of the highly correlating Arabidopsis genes
  ftp://ftp.arabidopsis.org/home/tair/Sequences/blast_datasets/TAIR8_blastsets/TAIR8_pep_20080412

Blastall
• Program - tblastn
• Database - Brassica BACs
• Query - Arabidopsis protein sequences
• E-value cutoff - 1e-20
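A search with these settings could be assembled as in the sketch below, which only builds the legacy `blastall` command line and does not run it. The file names are hypothetical, the tabular output option (`-m 8`) is an addition not mentioned in the slides, and the BAC FASTA file is assumed to have been indexed beforehand with `formatdb -p F`.

```python
import shlex

def tblastn_command(query_fasta, database, output_path, evalue="1e-20"):
    """Build a legacy blastall command: protein queries vs. a nucleotide database."""
    return [
        "blastall",
        "-p", "tblastn",     # protein query against translated nucleotide database
        "-d", database,      # formatdb-indexed Brassica rapa BAC sequences
        "-i", query_fasta,   # Arabidopsis protein sequences
        "-e", evalue,        # E-value cutoff from the slide
        "-m", "8",           # tabular output (assumption, for easy parsing)
        "-o", output_path,
    ]

# Hypothetical file names, for illustration only
cmd = tblastn_command("arabidopsis_pep.fasta", "KBr_finished.fasta", "hits.tsv")
print(shlex.join(cmd))
```

The list form can be passed directly to `subprocess.run` on a machine where legacy BLAST is installed.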
                Head to Head    Head to Tail    Tail to Tail    Different BACs    Total
Head to Head    16              1               0               13                30
Head to Tail    0               19              0               23                42
Tail to Tail    0               0               6               19                25
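As a worked reading of the counts above, the sketch below computes the fraction of pairs whose orientation is unchanged, among pairs where both genes hit the same BAC. It assumes that rows give the orientation in Arabidopsis and columns the orientation of the matched pair in B. rapa, which the slides imply but do not state.

```python
ORIENTATIONS = ("Head to Head", "Head to Tail", "Tail to Tail")

# Rows: assumed Arabidopsis orientation; columns: assumed B. rapa orientation
counts = {
    "Head to Head": {"Head to Head": 16, "Head to Tail": 1, "Tail to Tail": 0, "Different BACs": 13},
    "Head to Tail": {"Head to Head": 0, "Head to Tail": 19, "Tail to Tail": 0, "Different BACs": 23},
    "Tail to Tail": {"Head to Head": 0, "Head to Tail": 0, "Tail to Tail": 6, "Different BACs": 19},
}

def conserved_fraction(row_name):
    """Share of same-BAC pairs that keep the orientation named by the row."""
    row = counts[row_name]
    same_bac = sum(row[o] for o in ORIENTATIONS)
    return row[row_name] / same_bac

fractions = {name: conserved_fraction(name) for name in ORIENTATIONS}
for name, value in fractions.items():
    print(f"{name}: {value:.2f}")
```

Pairs whose two genes match different BACs are excluded from the denominator because their relative orientation cannot be determined from this table.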
• The percentage of highly correlating pairs is highest in Head to Head.
• Highly correlating pairs in Head to Head fall within 100-150bp.
• Head to Head pairs are mostly RNA/DNA/protein binding.
• The UP1ATMSD motif is enriched in Head to Head.
• Orientation seems to be conserved in B. rapa, but intergenic distance seems to have lower conservation.
1. Adachi N, Lieber MR. Bidirectional gene organization: a common architectural feature of the human genome. Cell 2002, 109(7):807-809.
2. Trinklein ND, Aldred SF, Hartman SJ, Schroeder DI, Otillar RP, Myers RM. An abundance of bidirectional promoters in the human genome. Genome Res. 2004, 14:62-66.
3. Yang MQ, Taylor J, Elnitski L. Comparative analyses of bidirectional promoters in vertebrates. BMC Bioinformatics 2008, 9 Suppl 6:S9.
4. Kruglyak S, Tang H. Regulation of adjacent yeast genes. Trends in Genetics 2000, 16(3):109-111.
5. Williams EJB, Bowles DJ. Coexpression of neighboring genes in the genome of Arabidopsis thaliana. Genome Res. 2004, 14:1060-1067.
Dr. Qunfeng Dong
Dr. Haixu Tang
Ashwini Oke
Linda Hostetter