Promoter Discovery: A Correlation Mining Approach
Yi Lu
Department of Computer Science
Wayne State University
Outline
Introduction
Related Work
Problem Definition
Correlation Mining
Conclusion and Future Work

Introduction


Central dogma (gene expression): DNA -> RNA (transcription) -> protein (translation)
Introduction


The promoter region (a set of transcription factor binding sites) of a gene
acts as a light switch: it signals when to turn the gene on and off.
We are interested in the relationship between the promoter region and gene
expression, i.e., what kinds of binding sites determine whether a gene is
expressed?
Introduction - Microarray
[Figure: microarray chips; images scanned by laser]

Expression values from one chip:
Gene        Value
D26528_at   193
D26561_at   70
D26579_at   318
D26598_at   1764
D26599_at   1537
D26600_at   1204
D28114_at   707
H29189_at   899
G29183_at   9210

Datasets D1, D2, D3, D4, … are combined into one table:
Gene        Day 1   Day 2   Day 3   …
D26528_at   193     4157    556
D26561_at   70      11557   476
D26579_at   318     12125   498
D26598_at   1764    8484    1211
D26599_at   1537    3537    131
D26600_at   1207    4578    94
D28114_at   707     2431    209
…
Introduction

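Transcription factor binding sites (motifs) in the promoter region should
"explain" changes in transcription.
[Figure: expression levels R(t1) and R(t2) along the time course, with the t1 motif and the t2 motif marked in the promoter sequences of the genes below]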
AGCTAGCTGATTGTGCACACTGATCGAG
CCCCACCATAGCTTCGTTGTGCGCTATA
TATTGTGCAGCTAGTAGAGCTCTGCTAG
AGCTCTATTTGTGCCGATTGCGGGGCGT
CTGAGCTCTTTGCTCTTTTGTGCCGCTT
TTGATATTATCTCTCTGCTCGTTTGTGC
TTTATTGTGGGGGTTGTGCTGATTATGC
TGCTCATAGGAGATTGTGCGAGAGTCGT
CGTAGTTGTGCGTCGTCGTGATGATGCT
GCTGATCGATCGTTGTGCCTAGCTAGTA
GATCGATGTTTGTGCAGAAGAGAGAGGG
TTTTTTCGCGCCGCCCCGCGCTTGTGCT
CGAGAGGAAGTATATATTTGTGCGCGCG
CCGCGCGCACGTTGTGCAGCTGATGCAT
GCATGCTAGTATTGTGCCTAGTCAGCTG
CGATCGACTCGTAGCATGCATCTTGTGC
AGTCGATCGATGCTAGTTATTGTTGTGC
GTAGTAGTGCTTGTGCTCGTAGCTGTAG
Related work


Cluster gene expression profiles
Search for motifs in promoter regions of clustered
genes
Promoter regions:
Gene 1   AGCTAGCTGATTGTGCACAC
Gene 2   TTCGTTGTGCGCTATATAGA
Gene 3   TTGTGCAGCTAGTAGAGCTC
Gene 4   CTAGAGCTCTATTTGTGCCG
Gene 5   ATTGCGGGGCGTCTGAGCTC
Gene 6   TTTGCTCTTTTGTGCCGCTT

After clustering, the promoter regions that remain grouped together
(all of the above except ATTGCGGGGCGTCTGAGCTC) share one motif.
Motif: TTGTGC
Related work

Clustering

Partition the N genes into a set of disjoint
groups so that the expression profiles of
genes in the same group are highly
similar to each other and the
expression profiles of genes in different
groups are dissimilar.

Most widely used algorithms: K-means
clustering and hierarchical clustering
algorithms.

Genetic K-means algorithms (Lu et al.
2003, 2004).
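
The partitioning idea above can be illustrated with a minimal sketch of plain K-means (not the genetic K-means variants cited). The four expression profiles are made-up values, the function name `kmeans` is hypothetical, and taking the first k profiles as initial centroids is a simplifying assumption that keeps the run deterministic.

```python
import math

def kmeans(profiles, k, iterations=20):
    """Plain K-means on expression profiles (lists of floats).

    Assumption: the first k profiles serve as initial centroids.
    """
    centroids = [list(p) for p in profiles[:k]]
    labels = [0] * len(profiles)
    for _ in range(iterations):
        # Assignment step: each profile joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in profiles
        ]
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(profiles, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical profiles: two genes peak at the second time point,
# two genes dip at the second time point.
profiles = [
    [0.1, 0.9, 0.2],
    [0.2, 1.0, 0.1],
    [0.9, 0.1, 0.8],
    [1.0, 0.2, 0.9],
]
labels = kmeans(profiles, k=2)
print(labels)
```

Genes with the same label form one disjoint group; the label numbers themselves carry no meaning.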
Related work

Motif discovery after clustering

Given a set of upstream sequences of co-expressed genes, find subsequences
that are overrepresented and significant enough to be distinguished from other
subsequences.

MEME, Gibbs Sampling, Winnower algorithms.

PDC algorithm (Lu et al. 2006)

These methods usually have a high false positive rate.
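
As a rough illustration of what "overrepresented" means, the sketch below counts in how many sequences each fixed-length subsequence (k-mer) occurs. This is a deliberately naive stand-in, not MEME, Gibbs Sampling, Winnower, or PDC. The six sequences are the promoter regions from the clustering slide; the function name `kmer_support` and the choice k = 6 are assumptions for illustration.

```python
from collections import Counter

def kmer_support(sequences, k):
    """Count, for every k-mer, the number of sequences that contain it."""
    support = Counter()
    for seq in sequences:
        # A set ensures each k-mer is counted once per sequence.
        kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        support.update(kmers)
    return support

# Promoter regions taken from the clustering slide.
promoters = [
    "AGCTAGCTGATTGTGCACAC",
    "TTCGTTGTGCGCTATATAGA",
    "TTGTGCAGCTAGTAGAGCTC",
    "CTAGAGCTCTATTTGTGCCG",
    "ATTGCGGGGCGTCTGAGCTC",
    "TTTGCTCTTTTGTGCCGCTT",
]

support = kmer_support(promoters, 6)
best_kmer, best_count = support.most_common(1)[0]
print(best_kmer, best_count)  # TTGTGC 5
```

Raw counts like these say nothing about statistical significance, which is one reason simple overrepresentation searches produce many false positives.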
[Figure: upstream sequences of the genes, annotated with the candidate motifs CCAAT, GATAC, TCGACTGC, and GCAGTT]
Motivation

Research has indicated that multiple
transcription factor binding sites are
involved in each transcription process.
This leads us to study modules (pairs
of motifs) instead of single motifs.
Motivation
Not all genes that contain the same motif
show the same gene expression change.
Not all genes with the same gene expression
change contain the same motif.

Gene 1   ATCTTGTGCACATGTACTAC
Gene 2   AGCTAGTTGTGCACACACTT
Gene 3   AATTTCGTTGTGCGCTATAT
Gene 4   GAGCTCTTGTGCAGCTAGTA
Gene 5   TTCGTTGTGCGCTATATAGA
Gene 6   TTGTGCAGCTAGTAGAGCTC
Gene 7   CTAGAGCTCTATTTGTGCCG
Gene 8   ATTGCGGGGCGTCTGAGCTC
Gene 9   TTTGCTCTTTTGTGCCGCTT
Problem Definition
Gene        Day0     Day3     Day6     …
Mm.100117   16.75    65.3     119.15
Mm.100118   150.85   137.55   130.55
Mm.100125   84.55    96.9     119.15
Mm.10154    84.55    96.9     119.15
Mm.10174    223.05   181.55   200.9
Mm.10178    16.75    65.3     119.15
Mm.10182    79.6     80.3     94.75

Module presence (1 = present, 0 = absent) is recorded per gene for each module (ETSFETSF, NFKBSTAT, STATETSF, …):
1 0 0 0 0 0 1 … 0 0 1 0 1 0 0 0 0 0 1 0 0 1

Given a list of genes with the corresponding module presence
information and gene expression information, find the
relationship between modules and gene expression, i.e.,
which modules or module combinations may relate to a
change in gene expression.

Example: M1 M2 => increase in gene expression from Day 1 to Day 4
Method - Quantify Gene Expression
Gene Mm.116803:
Day     1       4       8       11      14      18     21     26     29      60
Value   189.9   398.3   224.1   123.4   602.7   2218   8624   9901   11748   18519

Interval             1-4      4-8     8-11    11-14   14-18   18-21    21-26    26-29   29-60
log10(D_{i+1}/D_i)   0.322    -0.25   -0.26   0.689   0.566   0.59     0.06     0.074   0.198
Mean                 0.014    0.006   0.006   0.017   0.04    0.063    0.052    0.019   0.044
Lower Bound          -0.110   -0.15   -0.12   -0.23   -0.22   -0.165   -0.225   -0.22   -0.32
Upper Bound          0.138    0.165   0.132   0.269   0.297   0.291    0.328    0.258   0.410

[Figure: the values above plotted per interval, from Day1-4 to Day29-60; vertical axis from -0.8 to 1]
Method - Quantify Gene Expression
Gene Mm.116803:
Day     1       4       8       11      14      18     21     26     29      60
Value   189.9   398.3   224.1   123.4   602.7   2218   8624   9901   11748   18519

E_i = log10(D_{i+1} / D_i), where D_i is the expression value at the i-th time point:

              E1       E2      E3      E4      E5      E6       E7       E8      E9
Interval      1-4      4-8     8-11    11-14   14-18   18-21    21-26    26-29   29-60
E_i           0.322    -0.25   -0.26   0.689   0.566   0.59     0.06     0.074   0.198
Lower Bound   -0.110   -0.15   -0.12   -0.23   -0.22   -0.165   -0.225   -0.22   -0.32
Upper Bound   0.138    0.165   0.132   0.269   0.297   0.291    0.328    0.258   0.410
Mm.116803     +        -       -       +       +       +        0        0       0
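
The quantification step can be reproduced with a short sketch. It assumes that a change is labeled "+" when E_i exceeds the upper bound, "-" when it falls below the lower bound, and "0" otherwise; the slide does not state this rule, but it is consistent with the symbols shown for Mm.116803. How the bounds themselves are derived is not given, so they are copied from the slide. The function names are hypothetical.

```python
import math

def log_ratios(values):
    """E_i = log10(D_{i+1} / D_i) for consecutive time points."""
    return [math.log10(b / a) for a, b in zip(values, values[1:])]

def discretize(ratios, lower, upper):
    """Map each ratio to '+', '-', or '0' using per-interval bounds.

    Assumed rule: above the upper bound is '+', below the lower bound is '-'.
    """
    symbols = []
    for e, lo, hi in zip(ratios, lower, upper):
        if e > hi:
            symbols.append("+")
        elif e < lo:
            symbols.append("-")
        else:
            symbols.append("0")
    return symbols

# Expression of Mm.116803 at days 1, 4, 8, 11, 14, 18, 21, 26, 29, 60.
expression = [189.9, 398.3, 224.1, 123.4, 602.7, 2218, 8624, 9901, 11748, 18519]
# Bounds copied from the slide.
lower = [-0.110, -0.15, -0.12, -0.23, -0.22, -0.165, -0.225, -0.22, -0.32]
upper = [0.138, 0.165, 0.132, 0.269, 0.297, 0.291, 0.328, 0.258, 0.410]

ratios = log_ratios(expression)
symbols = discretize(ratios, lower, upper)
print([round(r, 3) for r in ratios])
print("".join(symbols))  # +--+++000
```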
Method – Generate Frequent Module Set

Module sets with their occurrence counts (a set is frequent when its occurrence >= 2):
M1 (4), M2 (3), M3 (2), M4 (1)
M1M2 (3), M1M3 (2), M2M3 (1)
M1M2M3 (1)

         M1   M2   M3   M4
Gene 1   1    1    0    0
Gene 2   1    0    1    0
Gene 3   1    1    1    0
Gene 4   1    1    0    1
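
The counts above can be checked with a small brute-force sketch that enumerates every module combination and keeps those occurring in at least two genes. Exhaustive enumeration is an assumption made for brevity; the slides do not say which frequent-set algorithm is used, and a level-wise search would scale better. The function name `frequent_sets` is hypothetical.

```python
from itertools import combinations

def frequent_sets(presence, min_count=2):
    """Return {module combination: number of genes containing all of it}."""
    all_modules = sorted({m for mods in presence.values() for m in mods})
    frequent = {}
    for size in range(1, len(all_modules) + 1):
        for combo in combinations(all_modules, size):
            count = sum(1 for mods in presence.values() if set(combo) <= mods)
            if count >= min_count:
                frequent[combo] = count
    return frequent

# Module presence per gene, copied from the table on the slide.
presence = {
    "Gene 1": {"M1", "M2"},
    "Gene 2": {"M1", "M3"},
    "Gene 3": {"M1", "M2", "M3"},
    "Gene 4": {"M1", "M2", "M4"},
}

module_sets = frequent_sets(presence)
for combo, count in module_sets.items():
    print("".join(combo), count)
```

With the threshold of 2, the surviving sets are M1, M2, M3, M1M2, and M1M3.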
Method – Generate Frequent Gene Expression Set

Gene expression sets with their occurrence counts (a set is frequent when its occurrence >= 2):
E1+ (2), E1- (0), E2+ (1), E2- (3), E3+ (0), E3- (2)
E1+E2- (1), E1+E3- (1), E2-E3- (2)

         E1   E2   E3
Gene 1   +    +    0
Gene 2   0    -    -
Gene 3   +    -    -
Gene 4   0    -    0

The same data as a binary table:

         E1+   E2+   E3+   E1-   E2-   E3-
Gene 1   1     1     0     0     0     0
Gene 2   0     0     0     0     1     1
Gene 3   1     0     0     0     1     1
Gene 4   0     0     0     0     1     0
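
A minimal sketch of this step: each "+" or "-" symbol becomes an item such as E1+ or E2-, "0" contributes nothing, and item combinations are counted across genes. The helper names `to_items` and `frequent_expression_sets` are hypothetical, and brute-force enumeration is an assumption made for brevity.

```python
from itertools import combinations

def to_items(symbols):
    """Turn ['+', '-', '0'] into the item set {'E1+', 'E2-'}."""
    return {
        f"E{i}{s}"
        for i, s in enumerate(symbols, start=1)
        if s in ("+", "-")
    }

def frequent_expression_sets(table, min_count=2):
    """Count every item combination and keep those seen in >= min_count genes."""
    items_per_gene = [to_items(symbols) for symbols in table.values()]
    all_items = sorted(set().union(*items_per_gene))
    frequent = {}
    for size in range(1, len(all_items) + 1):
        for combo in combinations(all_items, size):
            count = sum(1 for items in items_per_gene if set(combo) <= items)
            if count >= min_count:
                frequent[combo] = count
    return frequent

# Discretized expression changes, copied from the slide.
table = {
    "Gene 1": ["+", "+", "0"],
    "Gene 2": ["0", "-", "-"],
    "Gene 3": ["+", "-", "-"],
    "Gene 4": ["0", "-", "0"],
}

expression_sets = frequent_expression_sets(table)
for combo, count in expression_sets.items():
    print("".join(combo), count)
```

With the threshold of 2, the surviving sets are E1+, E2-, E3-, and E2-E3-.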
Correlation Measure – Contingency Table

The relation between u and v in the pair (u, v) is summarized in a contingency table.
Liddell Measure

        E1+       ^E1+
M2      O11 = 2   O12 = 1   R1 = 3
^M2     O21 = 0   O22 = 1   R2 = 1
        C1 = 2    C2 = 2    N = 4

Liddell = (O11*O22 - O12*O21) / (C1*C2) = (2*1 - 1*0) / (2*2) = 0.5
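
The calculation above can be written as a small function over two parallel presence lists, one entry per gene. The function name `liddell` and the decision to return 0.0 when a column total is zero are assumptions; the slide does not cover that case.

```python
def liddell(has_u, has_v):
    """Liddell measure from two parallel 0/1 lists (one entry per gene).

    Rows of the contingency table follow u, columns follow v.
    """
    o11 = sum(1 for u, v in zip(has_u, has_v) if u and v)
    o12 = sum(1 for u, v in zip(has_u, has_v) if u and not v)
    o21 = sum(1 for u, v in zip(has_u, has_v) if not u and v)
    o22 = sum(1 for u, v in zip(has_u, has_v) if not u and not v)
    c1 = o11 + o21  # genes with v
    c2 = o12 + o22  # genes without v
    if c1 == 0 or c2 == 0:
        return 0.0  # assumption: treat an undefined ratio as no correlation
    return (o11 * o22 - o12 * o21) / (c1 * c2)

# Genes 1 to 4: presence of module M2 and of expression change E1+.
m2 = [1, 0, 1, 1]
e1_plus = [1, 0, 1, 0]
value = liddell(m2, e1_plus)
print(value)  # 0.5
```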
Method – Correlate Module Set with Gene Expression Set

Minimize the module set.
Maximize the gene expression set.
With the minimum Liddell value set to 0.5 (or -0.5 for negative correlation), the result sets are:

M2 -> E1+
M2 -> ^(E2- E3-)
M3 -> E2- E3-

Liddell values:
        E1+   E2-       E3-    E2-E3-
M1      0     0         0      0
M2      0.5   -0.3333   -0.5   -0.5
M3      0     0.66667   1      1
M1M2    0.5   -0.3333   -0.5   -0.5
M1M3    0     0.66667   1      1

Input data:
         M1   M2   M3   M4   E1   E2   E3
Gene 1   1    1    0    0    +    +    0
Gene 2   1    0    1    0    0    -    -
Gene 3   1    1    1    0    +    -    -
Gene 4   1    1    0    1    0    -    0
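
The whole step can be sketched end to end on the four-gene example. The pruning rules below are one possible reading of "minimize module set" and "maximize gene expression set": a rule is dropped when a smaller module set already gives a strong rule for the same expression set, or when the same module set gives a strong rule for a larger expression set with the same sign. That interpretation, and all function names, are assumptions.

```python
ALL_GENES = {1, 2, 3, 4}
MODULES = {"M1": {1, 2, 3, 4}, "M2": {1, 3, 4}, "M3": {2, 3}, "M4": {4}}
EXPRESSION = {"E1+": {1, 3}, "E2-": {2, 3, 4}, "E3-": {2, 3}}

def genes_with_all(table, names):
    """Genes that carry every item in names."""
    return set.intersection(*(table[n] for n in names))

def liddell(u, v, universe):
    """Liddell measure for gene sets u (rows) and v (columns)."""
    o11, o12 = len(u & v), len(u - v)
    o21, o22 = len(v - u), len(universe - u - v)
    c1, c2 = o11 + o21, o12 + o22
    return (o11 * o22 - o12 * o21) / (c1 * c2) if c1 and c2 else 0.0

def mine_rules(module_sets, expr_sets, threshold=0.5):
    # Keep every pair whose absolute Liddell value reaches the threshold.
    strong = {}
    for m in module_sets:
        for e in expr_sets:
            value = liddell(genes_with_all(MODULES, m),
                            genes_with_all(EXPRESSION, e), ALL_GENES)
            if abs(value) >= threshold:
                strong[(m, e)] = value
    # Assumed pruning: prefer smaller module sets and larger expression sets.
    kept = {}
    for (m, e), value in strong.items():
        smaller_module = any(
            set(m2) < set(m) and e2 == e and v2 * value > 0
            for (m2, e2), v2 in strong.items())
        larger_expr = any(
            m2 == m and set(e2) > set(e) and v2 * value > 0
            for (m2, e2), v2 in strong.items())
        if not smaller_module and not larger_expr:
            kept[(m, e)] = value
    return kept

module_sets = [("M1",), ("M2",), ("M3",), ("M1", "M2"), ("M1", "M3")]
expr_sets = [("E1+",), ("E2-",), ("E3-",), ("E2-", "E3-")]
rules = mine_rules(module_sets, expr_sets)
for (m, e), value in rules.items():
    print("".join(m), "->", " ".join(e), value)
```

A negative value corresponds to the negated rule on the slide, so M2 with E2- E3- at -0.5 reads as M2 -> ^(E2- E3-).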
Result on Spermatogenesis

Spermatogenesis is the biological process of sperm formation.
Two gene expression datasets were downloaded from GEO
(Gene Expression Omnibus).

The time course of one dataset covers days 0, 3, 6, 8,
10, 14, 18, 20, 30, 35, and 56. The other covers
days 1, 4, 8, 11, 14, 18, 21, 26, 29, and 60.
[Figure: concordance (vertical axis, 0 to 0.6) plotted against the Liddell value (0.5, 0.6, 0.7, 0.8)]
System Workflow




GEO: Gene Expression Omnibus
DBTSS: DataBase of Transcriptional Start Sites
TRANSFAC: the Transcription Factor database
JASPAR: the high-quality transcription factor binding profile database

[Figure: workflow diagram with the labels GEO, cDNA, Gene IDs, Expression Data, DBTSS, Gene Expression Clustering, Upstream Sequences, Clustered Genes, Motif Discovery, Motifs, K-SPMM, Motif, TRANSFAC, Matrices, JASPAR, Modules, and Correlation Mining of Modules]
Conclusion

Between the two datasets, not only the same module
combinations but also the same genes containing
those module combinations were recovered.

The promoters detected by our approach are
statistically more significant than those obtained
from randomly generated datasets.

Some promoters found by our approach are
confirmed by the literature.
Future work

The concordance between the two gene
expression datasets downloaded from GEO
is low; a new method to reconcile the
differences between the two datasets is
needed.

The number of motifs found by different
algorithms is overwhelming; we may incorporate
the weight matrix and gene ontology to
identify the significant ones.
References

Gene Expression Clustering:

Yi Lu, Shiyong Lu, Farshad Fotouhi, Youping Deng, and Susan Brown, "FGKA: A Fast Genetic
K-means Clustering Algorithm", in Proceedings of the 19th ACM Symposium on Applied
Computing, Nicosia, Cyprus, March 2004.
Yi Lu, Shiyong Lu, Farshad Fotouhi, Youping Deng, and Susan Brown, "Incremental Genetic
K-means Algorithm and its Application in Gene Expression Data Analysis", International
Journal of BMC Bioinformatics, 5(172), October 2004.

Motif Discovery:

Yi Lu, Shiyong Lu, Farshad Fotouhi, Yan Sun, and Zijiang Yang, "PDC: Pattern Discovery with
Confidence in DNA Sequences", in Proceedings of the IASTED International Conference
on Advances in Computer Science and Technology (ACST 2006), Puerto Vallarta, Mexico,
January 2006.

Motif Extraction, Module Integration:

Adrian E. Platts, Yi Lu, and Stephen A. Krawetz, "K-SPMM, an Online System for Data Mining
Regulatory Elements from Murine Spermatogenic Promoter Sequences", presented at the 2006
Great Lakes Mammalian Development Meeting, Toronto, March 3-5, 2006.
Yi Lu, Adrian E. Platts, Charles G. Ostermeier, and Stephen A. Krawetz, "A Database of Murine
Spermatogenic Promoters Modules & Motifs", submitted to Journal of BMC Bioinformatics
for publication.

Correlation Mining:

Yi Lu, Adrian Platts, Shiyong Lu, Jeffrey L. Ram, and Stephen Krawetz, "Correlation Mining to
Reveal the Regulation of Transcription Factor Binding Site Modules", 4th Great Lake
Bioinformatics Retreat, Frankenmuth, Michigan, August 2005.
Yi Lu, Adrian Platts, Shiyong Lu, Jeffrey L. Ram, and Stephen Krawetz, "Mining of Correlation
Between Transcription Binding Sites and Gene Expression Profiles", in preparation.
Acknowledgements
Dr. Shiyong Lu
Dr. Stephen Krawetz
Mr. Adrian Platts
Dr. Jeffrey Ram
Dr. Youping Deng

Questions?