Peano Count Trees and Association Rule Mining
for Gene Expression Profiling using
DNA Microarray Data
Dr. William Perrizo, Willy Valdivia, Dr. Edward Deckard, Francis Larson;
North Dakota State University
{william.perrizo, willy.valdivia, edward.deckard, francis.larson @ndsu.nodak.edu}
Patents pending on bSQ and Ptree technology
The Problem
• There is a lot of data available today (e.g., gene expression data), but too little information.
• Data mining attempts to reduce raw data to information for decision support.
[Figure: raw data (gigs, teras, petas, exas…) is reduced to decisions (often 1 bit: 0/1, Y/N, T/F, Do/Don’t do).]
• Data mining
• Classification (supervised learning)
• Clustering (unsupervised learning)
• Association Rule Mining (ARM)
• Statistics
• Machine Learning
• Data Structuring
• Signal Processing
A Solution?
Currently the predominant method employed in bioinformatics is
clustering (a little classification) on isolated microarray datasets.
• Needed? A data mining software suite able to:
• transform copies of pertinent data from a variety of databases into a
data mining-ready form in real time (our solution, based on P-trees?).
“Transform copies” rather than “standardize”, since standardization rarely
works! There will always be an MS (and I don’t mean Martha Stewart) to
frustrate/destroy the standardization effort.
• facilitate Association Rule Mining, Clustering, and Classification in a
uniform way (so data mining results from other areas can be used)
Bioinformatics: a Walmart or a Kmart?!?
Walmart took DM seriously (early, comprehensive approach
borrowing useful techniques from a variety of application areas)
Kmart? Too little, too late.
Using data mining techniques developed for
other application areas in bioinformatics?
Remotely Sensed Images (RSI) can be viewed as collections
of pixels. Each pixel has a value for each feature attribute
TIFF image
Yield Map
For example, the RSI dataset above has 1320 rows and 1320 columns of pixels (1,742,400
pixels) and 4 feature attributes (Red,Green,Blue,Yield). The (R,G,B) feature bands are in the
TIFF image and the Y feature is color coded in the Yield Map.
Microarray or DNA chip data is not much different (multiple attributes corresponding to
treatments or conditions). Much data mining (ARM) has been done on RSI data.
Can it be useful in bioinformatics?
Regulation Pathway Discovery is not very different
from Market Basket Research (ala Walmart)
The results of clustering microarray data may indicate that genes 1–9 are involved in
a regulation pathway.
High-confidence rule mining on that cluster can discover the relationships among those
genes (e.g., the expression of one gene, Gene2, might be discovered to be regulated by
Genes 1, 3, 5, 6, 8, and 9, while Gene4 and Gene7 may not be directly regulating Gene2 and can
therefore be excluded).
[Figure: the “Clustering” panel lists the groups Gene1; Gene2, Gene3; Gene4, Gene5, Gene6; Gene7, Gene8; Gene9. The “ARM” panel shows Gene1 through Gene9 as individual nodes related by rules.]
ARM for Microarray Data
• A gene regulatory pathway component can be represented as an association
rule, {G1…Gn} → Gm, where {G1…Gn} is the antecedent and Gm is the consequent.
• Microarray data is most often represented as a relation G(Gid, T1…Tn), where
Gid is the gene identifier, T1…Tn are the treatments (or conditions), and the
data values represent gene expression levels. Call this the “Gene Table”.
• Currently, data-mining techniques concentrate on the Gene table - specifically,
on finding clusters of genes that exhibit similar expression patterns under
selected treatments (clustering the gene table).
Gene Table (rows are Gene-IDs, columns are Trmt-IDs, cells hold gene expression values):
Gene-ID   T1    T2    T3    T4
G1        ….    ….    ….    ….
G2        ….    ….    ….    ….
G3        ….    ….    ….    ….
G4        ….    ….    ….    ….
ARM for Microarray Data (Contd.)
• An alternate data format exists, called the “Treatment Table”:
T(Tid, G1, G2, …, Gn), where Tid is the treatment identifier and
G1…Gn are the gene identifiers.
• The Treatment Table provides a convenient form for ARM of gene expression levels.
• The goal is to mine for rules among genes by associating Treatment Table columns.
Treatment Table (rows are TrtmtIDs, columns are GeneIDs, cells hold gene expression values):
TrtmtID   G1    G2    G3    G4
T1        ….    ….    ….    ….
T2        ….    ….    ….    ….
T3        ….    ….    ….    ….
T4        ….    ….    ….    ….
The form of the Treatment Table with binary values (coding only whether an
expression level exceeds or does not_exceed a threshold) is identical to Market
Basket Data, for which a wealth of Rule Mining techniques have been developed
in the last 8 years.
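As a minimal sketch of the thresholding idea above: the expression values, the gene and treatment names, and the cut-off of 1.0 below are all made up for illustration and do not come from any dataset in these slides. Each treatment becomes a "basket" of the genes whose expression exceeds the threshold, and the usual market-basket support and confidence measures apply directly.

```python
# Hypothetical Treatment Table: one row per treatment, one column per gene.
treatment_table = {
    "T1": {"G1": 2.4, "G2": 1.9, "G3": 0.3},
    "T2": {"G1": 2.1, "G2": 2.2, "G3": 1.7},
    "T3": {"G1": 0.2, "G2": 0.4, "G3": 1.8},
    "T4": {"G1": 1.6, "G2": 1.5, "G3": 0.1},
}
THRESHOLD = 1.0  # assumed cut-off; a real study would choose this per gene or per array


def to_baskets(table, threshold):
    """Turn each treatment into the set of genes expressed above the threshold."""
    return [{g for g, v in row.items() if v > threshold} for row in table.values()]


def support(baskets, items):
    """Fraction of baskets (treatments) containing every gene in `items`."""
    items = set(items)
    return sum(items <= b for b in baskets) / len(baskets)


def confidence(baskets, antecedent, consequent):
    """support(antecedent + consequent) / support(antecedent).

    Assumes the antecedent occurs in at least one basket.
    """
    both = set(antecedent) | {consequent}
    return support(baskets, both) / support(baskets, antecedent)


baskets = to_baskets(treatment_table, THRESHOLD)
rule_conf = confidence(baskets, {"G1"}, "G2")  # confidence of the rule {G1} -> G2
```

A rule such as {G1} → G2 would be kept only if both its support and its confidence clear user-chosen minimums.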
Gene Table                 Treatment Table
     T1  T2  T3  T4             G1  G2  G3  G4
G1   …   …   …   …         T1   …   …   …   …
G2   …   …   …   …         T2   …   …   …   …
G3   …   …   …   …         T3   …   …   …   …
G4   …   …   …   …         T4   …   …   …   …
The Gene Table is usually given as a standard (MS Excel)
spreadsheet of gene expression levels coming from
microarray experiments. It is a 2-D data cube which can
be rotated (to the Treatment Table), rolled up, sliced, diced,
drilled down, association rule mined, etc.
What are Peano Trees? First, what are the
Spatial Data Formats?
A 2 x 2 image with two bands (each pixel value is shown with its 8-bit binary form):
BAND-1
254 (1111 1110)   127 (0111 1111)
 14 (0000 1110)   193 (1100 0001)
BAND-2
 37 (0010 0101)   240 (1111 0000)
200 (1100 1000)    19 (0001 0011)
Band SeQuential (BSQ) format (2 files)
Band 1: 254 127 14 193
Band 2: 37 240 200 19
Spatial Data Formats (Cont.)
Band InterLeaved by Line
(BIL)
254 127 37 240
14 193 200 19
Spatial Data Formats (Cont.)
Band Interleaved by Pixel (1 file)
(BIP)
254 37 127 240
14 200 193 19
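The three band layouts can be reproduced from the example image with a few lines of Python. This is only a sketch: the function names are invented here, and the "files" are plain lists rather than anything written to disk.

```python
# The 2 x 2, two-band example image from the slides.
band1 = [[254, 127], [14, 193]]
band2 = [[37, 240], [200, 19]]
bands = [band1, band2]


def bsq(bands):
    """Band SeQuential: one flat sequence (file) per band."""
    return [[v for row in band for v in row] for band in bands]


def bil(bands):
    """Band Interleaved by Line: for each image line, band 1's line, then band 2's."""
    n_rows = len(bands[0])
    return [v for r in range(n_rows) for band in bands for v in band[r]]


def bip(bands):
    """Band Interleaved by Pixel: for each pixel, its band 1 value, then band 2."""
    n_rows, n_cols = len(bands[0]), len(bands[0][0])
    return [band[r][c] for r in range(n_rows) for c in range(n_cols) for band in bands]
```

Calling `bsq(bands)`, `bil(bands)`, and `bip(bands)` yields the same orderings as the BSQ, BIL, and BIP listings on the slides.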
Spatial Data Formats (Cont.)
bit SeQuential (bSQ) format (16 files) (related to bit planes in graphics)
Each column below is one bit file; the four rows are the pixels in the order 254/37, 127/240, 14/200, 193/19.
B11 B12 B13 B14 B15 B16 B17 B18   B21 B22 B23 B24 B25 B26 B27 B28
 1   1   1   1   1   1   1   0     0   0   1   0   0   1   0   1
 0   1   1   1   1   1   1   1     1   1   1   1   0   0   0   0
 0   0   0   0   1   1   1   0     1   1   0   0   1   0   0   0
 1   1   0   0   0   0   0   1     0   0   0   1   0   0   1   1
Reasons for using bSQ format
– Different bits contribute to the value differently.
– bSQ format facilitates representation of a precision hierarchy (1-bit, 2-bit, …, n-bit precision).
– bSQ format facilitates the creation of an efficient P-tree data structure and P-tree algebra.
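A small sketch of the bSQ decomposition and the precision hierarchy, using the two example bands. The helper names `to_bit_files` and `from_bit_files` are hypothetical; bit file j holds bit j of every value, with j = 1 the high-order bit, as in B11 … B18.

```python
def to_bit_files(values, n_bits=8):
    """Split a band into n_bits bit files; file index 0 is the high-order bit."""
    return [[(v >> (n_bits - j)) & 1 for v in values] for j in range(1, n_bits + 1)]


def from_bit_files(bit_files, keep=None):
    """Rebuild values from the first `keep` (high-order) bit files.

    Dropped low-order bits are treated as 0, which gives `keep`-bit precision.
    """
    n_bits = len(bit_files)
    keep = n_bits if keep is None else keep
    values = [0] * len(bit_files[0])
    for j in range(keep):
        weight = 1 << (n_bits - 1 - j)  # the high-order bit carries the largest weight
        values = [v + weight * b for v, b in zip(values, bit_files[j])]
    return values


band1 = [254, 127, 14, 193]
band2 = [37, 240, 200, 19]
B1 = to_bit_files(band1)  # B1[0] is B11, ..., B1[7] is B18
B2 = to_bit_files(band2)  # B2[0] is B21, ..., B2[7] is B28
two_bit = from_bit_files(B1, keep=2)  # band 1 at 2-bit precision
```

Using all eight bit files reproduces the band exactly (the format is lossless), while using only the leading files gives a coarser version of the same data.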
BSQ and bSQ formats
– BSQ and bSQ are “tabular” formats
• BSQ consists of a separate table for each band (e.g., Gene or Treatment)
• bSQ consists of a separate table for each bit of each band
– One can view it this way:
• Data set is initially 1 relation or table, R(K1,..,Kk, A1, A2,…, An), K1,..,Kk are
structure attributes and each Ai is a feature attribute.
– Structure attributes of an RSI are X and Y coordinates (could put the same
structure on the Gene Table, but I want to focus on the Treatment table).
– Structure attributes of the Treatment Table might be a collection of Treatment
dimensions, based on MIAME standard (Minimum info about microarray exp):
http://www.mged.org/Annotations-wg/index.html
» Experimental design
» Array design
» Samples
» Hybridisations
» Measurements
» Normalization Control
A Universal Format?
E.g., One large universal table with 5 dimensions based on MIAME standard?
– E = Experimental design – Hybridisation Procedures
– A = Array design
– S = Samples
– M = Measurements
– N = Normalization Control for data mining across all treatments and genes?
"GREASMN" (5-D Universal Gene Expression Cube)
Rows are treatments, Tid = (E,A,S,M,N); columns are Gene-Rep G1 … Gn; cells hold gene expression values.
Tid             G1    G2    …    Gn
(E,A,S,M,N)1    ….    ….    …    ….
(E,A,S,M,N)2    ….    ….    …    ….
...
(E,A,S,M,N)m    ….    ….    …    ….
Cardinality is high, but compression will be substantial (next slide).
GREASMN datacube rolled up onto (E,S)
[Figure: grid with axis E (Lab…: E1, E2, …, En) and axis S (Organism…: Yeast, S1, S2, …, Sn). A few blocks hold non-zero counts; the rest of the grid is zeros.]
The non-zero blocks may occur off the diagonal.
The Point: Massive but very sparse dataset!
Peano Count Tree (P-tree)
A P-tree represents spatial bSQ data bit by bit in a
recursive quadrant-by-quadrant arrangement.
A P-tree is a lossless, compressed, data-mining-ready representation of the data:
– partially run-length compressed using the structure
attributes;
– “count pre-computed”.
1 1 1 1 1 1 0 0
1 1 1 1 1 0 0 0
1 1 1 1 1 1 0 0
1 1 1 1 1 1 1 0
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
0 1 1 1 1 1 1 1
An example of Peano Count tree
Given a bSQ file, Bij, (shown in spatial positions below) we create its basic PC-tree, Pij as follows.
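11 11 11 00
11 11 10 00
11 11 11 00
11 11 11 10
11 11 11 11
11 11 11 11
11 11 11 11
01 11 11 11
Root count: 55
Quadrant counts (Peano order): 16  8  15  16
Sub-quadrants of the 8:  3 0 4 1, with leaf bits 1110 (under the 3) and 0010 (under the 1)
Sub-quadrants of the 15: 4 4 3 4, with leaf bits 1101 (under the 3)
Terms: Peano or Z-ordering; Level; Pure (Pure-1/Pure-0) quadrant; Fan-out; Root Count; QID (Quadrant ID)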
An example of PC-tree
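Level-3 (root count):        55
Level-2 (quadrants 0 1 2 3): 16  8  15  16
Level-1:                     3 0 4 1 (under the 8)    4 4 3 4 (under the 15)
Level-0 (bits):              1110 and 0010 (under the 3 and the 1 of the 8)    1101 (under the 3 of the 15)
Example: the pixel at (7, 1) = (111, 001) has QID 10.10.11, i.e., 2.2.3: quadrant 2 at Level-2 (the 15), quadrant 2 at Level-1 (the 3), and position 3 at Level-0 (the last bit of 1101).
Terms: Peano or Z-ordering; Level; Pure (Pure-1/Pure-0) quadrant; Fan-out; Root Count; QID (Quadrant ID)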
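The construction can be sketched recursively. This is an illustrative reading of the example, not the authors' implementation: the tuple-based node layout and the function names are assumptions. A quadrant that is pure (all 1s or all 0s) is stored as a count with no children; a mixed quadrant is split into four sub-quadrants in Peano (Z) order.

```python
GRID = [
    [1, 1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 1, 1, 1, 1, 1, 1],
]


def build(grid, r0=0, c0=0, size=None):
    """Return (count, children); children is None for a pure quadrant or a single bit."""
    if size is None:
        size = len(grid)  # assumes a square grid whose side is a power of two
    count = sum(grid[r][c] for r in range(r0, r0 + size) for c in range(c0, c0 + size))
    if count == 0 or count == size * size or size == 1:
        return (count, None)
    half = size // 2
    # Peano (Z) order: upper-left, upper-right, lower-left, lower-right.
    offsets = ((0, 0), (0, half), (half, 0), (half, half))
    return (count, [build(grid, r0 + dr, c0 + dc, half) for dr, dc in offsets])


def qid_of(row, col, n_levels=3):
    """Quadrant ID of a pixel: interleave row and column bits, high-order first."""
    parts = []
    for level in range(n_levels - 1, -1, -1):
        parts.append(f"{(row >> level) & 1}{(col >> level) & 1}")
    return ".".join(parts)


tree = build(GRID)
root_count, quadrants = tree
```

For clarity each node recounts its own quadrant; a production version would instead sum the counts returned by its children.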
Alternative forms for Ptrees (all lossless)
P1: 1 means quadrant is pure-1, 0 otherwise
(a 0 quadrant is pure-0 if it has no sub-tree pointers, otherwise mixed)
root: 0
    00: 1
    01: 0
        00: 0 -> bits 1110
        01: 0
        10: 1
        11: 0 -> bits 0010
    10: 0
        00: 1
        01: 1
        10: 0 -> bits 1101
        11: 1
    11: 1
P0: 1 means quadrant is pure-0, 0 otherwise
root: 0
    00: 0
    01: 0
        00: 0 -> bits 0001
        01: 1
        10: 0
        11: 0 -> bits 1101
    10: 0
        00: 0
        01: 0
        10: 0 -> bits 0010
        11: 0
    11: 0
PNZ (= P0’): 1 means quadrant is Not pure-Zero, 0 otherwise (Note: PM = PNZ XOR P1)
root: 1
    00: 1
    01: 1
        00: 1 -> bits 1110
        01: 0
        10: 1
        11: 1 -> bits 0010
    10: 1
        00: 1
        01: 1
        10: 1 -> bits 1101
        11: 1
    11: 1
Vector forms (a table entry for each mixed inode, containing its qid and its children bit-vector; this eliminates the need for subtree pointers)
P1V (as a table):
qid       vector
[ ]       1001
[01]      0010
[10]      1101
[01.00]   1110
[01.11]   0010
[10.10]   1101
Since there is no qid=[01.01] in the table, we know it’s pure-0, not mixed.
P0V:
qid       vector
[ ]       0000
[01]      0100
[10]      0000
[01.00]   0001
[01.11]   1101
[10.10]   0010
PNZV:
qid       vector
[ ]       1111
[01]      1011
[10]      1111
[01.00]   1110
[01.11]   0010
[10.10]   1101
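The three vector tables can be derived mechanically from the 8 x 8 example. The sketch below is an assumption-laden illustration (the function name and the use of Python dictionaries keyed by qid strings are choices made here): it visits each mixed quadrant and records, for its four children, whether each is pure-1, pure-0, or not pure-0.

```python
GRID = [
    [1, 1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 1, 1, 1, 1, 1, 1],
]


def vector_forms(grid):
    """Return (P1V, P0V, PNZV) as dicts mapping qid -> 4-character child bit-vector.

    Assumes a square grid with a power-of-two side and a mixed root.
    """
    p1v, p0v, pnzv = {}, {}, {}

    def ones(r0, c0, size):
        return sum(grid[r][c] for r in range(r0, r0 + size) for c in range(c0, c0 + size))

    def visit(r0, c0, size, qid):
        half = size // 2
        b1, b0, mixed = "", "", []
        children = (("00", 0, 0), ("01", 0, half), ("10", half, 0), ("11", half, half))
        for label, dr, dc in children:
            n = ones(r0 + dr, c0 + dc, half)
            b1 += "1" if n == half * half else "0"  # child is pure-1
            b0 += "1" if n == 0 else "0"            # child is pure-0
            if 0 < n < half * half:
                mixed.append((label, r0 + dr, c0 + dc))
        key = "[" + (".".join(qid) or " ") + "]"    # the root is written "[ ]"
        p1v[key] = b1
        p0v[key] = b0
        pnzv[key] = "".join("0" if b == "1" else "1" for b in b0)  # PNZ = P0'
        for label, r, c in mixed:                   # only mixed children get an entry
            visit(r, c, half, qid + [label])

    visit(0, 0, len(grid), [])
    return p1v, p0v, pnzv


P1V, P0V, PNZV = vector_forms(GRID)
```

Because pure quadrants never get a table entry, the absence of a qid (such as `[01.01]`) is itself information, which is what removes the need for subtree pointers.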
Basic, Value and Tuple Ptrees
Basic Ptrees
(i.e., P11, P12, …, P18, P21, …, P28, …, P71, …, P78)
are ANDed to give
Value Ptrees
(e.g., P1,001 = P11’ AND P12’ AND P13, where ’ denotes the complement)
which are ANDed to give
Tuple Ptrees
(e.g., P001,010,111 = P1,001 AND P2,010 AND P3,111)
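The logic of the AND step can be shown on plain, uncompressed bit vectors. This is a deliberate simplification: the slides describe ANDing compressed P-trees, whereas the sketch below only demonstrates which bit files are combined and which are complemented. The three 3-bit bands and all function names are invented for the example.

```python
def bit_plane(values, j, n_bits):
    """Bit j of every value, with j = 1 the high-order bit (the basic bit file)."""
    return [(v >> (n_bits - j)) & 1 for v in values]


def value_mask(values, target, n_bits):
    """AND the bit planes, complementing plane j wherever bit j of `target` is 0.

    For target 001 this computes P1' AND P2' AND P3, marking positions equal to 001.
    """
    mask = [1] * len(values)
    for j in range(1, n_bits + 1):
        plane = bit_plane(values, j, n_bits)
        want = (target >> (n_bits - j)) & 1
        mask = [m & (p if want else 1 - p) for m, p in zip(mask, plane)]
    return mask


def tuple_mask(bands, targets, n_bits):
    """AND the value masks of several bands: positions matching the whole tuple."""
    mask = [1] * len(bands[0])
    for values, target in zip(bands, targets):
        mask = [m & v for m, v in zip(mask, value_mask(values, target, n_bits))]
    return mask


# Hypothetical 3-bit bands over eight positions.
band1 = [1, 5, 1, 7, 1, 0, 3, 1]
band2 = [2, 2, 0, 2, 2, 2, 2, 6]
band3 = [7, 7, 7, 7, 3, 7, 7, 7]

v_mask = value_mask(band1, 0b001, 3)
t_mask = tuple_mask([band1, band2, band3], (0b001, 0b010, 0b111), 3)
root_count = sum(t_mask)  # number of positions holding the tuple (001, 010, 111)
```

The sum of the resulting mask plays the role of the root count of the value or tuple Ptree, which is the quantity ARM needs for support counting.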
Distributed Ptrees?
P11:
qid       NZ     P1
[ ]       1111   1001
[01]      1011   0010
[10]      1111   1101
[01.00]          1110
[01.11]          0010
[10.10]          1101
P12:
qid       NZ     P1
[ ]       1010   1000
[10]      1111   1110
[10.11]          0111
P13:
qid       NZ     P1
[ ]       0111   0001
[01]      1111   1110
[10]      1110   0110
[01.11]          0110
[10.00]          1000
Assume a 5-computer cluster: NodeC, Node00, Node01, Node10, Node11.
Send to Nodeij if qid ends in ij (the root entries, qid [ ], are held at NodeC). Bp is the bit position.
NodeC:
Bp  qid       NZ     P1
11  [ ]       1111   1001
12  [ ]       1010   1000
13  [ ]       0111   0001
Node00:
Bp  qid       NZ     P1
11  [01.00]          1110
13  [10.00]          1000
Node01:
Bp  qid       NZ     P1
11  [01]      1011   0010
13  [01]      1111   1110
Node10:
Bp  qid       NZ     P1
11  [10]      1111   1101
11  [10.10]          1101
12  [10]      1111   1110
13  [10]      1110   0110
Node11:
Bp  qid       NZ     P1
11  [01.11]          0010
12  [10.11]          0111
13  [01.11]          0110
A data mining request involves a series of multicast invocations and at most one unicast reply for each
receiving node.
A distributed Genomic data mining federation of Beowulf clusters? Each node computes only a tiny
portion of the necessary count information then sends to the requesting node?
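The routing rule ("send to Nodeij if qid ends in ij") is easy to express in code. The sketch below is only an illustration of that rule: the entry tuples restate the P11, P12, and P13 vectors from the slide, the function names are made up, and sending root entries to NodeC is read off the slide's example rather than stated as a general rule.

```python
# (bit position, qid, NZ vector or None for leaf-level entries, P1 vector)
ENTRIES = [
    ("11", "", "1111", "1001"),
    ("11", "01", "1011", "0010"),
    ("11", "10", "1111", "1101"),
    ("11", "01.00", None, "1110"),
    ("11", "01.11", None, "0010"),
    ("11", "10.10", None, "1101"),
    ("12", "", "1010", "1000"),
    ("12", "10", "1111", "1110"),
    ("12", "10.11", None, "0111"),
    ("13", "", "0111", "0001"),
    ("13", "01", "1111", "1110"),
    ("13", "10", "1110", "0110"),
    ("13", "01.11", None, "0110"),
    ("13", "10.00", None, "1000"),
]


def node_for(qid):
    """Root entries stay on NodeC; others go to the node named by the qid's last part."""
    return "NodeC" if qid == "" else "Node" + qid.split(".")[-1]


def distribute(entries):
    """Group the vector-table entries by the cluster node that should hold them."""
    nodes = {}
    for entry in entries:
        nodes.setdefault(node_for(entry[1]), []).append(entry)
    return nodes


nodes = distribute(ENTRIES)
```

With this striping, a count request touches each node only for the entries it holds, which is what allows one multicast request and at most one reply per node.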
Non-ARM Ptree-based Microarray data mining methods
[Figure: taxonomy of methods, with a small dendrogram over items 1 to 8.]
• Hierarchical Clustering: Agglomerative, Divisive
• Non-Hierarchical Clustering: K-clustering, SOM
• Supervised Learning or Classification: SVM, Decision Trees, KNN
• Also shown: PCA
bSQ format: Bit files of intervalized, normalized
red/green ratios for each microarray.
Ptree format: One P-tree for each bit position of
each bSQ file (e.g., the high-order bit):
depth=0, level=3:   55
depth=1, level=2:   16   8   15   16
depth=2, level=1:   3 0 4 1 (under the 8)    4 4 3 4 (under the 15)
depth=3, level=0:   1110   0010   1101
A plan
[Figure: architecture diagram. Temporal, Spatial, and Genotypic Gene Exp. Analysis; a Data Repository held as bSQ files and Ptrees; Development of Data Mining Tools; a User JAVA Graphical Interface; SQL and XML links to Other Microarray Data Repositories (Stanford, EMBL, SGDB).]
Data Mining in Genomics: Conclusion
• Data mining problems in application areas with huge raw data stores, such as
Market Basket Research, Remotely Sensed Imagery, and Genomics
(Proteomics?, Transcriptomics, Metabolomics?), are remarkably similar
in terms of data and data mining needs.
• There should be more collaboration across applications.
• In these application areas, data cube rotation can open data mining possibilities.
• We suggest a universal data structure (GREASMN Table and P-trees)
• striped across a wide federation of computer nodes,
• using P-tree technology to facilitate data mining, and
• eliminating barriers introduced by scale limitations, incompatible
data formats, etc.