ACM SAC 2005 – Santa Fe, New Mexico
Incremental Interactive Mining of
Constrained Association Rules from
Biological Annotation Data
Imad Rahal, Dongmei Ren, Amal Perera, Hassan
Najadat and William Perrizo North Dakota State
University, USA
Riad Rahhal, University of Iowa, USA
Willy Valdivia, Orion Integrated Biosciences, USA

High-throughput techniques are producing massive quantities of
bioinformatics data.

Consequently, there is a need for analysis methodologies that scale to larger
and larger datasets.

In this paper we use association rule mining (ARM) to discover
relationships in Saccharomyces cerevisiae (yeast) genomic data.

ARM was first proposed for Market Basket Research (MBR).

ARM comes into its own when much of the data is categorical or
where there are a very large number of dimensions.

However, ARM has been noted for producing a large number of rules,
which can overwhelm researchers

Frequent itemset mining (1st step in ARM) also provides indexing for
attributes that appear often, for faster access to information.

We propose a new ARM technique which

Optimizes the rule-discovery process by giving biologists the
flexibility of incorporating their knowledge into it,

Reduces the overwhelming number of rules that match the
specified minimum support and confidence thresholds,

Operates in an incremental and interactive mode:
Interactive mining: allows new queries to be posed from old ones
Incremental mining: uses previous results to answer new queries

Stores and processes data vertically
Data Representation



Data used was extracted mostly from the MIPS database (Munich Information center for Protein Sequences).
The left column shows all considered features (feature groups).
The right column shows the number of distinct feature values in the extent domain of each feature.

Feature          Total Values
pathway          80
EC               622
complexes        316
function         259
localization     43
protein class    191
phenotype        181
interactions     6347
Data Representation

We built a Binary gene-by-feature table.

For a categorical feature, we consider each category as a separate
attribute or column by bit-mapping it.


For numeric attributes and hierarchical categorical attributes, we used a bit
vector for each bit position or hierarchy level (reducing the number of bit
vectors to about log(n) for n distinct values).

The resulting table has a total of 8039 distinct feature bit vectors
(corresponding to "items" in MBR) for 6374 yeast genes (corresponding to
transactions in MBR).

For processing and storage optimization, we use Predicate tree (P-tree)
patent-pending technology to vertically store and process the resulting bit
vectors.
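The bit-mapping of categorical features described above can be sketched as follows. This is only an illustration, not the authors' implementation: the gene identifiers, the annotation triples, and the helper name `build_bit_vectors` are hypothetical, and each vertical bit vector is held in a plain Python integer rather than a compressed P-tree.

```python
from collections import defaultdict

def build_bit_vectors(annotations, genes):
    """Return {(feature, value): bitmask}; bit i is set if genes[i] has that value."""
    index = {gene: i for i, gene in enumerate(genes)}
    vectors = defaultdict(int)
    for gene, feature, value in annotations:
        # One column (bit vector) per distinct feature value.
        vectors[(feature, value)] |= 1 << index[gene]
    return dict(vectors)

# Hypothetical toy data: three genes, two features.
genes = ["gene_a", "gene_b", "gene_c"]
annotations = [
    ("gene_a", "localization", "nucleus"),
    ("gene_b", "localization", "cytoplasm"),
    ("gene_c", "localization", "cytoplasm"),
    ("gene_c", "function", "metabolism"),
]
vectors = build_bit_vectors(annotations, genes)
for key, mask in sorted(vectors.items()):
    print(key, format(mask, "03b"))
```

Each key plays the role of an "item" and each bit position the role of a "transaction" (gene) in the market-basket analogy.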
Current practice: structure data into horizontal records, then process vertically (scans).

Predicate tree technology: vertically project each attribute,
then vertically project each bit position of each attribute,
then compress each bit slice into a basic Ptree.
Horizontally AND basic Ptrees: P11 P12 P13 P21 P22 P23 P31 P32 P33 P41 P42 P43

Example relation R(A1 A2 A3 A4): eight horizontally structured records with 3-bit attribute values. The slide shows the relation in base 10 and base 2; the base 2 form, from which the bit slices are taken, is:

    R[A1]  R[A2]  R[A3]  R[A4]
    010    111    110    001
    011    111    110    000
    010    110    101    001
    010    111    101    111
    101    010    001    100
    010    010    001    101
    111    000    001    100
    111    000    001    100

Scanned vertically, this gives the bit slices (Rij is bit position j of attribute Ai):

    R11 = 0 0 0 0 1 0 1 1    R12 = 1 1 1 1 0 1 1 1    R13 = 0 1 0 0 1 0 1 1
    R21 = 1 1 1 1 0 0 0 0    R22 = 1 1 1 1 1 1 0 0    R23 = 1 1 0 1 0 0 0 0
    R31 = 1 1 1 1 0 0 0 0    R32 = 1 1 0 0 0 0 0 0    R33 = 0 0 1 1 1 1 1 1
    R41 = 0 0 0 1 1 1 1 1    R42 = 0 0 0 1 0 0 0 0    R43 = 1 0 1 1 0 1 0 0

Top-down construction of the 1-dimensional Ptree representation of R11, denoted P11, is built by recording the truth of the universal predicate "pure 1" in a tree recursively on halves (1/2^1 subsets), until purity is achieved (pure1? true = 1, false = 0).
e.g., compression of R11 into P11 goes as follows:
1. Whole is pure1? false → 0
2. Left half pure1? false → 0. But it is pure (pure0), so this branch ends.
3. Right half pure1? false → 0
4. Left half of right half? false → 0
5. Right half of right half? true → 1
6. Left half of left of right? true → 1
7. Right half of left of right? false → 0, and it is pure0, so the branch ends.

    P11:        0
              /   \
             0     0
                  / \
                 0   1
                / \
               1   0

For categorical attributes, a bitmap is formed for each category, then compressed into a P-tree.

Top-down construction of basic P-trees is best for understanding, but bottom-up is much more efficient.
Bottom-up construction of the 1-dimensional P11 is done using in-order tree traversal and the collapsing of pure siblings. For R11 = 0 0 0 0 1 0 1 1, the sibling pairs 00 and 00 collapse into one pure0 node, 10 stays mixed, and 11 collapses into a pure1 node, giving the same P11 as above.

2-Dimensional Ptrees: bottom-up construction of the Ptree using 2-dimensional Peano order (e.g., the natural dimension choice for images).
Example bit-file (e.g., hi-order bit of Green band): 1111110011111000111111001111111011110000111100001111000001110000
Which, in spatial raster order, is:

    1 1 1 1 1 1 0 0
    1 1 1 1 1 0 0 0
    1 1 1 1 1 1 0 0
    1 1 1 1 1 1 1 0
    1 1 1 1 0 0 0 0
    1 1 1 1 0 0 0 0
    1 1 1 1 0 0 0 0
    0 1 1 1 0 0 0 0

Ptree using 2-Dim Peano order (quadrants taken in the order upper-left, upper-right, lower-left, lower-right):

    root:     0
    level 1:  1 0 0 0
    level 2:  0 0 1 0 (under the 2nd quadrant)   1 1 0 1 (under the 3rd quadrant)
    level 3:  1 1 1 0   0 0 1 0   1 1 0 1
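The 1-dimensional construction can be made concrete with a small sketch. It builds a pure-1 tree top-down for the R11 bit slice (0 0 0 0 1 0 1 1), ANDs two trees, and counts the 1-bits at the root. The tuple-based node layout and the function names are assumptions made for illustration; this is not the P-tree API used by the authors.

```python
def build_ptree(bits):
    """Top-down pure-1 tree.
    A node is (1, None) if pure1, (0, None) if pure0,
    or (0, (left, right)) if the segment is mixed."""
    if all(bits):
        return (1, None)
    if not any(bits):
        return (0, None)
    mid = len(bits) // 2
    return (0, (build_ptree(bits[:mid]), build_ptree(bits[mid:])))

def ptree_and(a, b):
    """AND two trees built over bit slices of the same length."""
    (va, ca), (vb, cb) = a, b
    if ca is None:                      # a is pure: result is b or pure0
        return b if va == 1 else (0, None)
    if cb is None:                      # b is pure: result is a or pure0
        return a if vb == 1 else (0, None)
    left = ptree_and(ca[0], cb[0])
    right = ptree_and(ca[1], cb[1])
    if left == (0, None) and right == (0, None):
        return (0, None)                # collapse pure0 siblings
    return (0, (left, right))

def root_count(node, size):
    """Number of 1-bits represented by a tree covering `size` positions."""
    value, children = node
    if children is None:
        return size if value == 1 else 0
    half = size // 2
    return root_count(children[0], half) + root_count(children[1], size - half)

R11 = [0, 0, 0, 0, 1, 0, 1, 1]
R13 = [0, 1, 0, 0, 1, 0, 1, 1]
R21 = [1, 1, 1, 1, 0, 0, 0, 0]
P11, P13, P21 = build_ptree(R11), build_ptree(R13), build_ptree(R21)
print(P11)
print(root_count(ptree_and(P11, P13), 8), root_count(ptree_and(P11, P21), 8))
```

The AND never has to descend below a pure node, which is where the compression pays off during processing.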
Mining The Yeast Genome

A scientist interested in investigating the effect of one subset of the
features over another, such as the effect of phenotype on function, would

First mine the frequent itemsets from the phenotype and function feature
values separately (producing two independent sets of frequent itemsets)

Then perform a join on the two sets of frequent itemsets and produce a new set
containing all frequent itemsets combining the two features

We assume the antecedent is to come from one feature set and the
consequent from the other; thus, each frequent itemset will produce at
most one rule (if the confidence of that rule is high enough).
All subsequent queries that include phenotype and/or function would
benefit from the frequent itemset mining already done.
The Mining Algorithm

Input:
- Rule query
- minisupp and miniconf
Step 1: Mining of frequent itemsets (FISs) from Individual Features
- For each relevant feature F, mine all frequent itemsets from the F-values separately
- Using P-trees: the support of an itemset containing items F1 and F2 is
obtained by ANDing their P-trees, P_F1 AND P_F2, and performing the
ROOTCOUNT operation on the result
- Because the features are treated independently, mining them can be
done in parallel
Step 2: Joining of Feature FISs
- After separately mining all the frequent itemsets from the
items of all selected features, we perform a join step
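Step 1 can be sketched with plain integers standing in for P-trees: ANDing two bitmasks plays the role of ANDing P-trees, and counting set bits plays the role of ROOTCOUNT. The level-wise search below is a generic stand-in, not the authors' P-tree ARM procedure, and the item names and bitmasks are invented toy data.

```python
from itertools import combinations

def root_count(mask):
    """Number of genes (set bits) covered by a bitmask."""
    return bin(mask).count("1")

def mine_feature(vectors, n_genes, minsupp):
    """Level-wise frequent itemset mining over the items of one feature.
    vectors: {item: bitmask}. Returns {frozenset(items): bitmask}."""
    min_count = minsupp * n_genes
    frequent = {frozenset([item]): mask
                for item, mask in vectors.items()
                if root_count(mask) >= min_count}
    level = dict(frequent)
    while level:
        nxt = {}
        for (a, mask_a), (b, mask_b) in combinations(level.items(), 2):
            candidate = a | b
            # Only combine itemsets that differ in exactly one item.
            if len(candidate) == len(a) + 1 and candidate not in nxt:
                mask = mask_a & mask_b      # AND of all items in the candidate
                if root_count(mask) >= min_count:
                    nxt[candidate] = mask
        frequent.update(nxt)
        level = nxt
    return frequent

# Hypothetical phenotype items over six genes.
phenotype = {"p1": 0b111100, "p2": 0b011110, "p3": 0b000011}
fis = mine_feature(phenotype, n_genes=6, minsupp=0.5)
for itemset, mask in fis.items():
    print(sorted(itemset), root_count(mask))
```

Because each feature is mined from its own bit vectors only, calls to `mine_feature` for different features are independent and could be run in parallel.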


The Mining Algorithm
Exploits the downward closure property of support with respect to
itemset size:

any itemset must have support greater than or equal to the
support of any of its supersets, and thus no itemset can be
frequent unless all of its subsets are also frequent
E.g., phenotype → function: if the join of two frequent itemsets
I_phenotype and I_function is a non-frequent itemset, then there is no
need to join I_phenotype or any of its supersets with I_function or any
of its supersets
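A minimal sketch of the join step with this pruning rule follows. It assumes each feature's frequent itemsets are available as a dictionary from itemset to bitmask (integers standing in for P-trees); the function name, the `and_ops` counter, and the toy data are illustrative assumptions.

```python
def root_count(mask):
    return bin(mask).count("1")

def join_features(fis_a, fis_b, min_count):
    """Join two features' frequent itemsets, skipping pairs that are
    supersets of a pair already found to be infrequent."""
    failed = []      # (Ia, Ib) pairs whose join was infrequent
    joined = {}
    and_ops = 0      # number of AND operations actually performed
    # Smaller itemsets first, so a failed pair is seen before its supersets.
    for ia in sorted(fis_a, key=len):
        for ib in sorted(fis_b, key=len):
            if any(fa <= ia and fb <= ib for fa, fb in failed):
                continue  # pruned by downward closure
            and_ops += 1
            mask = fis_a[ia] & fis_b[ib]
            if root_count(mask) >= min_count:
                joined[(ia, ib)] = mask
            else:
                failed.append((ia, ib))
    return joined, and_ops

# Hypothetical frequent itemsets over six genes.
P1, P2, F1, F2 = (frozenset([x]) for x in ("p1", "p2", "f1", "f2"))
fis_phenotype = {P1: 0b111100, P2: 0b011110, P1 | P2: 0b011100}
fis_function = {F1: 0b001111, F2: 0b110000}
joined, and_ops = join_features(fis_phenotype, fis_function, min_count=2)
print(len(joined), and_ops)
```

In this toy run the join of {p2} with {f2} is infrequent, so the pair ({p1, p2}, {f2}) is never ANDed.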
The Mining Algorithm


Step 3: Producing Strong Rules
- No enumeration of the different rules that could be derived from a frequent
itemset is needed (the second step in traditional ARM)
- Note: computing the confidence of a rule is also efficient using P-trees:
the confidence of a rule A → C is ROOTCOUNT(P_AC) / ROOTCOUNT(P_A), where
P_AC is the AND of the P-trees of the items in A and C
Step 4: After the user examines the returned rules, s/he often wishes to
issue a related but slightly different query.
- This can be viewed as the start of the interactive mode
- Such new queries typically involve features that have already been
included in a previous query.
- Our approach incrementally builds on the results obtained so far
to answer the new query
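The confidence computation of Step 3 can be sketched as a ratio of two bit counts. Integers again stand in for P-trees, and the itemsets, bitmasks, and the name `strong_rules` are hypothetical.

```python
def root_count(mask):
    return bin(mask).count("1")

def strong_rules(joined, antecedent_fis, min_conf):
    """joined: {(A, C): bitmask of A and C together};
    antecedent_fis: {A: bitmask of A}.
    Each joined itemset yields at most one rule A -> C."""
    rules = []
    for (ante, cons), mask in joined.items():
        confidence = root_count(mask) / root_count(antecedent_fis[ante])
        if confidence >= min_conf:
            rules.append((ante, cons, confidence))
    return rules

# Hypothetical toy masks over six genes.
P2, P12, F1 = frozenset(["p2"]), frozenset(["p1", "p2"]), frozenset(["f1"])
antecedent_fis = {P2: 0b011110, P12: 0b011100}
joined = {(P2, F1): 0b001110, (P12, F1): 0b001100}
rules = strong_rules(joined, antecedent_fis, min_conf=0.7)
for ante, cons, conf in rules:
    print(sorted(ante), "->", sorted(cons), conf)
```

Since the antecedent and consequent sides are fixed by the query, no enumeration over the subsets of each itemset is needed.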
The Mining Algorithm


For example, suppose that the user submits "localization → function"
after "phenotype → function"; all that needs to be done is to mine
frequent itemsets from localization and join them with function.
If a new query, "localization, phenotype → function", is submitted, we
reuse all the frequent itemsets from the first request and join them
with those derived from localization.
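The reuse across queries can be illustrated with a small cache keyed by feature name. This is a sketch under simplifying assumptions: the class name, the stub mining function, and the choice to cache only per-feature itemsets are all hypothetical (join results could be cached in the same way).

```python
class IncrementalMiner:
    """Caches per-feature frequent itemsets so later queries reuse them."""

    def __init__(self, mine_feature):
        self._mine = mine_feature   # callable: feature name -> frequent itemsets
        self._cache = {}
        self.mined = []             # features that actually had to be mined

    def itemsets(self, feature):
        if feature not in self._cache:
            self._cache[feature] = self._mine(feature)
            self.mined.append(feature)
        return self._cache[feature]

    def query(self, antecedent_features, consequent_feature):
        """Return the itemsets needed to answer antecedents -> consequent."""
        antecedents = [self.itemsets(f) for f in antecedent_features]
        return antecedents, self.itemsets(consequent_feature)

# Stub standing in for the real per-feature mining step.
miner = IncrementalMiner(lambda feature: {"feature": feature})
miner.query(["phenotype"], "function")
miner.query(["localization"], "function")                # only localization is mined
miner.query(["localization", "phenotype"], "function")   # nothing new is mined
print(miner.mined)
```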
Algorithmic Details

For the generation of FISs, we utilize a previous P-tree ARM approach
[Rahal, Denton, Perrizo, JIKM Journal, Dec. 2004] [13] and store them in a
(frequent) Set Enumeration (SE) tree containing all frequent itemsets.

[Figure: a) example (frequent) SE for function; b) example (frequent) SE for
phenotype. Node labels: Ø (root), Cell cycle defects, Stress response defects,
Sensitivity to antibiotics, Metabolism, Transcription, Energy.]
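A set enumeration tree can be sketched as a prefix tree over itemsets written in a fixed item order, with the root standing for the empty set Ø. The class and function names below are assumptions, and the inserted itemsets merely reuse phenotype labels from the figure for illustration; they are not claimed to be the itemsets the figure shows.

```python
class SENode:
    """One node of a set enumeration tree; the root represents the empty set."""

    def __init__(self, item=None):
        self.item = item
        self.children = {}

def insert(root, itemset):
    """Insert an itemset along the path given by its items in sorted order."""
    node = root
    for item in sorted(itemset):
        node = node.children.setdefault(item, SENode(item))
    return node

def enumerate_itemsets(node, prefix=()):
    """Yield every itemset stored in the tree, parents before children."""
    for item in sorted(node.children):
        path = prefix + (item,)
        yield path
        yield from enumerate_itemsets(node.children[item], path)

root = SENode()
for itemset in [
    {"Cell cycle defects"},
    {"Stress response defects"},
    {"Sensitivity to antibiotics"},
    {"Cell cycle defects", "Stress response defects"},
]:
    insert(root, itemset)
stored = list(enumerate_itemsets(root))
for path in stored:
    print(path)
```

In a frequent SE tree every node is itself a frequent itemset, since every prefix of a frequent itemset is frequent by downward closure.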
Experimental Study


Implementations were coded in C++ and executed on an Intel Pentium-4 2.4GHz
processor workstation with 2GB RAM running Redhat Linux 9.0. All implementations use
the P-tree API http://midas.cs.ndsu.nodak.edu/~datasurg/ptree
For our approach, we computed the total time for executing 5, 10, 15, 20
and 25 consecutive inter-related queries.
Each query contains up to 3 features and uses at least one feature from
a previous query.
We compare with the standard approach (mine over all attribute values).
For the standard approach, we only include the time needed to mine the whole
dataset, without the time needed to scan the resulting set of rules for the
subset of interest.
We set the min. conf. threshold to 90% and varied
the min. supp. threshold between 0.05% and 20%.
[Figure: Execution Time. Time (s), 0 to 700, versus Support (%), from 20.0% down to 0.05%; series: 5 queries, 10 queries, 15 queries, 20 queries, 25 queries, Brute Force.]


The figure clearly shows the gain achieved by using our approach
The post-processing approach needs more than 620 seconds at
5.9% support threshold
Frequent Itemsets
[Figure: Number of Frequent Itemsets, 0 to 1,200,000, versus Support (%), from 20.0% down to 0.05%; series: 5 queries, 10 queries, 15 queries, 20 queries, 25 queries, Brute Force.]

Biologists could go to very low support thresholds and mine
frequent itemsets (and eventually rules) that would go undetected
in the post-processing approach
Number of Rules
[Figure: Number of Rules, 0 to 1,000,000, versus Support (%), from 20.0% down to 0.05%; series: 5 queries, 10 queries, 15 queries, 20 queries, 25 queries, Brute Force.]


The brute-force approach returned slightly less than a million rules at
support 5.9%, most of which are irrelevant to the queries we've selected.
For our queries, interesting rules started to show up at support ~ 0.5%.
- For high support, mostly uninteresting and evident (trivial) rules appeared
- Here is where our results associated the yeast eIF2B factor
with specific interactions within the cellular complex.


A significant portion of the rules were straightforward in the sense of
providing only common knowledge, e.g., complex=cytoplasmic ribosomal
large subunit → localization=cytoplasm
Of significant interest to our biological collaborators was a set of rules
pertinent to the yeast eukaryotic initiation factor 2B (eIF2B):

"complex = eIF2B (5 ORFs)" → "function = ribosome biogenesis"

eIF2B is a multi-subunit guanine nucleotide exchange factor which catalyzes the
exchange of GDP bound to initiation factor eIF2 for GTP, generating
active eIF2-GTP. In humans, it is composed of five subunits: alpha, beta,
delta, gamma and epsilon.
In yeast, the eIF2B factor mediates the exchange of a series of proteins
bound to translation initiation, the process preceding formation of the
peptide bond between the first two amino acids of a protein.
Specifically, it catalyzes a vital regulatory step in the
initiation of the translation of mRNA.
System
(DataMIME™ data mining, NO NOISE)
http://www.cs.ndsu.nodak.edu/~datamine
[Diagram components: YOUR DATA; YOUR DATA MINING; DII (Data Integration Interface); DIL (Data Integration Language); DMI (Data Mining Interface); PQL (Ptree (Predicates) Query Language); Internet; Data Repository (lossless, compressed, distributed, vertically structured database).]
Conclusion

In this paper, we proposed a computational approach
targeted at the analysis of the yeast genome annotation data


It gives biologists the flexibility of incorporating domain
knowledge, in the form of queries, thus aiding in focusing
their analysis on specific features of interest.
It optimizes the rule-discovery process by allowing operation
in the interactive and incremental modes and enables
- parallel processing
- reuse of mined results
- vertical, efficient storage and processing
Future Directions


Extend the features in our analyzed data such as to include
secondary protein structure information
We also aim to pursue similar analysis over different
genomes such as the human genome

A broader goal is to look for “inter-organism” association rules valid
across organisms rather than on a single organism