Fast Mining Frequent Patterns with Secondary Memory
Kawuu W. Lin
Dept. of Computer Science and Information Engineering
National Kaohsiung University of Applied Sciences
Kaohsiung, Taiwan
[email protected]

Sheng-Hao Chung
Dept. of Industrial Engineering and Management
National Chiao Tung University
Hsinchu, Taiwan
[email protected]

Sheng-Shiung Huang
Dept. of Computer Science and Information Engineering
National Kaohsiung University of Applied Sciences
Kaohsiung, Taiwan
[email protected]

Chun-Cheng Lin
Dept. of Industrial Engineering and Management
National Chiao Tung University
Hsinchu, Taiwan
[email protected]
ABSTRACT
Data mining technology has been widely studied and applied in recent years, and frequent pattern mining is one of its most important subfields, popular not only in academia but also in the business community. As technology advances, databases have grown so large that mining them can become impossible under memory restrictions. In this study, we propose a novel algorithm called Hybrid Mine (H-Mine) to improve this situation. H-Mine writes to disk the part of the mining information that cannot be kept in memory, and by mining with a mix of hard disk and main memory it can complete data mining with limited memory. The results of empirical evaluation under various simulation conditions show that H-Mine delivers excellent performance in terms of execution efficiency and scalability.
CCS Concepts
• Information systems ➝ Information systems applications ➝ Data mining ➝ Association rules
Keywords
Data Mining; Frequent Pattern Mining; Main Memory; Disk
Storage.
1. INTRODUCTION
With technological developments and the ubiquity of computers, human behavior can be recorded as many types of data for analysis. Although these data obey no obvious laws at first glance, data mining techniques let us extract hidden information from them. For example, in recent years popular social networking sites such as Facebook, Google+, and YouTube have used frequent pattern mining over fan pages, applications, club memberships, and so on to understand user preferences; this lets them direct related information to users or attract their attention to specific websites. By mining transaction data, businesses can deduce customer buying habits and then use this information to increase profits. Data mining has been successfully applied in various fields, fostering the ability to detect information in vast datasets; the field can be subdivided into five parts: association rules, classification, clustering, sequential patterns, and time sequences. Here we focus on association rules.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
ASE BD&SI 2015, October 07-09, 2015, Kaohsiung, Taiwan
Copyright is held by the owner/author(s). Publication rights licensed to ACM.
ACM 978-1-4503-3735-9/15/10…$15.00
DOI: http://dx.doi.org/10.1145/2818869.2818933
In 1994, Rakesh Agrawal et al. first proposed the Apriori algorithm [1] for mining association rules. Although Apriori is effective at finding frequent patterns, its main shortcomings are its execution time and the memory required for storing the 2-candidate sets. In 2000, Jiawei Han et al. [2] proposed a novel data structure, the Frequent-Pattern tree (FP-tree), along with the FP-growth mining algorithm to remedy Apriori's shortcomings. This method compresses the data and reduces database scanning so that frequent pattern mining runs more efficiently. With the rapid development of information technology, however, databases have become increasingly large, and the FP-growth method is no longer able to satisfy the efficiency users expect. In recent years, many researchers have proposed improvements, which can be divided into three categories. The first uses multiple computing resources to improve mining efficiency, e.g., QFP-growth [3], TPFP [4], PFP-Tree [5], and FD-Mine [6]. The second reduces the required memory capacity, e.g., FP-growth and CFP-Tree [7]. The third uses the hard disk to store mining information, e.g., Database Projection [8], Aggressive Projection [9], DRFP-Tree [10], and DSP-Tree [11].
Although these recent algorithms show good results, most of them, when memory is exhausted, discard the partially built FP-tree and then carry out subsequent processing; this wastes a great deal of time and information. Our proposed method, Hybrid Mine (H-Mine), instead keeps the unfinished FP-tree when memory runs out and builds the remaining tree nodes on the hard disk to complete the mining. By mixing hard disk and memory mining, the algorithm can handle large data problems even with limited memory.
2. RELATED WORKS
2.1 FP-growth Algorithm
Jiawei Han et al. [2] proposed a tree-based data structure, the Frequent Pattern tree (FP-tree), and a corresponding mining algorithm, Frequent Pattern growth (FP-growth), to mine frequent patterns. This algorithm remedies the shortcomings of Apriori, i.e., excessive numbers of candidates and a long execution time.
The FP-growth algorithm begins by scanning the database; it then creates a header table and calculates the frequency of occurrence of each item. If an individual item's count is less than the minimum support, the item is discarded. The header table records every frequent item, its count, and a reference (called a Link) to that item's first appearance in the FP-tree.
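The first scan and header-table construction just described can be sketched as follows; this is a minimal illustration with our own simplified transaction representation, not the authors' implementation.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the first database scan: count item frequencies, discard items
// below the minimum support count, and order the survivors by frequency.
public class HeaderTableSketch {
    public static void main(String[] args) {
        List<List<String>> db = List.of(
                List.of("a", "b", "c"),
                List.of("a", "b"),
                List.of("a", "d"));
        int minSupportCount = 2; // hypothetical support threshold

        Map<String, Integer> counts = new HashMap<>();
        for (List<String> tx : db)
            for (String item : tx)
                counts.merge(item, 1, Integer::sum); // frequency of each item

        // Header table: frequent items in descending frequency order.
        var headerTable = counts.entrySet().stream()
                .filter(e -> e.getValue() >= minSupportCount) // discard infrequent
                .sorted((x, y) -> y.getValue() - x.getValue())
                .toList();

        System.out.println(headerTable); // → [a=3, b=2]
    }
}
```

In the real FP-tree, each header-table entry additionally carries the Link pointer into the tree; that part is omitted here.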
The FP-growth algorithm finds frequent patterns after the FP-tree has been created. It gradually selects items from the header table and, following each item's Link, builds that item's conditional sub FP-tree. The recursion keeps rebuilding sub FP-trees until a sub-tree becomes a single path or an empty tree, at which point it stops; mining then continues recursively with the next item until every item has been mined, and the frequent patterns are obtained.
2.2 Database Projection Algorithm
Jiawei Han et al. [8] addressed a shortcoming of the FP-growth algorithm [2] that causes big-data mining failures: when the database is large, the FP-tree cannot be fully built in memory and FP-growth cannot run because of insufficient memory. They therefore proposed the Database Projection algorithm. It is based on the FP-growth framework; when confronted with insufficient memory, it reduces (projects) the database and attempts FP-growth again. If mining still fails, it continues to reduce the database until the memory can accommodate the mining and the investigation completes.
The Database Projection approach is as follows: if the FP-tree can be built within the available memory, perform FP-growth to obtain the frequent patterns; otherwise, perform the Database Projection algorithm, which creates a new sub-database for each frequent item and recurses until the investigation is complete.
2.3 CARM Algorithm
Kawuu W. Lin and Der-Jiunn Deng proposed the CARM algorithm [12], a novel parallel computing method for the cloud environment. CARM comprises two algorithms: the high-workability distributed FP-mine (HD-Mine) and the fast-distributed FP-mine (FD-Mine). HD-Mine splits the mined information into small pieces and stores them on computing nodes, which then merge their information to find the frequent patterns. Because this method can be very time consuming, FD-Mine was proposed to speed up execution. FD-Mine is aimed at improving data transmission in distributed computing, since transmission between the various clouds relies on the Internet. To reduce the amount of data transmitted, FD-Mine uses a matrix to retain only the necessary FP-tree node information (Label, Count, and Parent).
Our proposed Hybrid Mine algorithm (H-Mine) is based on FD-Mine and its method of retaining only the necessary FP-tree node information, with a number of improvements: the finished FP-tree can now have its necessary node information saved onto disk, which releases memory and makes the method applicable to mining large databases.
3. PROPOSED METHOD
In this section, we introduce the proposed Hybrid Mine algorithm (H-Mine) and give details of its data structures.
3.1 Memory warning mechanism
Java provides a memory management interface with warning alerts, which we use directly. An alert is sent when memory usage reaches a value pre-set by us. After this alert we store FP-tree node information on the hard disk while still keeping part of the memory space for use in subsequent mining.
3.2 Reserved node mapping disk mechanism
To quickly locate FP-tree node information on disk, we use a directory that records node information in an in-memory table of disk mapping addresses (the mapping table). Each node's initial disk mapping address is -1. When a new node is added to the hard disk, its disk address is recorded in memory and the mapping table is updated.
3.3 Disk information structures for quick search and tree-building
To continue building the FP-tree after the memory warning occurs, we add two files that store node information: nodeToSeek.data and ChildNode.data. The data structure of nodeToSeek.data, which records each node's index together with the count and disk addresses of its children in ChildNode.data, is shown in Figure 1. It stores three columns as a group; when a new node index is created, one new group is appended. The first column uses one space to store the node index. The second column uses one space to store the current number of children. The third column uses n spaces to store the disk addresses in ChildNode.data. If the count in the second column exceeds n, a new file, nodeToSeek2.data, is opened for storage, and so on.
The data structure of ChildNode.data is shown in Figure 2. It records each node's index, label, and next sub-node (ChildNode), again storing three columns as a group; when a new node index is created, one new group is appended, each column using one space to store its information.

Figure 1. The data structure of nodeToSeek.data:
| Index | Count (max: n) | Disk address (ChildNode.data) | ... (repeated per group)

Figure 2. The data structure of ChildNode.data:
| Index | Label | ChildNode | ... (repeated per group)

Figure 3. The data structure of TreeNodeInDisk.data:
| Node | label | count | parent | ... (repeated per group)
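The grouped fixed-width layout of Figures 1 and 2 can be sketched with Java's RandomAccessFile; the field widths, the maximum child count, and the helper names below are our own assumptions for illustration, not the authors' exact on-disk format.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of a nodeToSeek.data-style group: a node index, a child count
// (at most N), and N slots holding disk addresses into ChildNode.data.
public class NodeToSeekSketch {
    static final int N = 4; // assumed maximum child addresses per group

    // Append a fresh group and return its own disk address.
    static long appendGroup(RandomAccessFile f, long index) throws IOException {
        long pos = f.length();
        f.seek(pos);
        f.writeLong(index);                           // node index
        f.writeInt(0);                                // current child count
        for (int i = 0; i < N; i++) f.writeLong(-1L); // empty address slots
        return pos;
    }

    // Record one more child address in the group's next free slot.
    static void addChildAddress(RandomAccessFile f, long groupPos, long childAddr)
            throws IOException {
        f.seek(groupPos + 8);
        int count = f.readInt();                  // current child count
        f.seek(groupPos + 8 + 4 + 8L * count);
        f.writeLong(childAddr);                   // fill the next free slot
        f.seek(groupPos + 8);
        f.writeInt(count + 1);                    // bump the count
    }

    public static void main(String[] args) throws IOException {
        Path p = Files.createTempFile("nodeToSeek", ".data");
        try (RandomAccessFile f = new RandomAccessFile(p.toFile(), "rw")) {
            long g = appendGroup(f, 42);
            addChildAddress(f, g, 1024);
            f.seek(g);
            System.out.println(f.readLong()); // node index → 42
            System.out.println(f.readInt());  // child count → 1
            System.out.println(f.readLong()); // first child address → 1024
        }
        Files.delete(p);
    }
}
```

When the count reaches N, the paper's scheme continues in nodeToSeek2.data; a real implementation would chain overflow groups accordingly.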
3.4 Storing FP-tree nodes in the disk information structure
After the memory warning occurs, we store each node's label, count, and parent data in a TreeNodeInDisk.data file, which records tree nodes in the same way as CARM [12]. As shown in Figure 3, TreeNodeInDisk.data contains each node's label, count, and parent data; it stores four columns as a group, and when a new node is created, one new group is appended.
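Sections 3.2 and 3.4 together amount to an append-only node file plus an in-memory mapping table initialized to -1. The sketch below illustrates that interaction; the disk file is simulated with an in-memory list, and all names are our own.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch: a mapping table from node index to disk address (-1 = not yet on
// disk), updated whenever a (label, count, parent) record is appended to the
// stand-in for TreeNodeInDisk.data.
public class TreeNodeMappingSketch {
    record NodeRecord(String label, int count, long parentAddr) {}

    final long[] mappingTable;                       // node index -> disk address
    final List<NodeRecord> disk = new ArrayList<>(); // stands in for TreeNodeInDisk.data

    TreeNodeMappingSketch(int maxNodes) {
        mappingTable = new long[maxNodes];
        Arrays.fill(mappingTable, -1L); // initial disk mapping address is -1
    }

    long storeNode(int nodeIndex, String label, int count, long parentAddr) {
        long addr = disk.size();        // append-only: address = record position
        disk.add(new NodeRecord(label, count, parentAddr));
        mappingTable[nodeIndex] = addr; // update the mapping table
        return addr;
    }

    public static void main(String[] args) {
        TreeNodeMappingSketch s = new TreeNodeMappingSketch(8);
        System.out.println(s.mappingTable[3]);     // → -1, node 3 not yet on disk
        s.storeNode(3, "a", 5, -1);
        s.storeNode(4, "b", 2, s.mappingTable[3]); // child records its parent's address
        System.out.println(s.mappingTable[3]);     // → 0
        System.out.println(s.disk.get((int) s.mappingTable[4]).parentAddr()); // → 0
    }
}
```

The parent field is what lets mining later walk a disk-resident path upward without the tree being in memory.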
3.5 Link header table in the disk information structure
We add two attributes on disk to support the Link header table when the memory warning occurs. The first, stored in the file Next.data, records the Link disk address, which connects to the node index in TreeNodeInDisk.data. The other attribute, NextsInDiskCOUNT, stores the number of entries in Next.data so that fast searches can be executed. In the FP-growth mining step, the Link disk address has an initial value of zero; when a new Link is created in Next.data, the header table's count in the NextsInDiskCOUNT attribute is updated.
3.6 H-Mine Algorithm
The complete H-Mine algorithm is shown in Figure 4. The input is a transaction database D, a minimum support S, and a percentage of reserved memory space X%. The output is the set of frequent patterns FP. First, we obtain the maximum available memory capacity and calculate the upper memory-warning limit (lines 2 and 3). We then scan the database to produce C1, filter by support to obtain the header table, establish the node mapping table, and initialize the tree (lines 4 to 7). The database is scanned again to build the tree; as each node is added, we check whether memory usage exceeds the warning limit, and if so the information is stored on disk rather than in working memory (lines 8 to 19). After the tree is complete, mining with the FP-growth algorithm reveals the frequent patterns (line 20). Note that if the memory warning occurs during FP-growth mining, the same hybrid memory-and-disk approach is used to store the sub-trees and finish the investigation.
Algorithm: Hybrid-Mine (H-Mine)
Input: Database D, minimum support S, percentage of reserved memory space X%.
Output: Frequent Patterns FP.
1.  Procedure H-Mine(D, S, X%){
2.    AvailableMemory = getMaxOfAvailableMemoryCapacity();
3.    Warning = AvailableMemory * X%;
4.    C1 = ScanDB(D);
5.    HT = getHT(C1, S);
6.    MappingNodeInDisk(HT); // Establish the node mapping table.
7.    Tree = Ø;
8.    For( d = each transaction of D ){
9.      freqitem = filter(d, HT);
10.     For( Node = each item of freqitem ){
11.       If( getCurrentMemory() < Warning ){
12.         Tree = Tree ∪ BuildTreeInMemory(Node);
13.       }Else{
14.         DiskPosition = BuildTreeInMemory(Node, Tree); // Get the disk address of the newly added node.
15.         UpdateMappingNodeInDisk(Node, DiskPosition); // Update the mapping table.
16.         Tree = Tree ∪ BuildTreeInDisk(Node);
17.       }
18.     }
19.   }
20.   FP = FP-growth(Tree, HT);
21.   Return FP;
22. }
Figure 4. The H-Mine algorithm.
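The memory-warning check of line 11 (and Section 3.1) maps naturally onto Java's standard memory management beans. A minimal sketch, assuming the heap reports a defined maximum; the 95% ratio mirrors the experiments, and the method names here are our own:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Sketch of the warning-limit test behind line 11 of Figure 4, using the
// java.lang.management heap usage bean.
public class MemoryWarningSketch {
    static final double X = 0.95; // percentage of reserved memory space

    static long warningLimit() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        long available = heap.getMax(); // maximum available heap capacity
        return (long) (available * X);  // upper warning value
    }

    static boolean memoryBelowWarning() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        return heap.getUsed() < warningLimit(); // true: keep building in memory
    }

    public static void main(String[] args) {
        // Below the warning limit, nodes go to the in-memory FP-tree;
        // above it, node records would be spilled to disk.
        System.out.println(memoryBelowWarning() ? "in-memory" : "spill-to-disk");
    }
}
```

For the push-style alert the paper describes, the same package also offers usage-threshold notifications via MemoryPoolMXBean.setUsageThreshold, avoiding a poll on every node insertion.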
4. EXPERIMENTAL RESULTS
4.1 Experimental Setup
To evaluate the performance of the proposed H-Mine method, we used IBM's Quest synthetic data generator [13] to generate the workload and also experimented with real data. All experiments were run on a PC with Windows 7 Enterprise Service Pack 1, an Intel(R) Core(TM) i7-4790 CPU @ 3.60 GHz, and 1 TB of disk storage. The algorithm was implemented in Java. To verify overall running time, we chose FP-growth [2] and Database Projection (DP) [8] for comparison because both offer solutions for mining large datasets.
4.2 Base dataset by varying support
The experimental data was generated using IBM's Quest Synthetic Data Generator [13] and is based on general transaction-record databases. The average transaction length (T), average frequent itemset length (I), number of items (N), and number of transactions (D) were 20, 10, 10K, and 1000K, respectively; this was the basic dataset for our experiments. We limited memory to 1 GB, set the reserved memory space to 95%, and observed the efficiency of FP-growth, DP, and H-Mine while varying the support. To evaluate the execution times, we lowered the support continually to find the exact threshold at which the out-of-memory error occurs and the FP-tree no longer fits in main memory. We found that a support of 0.5% simulated this scenario, so we set the experimental support range from 0.3% to 0.7% for comparison. The experimental results, summarized in Table 1 and Figure 5, show that H-Mine outperformed the FP-growth and DP algorithms in terms of execution time.
Table 1. Execution time (s) under varying support, dataset T20I10N10KD1000K

Support (%)   FP-growth   DP        H-Mine
0.3           Fail        612.39    433.76
0.4           Fail        417.535   177.262
0.5           39.161      39.161    39.161
0.6           8.761       8.761     8.761
0.7           5.22        5.22      5.22

Figure 5. T20I10N10KD1000K: effect of varying support on execution time (FP-growth vs. DP vs. H-Mine).

4.3 T40I10D100K.data by varying support
The T40I10D100K.data [14] dataset was generated by the IBM Almaden Quest research group. The average transaction length (T), average frequent itemset length (I), and number of transactions (D) were 40, 10, and 100K, respectively. We limited memory to 500 MB, set the reserved memory space to 95%, and observed the efficiency of FP-growth, DP, and H-Mine while varying the support. The experimental results are shown in Table 2 and Figure 6.

Table 2. Execution time (s) under varying support, dataset T40I10D100K.data

Support (%)   FP-growth   DP        H-Mine
0.8           Fail        752.315   286.435
0.9           Fail        734.075   273.337
1             Fail        714.771   265.264
5             42.935      42.935    42.935
7             4.143       4.143     4.143

Figure 6. T40I10D100K.data: effect of varying support on execution time.

4.4 Webdoc.data by varying support
The Webdoc.data [15] dataset was provided by Claudio Lucchese et al. The file size, average transaction length (T), number of items (N), and number of transactions (D) were 1.48 GB, 70 K, 5000 K, and 1600 K, respectively. We limited memory to 1 GB, set the reserved memory space to 95%, and observed the efficiency of FP-growth, DP, and H-Mine while varying the support. The experimental results are summarized in Table 3 and Figure 7.
Table 3. Execution time (s) under varying support, dataset Webdoc.data

Support (%)   FP-growth   DP         H-Mine
21            Fail        8614.025   6686.069
22            Fail        6875.271   4610.803
23            Fail        3826.958   2854.743
24            152.555     152.555    1789.202
25            60.428      60.428     60.428
Figure 7. Webdoc.data: effect of varying support on execution time (FP-growth vs. DP vs. H-Mine).
5. CONCLUSIONS
Identifying the valuable frequent patterns hidden in large databases is a fundamental task in association rule mining. Past research has relied on multiple computing resources, data structure compression, or hard disk processing; however, we observe that these methods will not suffice in the future as the amount of available data continues to grow rapidly. We therefore propose an efficient hybrid method that mixes hard disk and memory mining to solve big-data problems even with limited memory. The experiments show that as dataset size increases rapidly, the execution time of the H-Mine algorithm increases but the curve remains steady. For future work, we intend to further improve the algorithm's efficiency by combining it with cloud computing technology across multiple nodes.
6. ACKNOWLEDGMENTS
Part of this work was supported by the National Science Council of Taiwan, R.O.C., under grant MOST 103-2221-E-151-033.
7. REFERENCES
[1] R. Agrawal and R. Srikant, 1994, "Fast algorithms for mining association rules," Proceedings of the 20th International Conference on Very Large Data Bases, vol. 1215, pp. 487-499.
[2] J. Han, J. Pei, R. Mao, and Y. Yin, 2000, "Mining frequent patterns without candidate generation," Proceedings of the ACM SIGMOD International Conference on Management of Data, pp. 1-12.
[3] Q. Xie, Y. Qiu, and Y. Lan, 2004, "An improved algorithm of mining from FP-tree," Proceedings of the Third International Conference on Machine Learning and Cybernetics, pp. 26-29.
[4] J. Zhou and K.-M. Yu, 2008, "Tidset-based parallel FP-tree algorithm for the frequent pattern mining problem on PC clusters," Advances in Grid and Pervasive Computing, Lecture Notes in Computer Science, vol. 5036, pp. 18-28.
[5] A. Javed and A. Khokhar, 2004, "Frequent pattern mining on message passing multiprocessor systems," Distributed and Parallel Databases, vol. 16, no. 3, pp. 321-334.
[6] K. W. Lin and Y.-C. Lo, 2013, "Efficient algorithms for frequent pattern mining in many-task computing environments," Knowledge-Based Systems, vol. 49, pp. 10-21.
[7] B. Schlegel, R. Gemulla, and W. Lehner, 2011, "Memory-efficient frequent-itemset mining," Proceedings of the 14th International Conference on Extending Database Technology (EDBT '11), pp. 461-472.
[8] J. Han, J. Pei, Y. Yin, and R. Mao, 2004, "Mining frequent patterns without candidate generation: a frequent-pattern tree approach," Data Mining and Knowledge Discovery, vol. 8, no. 1, pp. 53-87.
[9] G. Grahne and J. Zhu, 2004, "Mining frequent itemsets from secondary memory," Proceedings of the International Conference on Data Mining, pp. 91-98.
[10] M. Adnan and R. Alhajj, 2007, "DRFP-tree: disk-resident frequent pattern tree," Springer Science+Business Media, LLC.
[11] A. Cuzzocrea, C. K. Leung, and J. J. Cameron, 2013, "Stream mining of frequent sets with limited memory," Proceedings of the 28th Annual ACM Symposium on Applied Computing (SAC '13), pp. 173-175.
[12] D.-J. Deng and K. W. Lin, 2010, "A novel parallel algorithm for frequent pattern mining with privacy preserved in cloud computing environments," International Journal of Ad Hoc and Ubiquitous Computing, pp. 205-215.
[13] R. Agrawal and R. Srikant, Quest Synthetic Data Generator, IBM Almaden Research Center, San Jose, California.
[14] F. Silvestri, P. Palmerini, R. Perego, and S. Orlando, 2001, "DCI: a hybrid algorithm for frequent set counting," High Performance Computing Lab, ISTI-CNR, and University of Venice.
[15] C. Lucchese, F. Silvestri, G. Tolomei, R. Perego, and S. Orlando, 2011, "Identifying task-based sessions in search engine query logs," Proceedings of the Fourth International Conference on Web Search and Web Data Mining (WSDM 2011), pp. 277-286.