Toward Terabyte Pattern Mining: An Architecture-conscious Solution
PPoPP'07, March 14–17, 2007, San Jose, California, USA.
Copyright © 2007 ACM 978-1-59593-602-8/07/0003.
Gregory Buehrer, The Ohio State University
Srinivasan Parthasarathy, The Ohio State University
Tahsin Kurc, The Ohio State University
Shirish Tatikonda, The Ohio State University
Joel Saltz, The Ohio State University
OUTLINE

Introduction
Challenges
FPGrowth in Serial
Parallel Optimizations
Experiments
Related Work
Conclusion
Introduction

First, many data mining algorithms are computationally complex and often scale non-linearly with the size of the data set (even if the data fits in memory).
Second, when the data sets are large, the algorithms become I/O bound.
Third, the data sets and the metadata structures employed may exceed the disk capacity of a single machine.
A natural, cost-effective solution pursued by several researchers is to parallelize such algorithms on commodity clusters to take advantage of distributed computing power, aggregate memory and disk space, and parallel I/O capabilities.
However, to date none of these approaches has been shown to scale to terabyte-sized data sets.
Moreover, as has recently been shown, many of these algorithms greatly under-utilize modern hardware. Finally, many of them rely on multiple database scans and often transfer subsets of the data around repeatedly.
Introduction (cont.)

Our solution embraces one of the fastest known sequential algorithms (FPGrowth) and extends it to work in a parallel setting, utilizing all available resources efficiently.
In particular, our solution relies on an algorithmic design strategy that is cache-, memory-, disk-, and network-conscious, embodying the comprehensive notion of architecture-conscious data mining.
A highlight of our parallel solution is that we scan the data set only twice, whereas all existing parallel implementations based on the well-known Apriori algorithm require a number of data set scans proportional to the cardinality of the largest frequent itemset.
The sequential algorithm targeted in this work makes use of tree structures to efficiently search and prune the space of itemsets.
We devise a mechanism for efficient serialization and merging of local trees, called "strip-marshaling and merging", to allow each node to make global decisions as to which itemsets are frequent.
Through empirical evaluation, we illustrate the effectiveness of the proposed optimizations.
Challenges

Computational Complexity
  The complexity of itemset mining arises primarily from the size of the itemset lattice, which is exponential in the number of items.

Communication Costs
  The time spent in communication can vary significantly depending on the type of information communicated, and the degree of synchronization required can also vary greatly.
  While this approach minimizes the number of inter-machine communications, it can suffer from high communication overhead when the number of machines and the size of the input data set are large, since the data is distributed.
Challenges (cont.)

I/O Costs
  When mining very large data sets, care should also be taken to reduce I/O overheads.

Load Imbalance
  Another issue in parallel data mining is achieving good computational load balance among the machines in the system.
  Even if the input data set is evenly distributed among machines or replicated on each machine, the computational load of generating candidates may not be evenly distributed.
FPGrowth in Serial

FPGrowth can prune the search space very efficiently and can handle the problem of skewness by projecting the data set during the mining process.
FPGrowth in Serial (cont.)

Generate the header table of the FP-tree: a first scan of the data set counts item supports and orders the frequent items.
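As a minimal illustrative sketch (not the authors' code), the header table can be built with a single pass over an in-memory list of transactions, given a user-chosen minimum support; the item order it produces is the one used when inserting transactions into the FP-tree:

from collections import Counter

def build_header_table(transactions, min_support):
    # First scan: count each item's support and keep only the frequent ones,
    # ordered by descending frequency (ties broken by item label for determinism).
    counts = Counter(item for t in transactions for item in t)
    frequent = {i: c for i, c in counts.items() if c >= min_support}
    order = sorted(frequent, key=lambda i: (-frequent[i], i))
    return frequent, order

def reorder(transaction, frequent, order):
    # Drop infrequent items and sort the rest in header-table order.
    rank = {item: r for r, item in enumerate(order)}
    return sorted((i for i in transaction if i in frequent), key=rank.__getitem__)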
Frequent-Pattern (FP) Growth (cont.)

(Figure: step-by-step FP-tree construction. The ordered frequent transactions <f,c,a,m,p>, <f,c,a,b,m>, <f,b>, <c,b,p>, and <f,c,a,m,p> are read in turn and inserted under the root, incrementing counts along shared prefixes; after the last insertion the main branch holds f:4, c:3, a:3, m:2, p:2, with side branches for b and for the <c,b,p> transaction.)
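A minimal sketch of this insertion step (illustrative Python, not the authors' implementation): each frequency-ordered transaction is walked from the root, reusing an existing child when the item matches and creating a new node otherwise.

class FPNode:
    def __init__(self, item, parent=None):
        self.item = item          # item label; None for the root
        self.count = 0            # transactions passing through this node
        self.parent = parent
        self.children = {}        # item -> FPNode

def insert_transaction(root, ordered_items, header_links):
    # Insert one frequency-ordered transaction into the FP-tree.
    node = root
    for item in ordered_items:
        child = node.children.get(item)
        if child is None:
            child = FPNode(item, parent=node)
            node.children[item] = child
            header_links.setdefault(item, []).append(child)  # header-table node links
        child.count += 1
        node = child

# Building the tree from the five ordered transactions in the figure:
root, links = FPNode(None), {}
for t in [list("fcamp"), list("fcabm"), list("fb"), list("cbp"), list("fcamp")]:
    insert_transaction(root, t, links)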
FP-Growth (cont.)

Item | Prefix paths (conditional pattern base) | Frequent itemsets
p    | <f:2,c:2,a:2,m:2>, <c:1,b:1>            | ({p}:3) ({c,p}:3)
m    | <f:2,c:2,a:2>, <f:1,c:1,a:1,b:1>        | ({m}:3) ({a,m}:3) ({c,m}:3) ({f,m}:3) ({c,a,m}:3) ({f,a,m}:3) ({f,c,m}:3) ({f,c,a,m}:3)
b    | <f:1,c:1,a:1>, <f:1>, <c:1>             | ({b}:3)
a    | <f:3,c:3>                               | ({a}:3) ({c,a}:3) ({f,a}:3) ({f,c,a}:3)
c    | <f:3>                                   | ({c}:4) ({f,c}:3)
f    | (none)                                  | ({f}:4)
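The prefix paths in this table can be read off the tree by following each item's node links up toward the root. A small illustrative sketch, continuing the FPNode class and the 'links' map from the previous sketch:

def conditional_pattern_base(item, header_links):
    # Collect each prefix path above the item's nodes, weighted by that node's count.
    base = []
    for node in header_links.get(item, []):
        path, parent = [], node.parent
        while parent is not None and parent.item is not None:
            path.append(parent.item)
            parent = parent.parent
        if path:
            base.append((list(reversed(path)), node.count))
    return base

# conditional_pattern_base('p', links) yields
# [(['f', 'c', 'a', 'm'], 2), (['c', 'b'], 1)], matching the first row of the table.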
Parallel Optimizations

We target a distributed-memory parallel architecture, where each node has one or more local disks and data exchange among the nodes is done through message passing.

Minimizing Communication Costs
  One of the fundamental challenges in parallel frequent itemset mining is that global knowledge is required to prove an itemset is not frequent.
  Our solution is for each machine to communicate its local FP-tree to all other machines using a ring broadcast protocol.
  To further reduce the volume of communication, instead of communicating the original tree, each processor transfers a concise encoding, a process we term strip marshaling: the tree is represented as an array of integers.
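The exact integer layout used by the paper is not reproduced on this slide; as a hedged illustration, a pre-order traversal that emits (item, count) pairs plus a sentinel when backing up is one encoding that lets the receiver rebuild and merge the tree. This sketch assumes items have been mapped to non-negative integer ids and reuses the FPNode class from above:

END = -1  # sentinel marking the end of a node's children

def strip_marshal(node, out):
    # Pre-order serialization of an FP-tree into a flat integer array.
    for child in node.children.values():
        out.extend((child.item, child.count))
        strip_marshal(child, out)
        out.append(END)
    return out

def merge_strip(root, buf):
    # Merge a received encoding into the local tree by replaying the traversal.
    stack, i = [root], 0
    while i < len(buf):
        if buf[i] == END:
            stack.pop(); i += 1
            continue
        item, count = buf[i], buf[i + 1]
        parent = stack[-1]
        child = parent.children.get(item)
        if child is None:
            child = FPNode(item, parent=parent)
            parent.children[item] = child
        child.count += count
        stack.append(child)
        i += 2

In a ring broadcast, each machine would pass such an array to its neighbor and merge every array it receives into its own tree.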
Parallel Optimizations (cont.)

Pruning Redundant Data
  Prune the local prefix tree of unnecessary data and construct a new FP-tree: to correctly mine an item i, only the nodes on the paths from occurrences of i up to the root are required.
Parallel Optimizations (cont.)

Let I be the set of items assigned to machine M. Consider any i ∈ I, and let ni be a node in the FP-tree whose label is i.
Suppose a particular path P in the tree, of length k, is
  P = (n1, n2, ..., ni, ni+1, ..., nk).
To correctly mine item i, only the portion of P from n1 to ni is required.
Therefore, for every path P in the FP-tree, M may delete all nodes nj occurring after ni.
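A minimal illustrative sketch of this pruning rule (reusing the FPNode class from the earlier sketches): a node is kept only if it lies at or above some occurrence of an item assigned to this machine.

def prune_for_items(node, assigned):
    # Remove every subtree that contains no assigned item.
    # Returns True if this node or some descendant carries an assigned item.
    kept = {}
    for item, child in node.children.items():
        if prune_for_items(child, assigned):
            kept[item] = child
    node.children = kept
    return node.item in assigned or bool(kept)

# Keep only what machine M needs in order to mine, say, items 'm' and 'p':
prune_for_items(root, assigned={'m', 'p'})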
Parallel Optimizations (cont.)

Partitioning the Mining Process
  Tasks are mapped to machines before any mining occurs.
  Intelligent allocation in this fashion can lead to efficient load balancing with low process idle times, in particular with FPGrowth (see the sketch after this slide).

Memory Hierarchy-conscious Mining
  Local tree pruning lowers the main memory requirements. However, the tree may still exceed main memory, in which case mining it can incur page faults. To remedy this issue, we incorporate several optimizations to improve performance:
  Improving the Tree Building Process
  Improving the Mining Process
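The slide does not spell out the allocation policy; as a hedged illustration, one common heuristic is to estimate each item's mining cost (for example by the size of its conditional pattern base) and greedily give the next-heaviest item to the machine with the least load so far:

import heapq

def assign_items(est_cost, num_machines):
    # Longest-processing-time style greedy allocation.
    # est_cost maps each item to an estimated amount of mining work.
    heap = [(0.0, m) for m in range(num_machines)]   # (current load, machine id)
    heapq.heapify(heap)
    assignment = {m: [] for m in range(num_machines)}
    for item in sorted(est_cost, key=est_cost.get, reverse=True):
        load, m = heapq.heappop(heap)
        assignment[m].append(item)
        heapq.heappush(heap, (load + est_cost[item], m))
    return assignment

# e.g. assign_items({'f': 1, 'c': 2, 'a': 4, 'b': 3, 'm': 8, 'p': 5}, num_machines=2)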

Putting It All Together: Architecture-conscious Data Mining
Experiments

Evaluating Storage Reduction (data sets: Kosarak (8), Mushroom (23), Chess (37))
Experiments (cont.)

Evaluating Communication Reduction
  Global T' is the size of the local tree if we maintain full global knowledge at each node.
  Local T is the average size of the local tree if we maintain only local data at each node.
  Local T' is the average size of the local tree if we maintain partial global knowledge at each node.
Experiments (cont.)

Evaluating Weak Scalability
  DD = Data Distribution, IDD = Intelligent Data Distribution, CD = Count Distribution, DFP = Distributed FPGrowth
Experiments (cont.)

Evaluating Strong Scalability
  DD = Data Distribution, DFP = Distributed FPGrowth
  DD and IDD do not scale well on large data sets.
Related Work

Count Distribution (CD)
  In CD, each node scans its local data set to obtain the frequencies of the candidates.
  The frequency information is then shared among the processors to generate a global frequency array.
  Once the global frequencies are obtained, each processor generates a new set of candidates for the next level.
  The algorithm then repeats, as each node again scans the data set to obtain frequencies for the new candidate list (a sketch of one such level follows below).
  CD scales poorly as the number of candidates increases, since the full list is generated by every node.
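A hedged sketch of one CD level in plain Python (a simple in-process sum over per-node count dictionaries stands in for the message-passing exchange; Apriori candidate generation itself is not shown):

from collections import Counter

def local_counts(local_transactions, candidates):
    # Each node scans only its own partition but counts every global candidate.
    counts = Counter()
    for t in local_transactions:
        items = set(t)
        for cand in candidates:             # candidates are tuples of items
            if set(cand) <= items:
                counts[cand] += 1
    return counts

def count_distribution_level(partitions, candidates, min_support):
    # One CD level: local counting on every node, then a global reduction of the counts.
    global_counts = Counter()
    for part in partitions:                 # in CD this summation is the all-reduce step
        global_counts.update(local_counts(part, candidates))
    return {c for c in candidates if global_counts[c] >= min_support}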
Related Work (cont.)

Data Distribution (DD)
  Data Distribution distributes the data set across the nodes.
  At the start of each level, each node takes a turn broadcasting its local data set to every other node.
  Each machine generates a mutually disjoint partition of the candidates.
  At each level, frequent itemsets are communicated so that each processor can asynchronously generate its share of the candidates for the next level.
  However, DD incurs very high communication overhead when broadcasting the entire data set, which is especially costly for large data sets.
Related Work (cont.)

Intelligent Data Distribution (IDD) & Hybrid Distribution (HD)
  Intelligent Data Distribution (IDD) reduces the communication overhead by distributing the data using a ring-based all-to-all broadcast algorithm.
  IDD still suffers from the high overhead of transferring the entire data set at each level.
  To address this problem while minimizing communication and computation overheads, Hybrid Distribution (HD) combines the advantages of the CD and IDD algorithms by dynamically grouping processors and partitioning the candidate set accordingly to maintain good load balance.
  Each group is large enough to hold the candidate list in main memory and executes the IDD algorithm.
  Communication between groups is then performed as in CD.
Conclusion

This paper shows how frequent itemsets can be mined from very large data sets on a parallel architecture.
On the other hand, the paper's organization makes it somewhat difficult for the reader to follow.
Thank You