Toward Terabyte Pattern Mining: An Architecture-conscious Solution
PPoPP'07, March 14–17, 2007, San Jose, California, USA.
Copyright © 2007 ACM 978-1-59593-602-8/07/0003.
Gregory Buehrer, The Ohio State University
Srinivasan Parthasarathy, The Ohio State University
Tahsin Kurc, The Ohio State University
Shirish Tatikonda, The Ohio State University
Joel Saltz, The Ohio State University
OUTLINE

Introduction
Challenges
FPGrowth in Serial
Parallel Optimizations
Experiments
Related Work
Conclusion
Introduction

First, many data mining algorithms are computationally complex and often scale non-linearly with the size of the data set (even if the data fits in memory).
Second, when the data sets are large, the algorithms become I/O bound.
Third, the data sets and the metadata structures employed may exceed the disk capacity of a single machine.
A natural, cost-effective solution pursued by several researchers is to parallelize such algorithms on commodity clusters to take advantage of distributed computing power, aggregate memory and disk space, and parallel I/O capabilities.
However, to date none of these approaches has been shown to scale to terabyte-sized data sets.
Moreover, as has recently been shown, many of these algorithms greatly under-utilize modern hardware. Finally, many of them rely on multiple database scans and often transfer subsets of the data around repeatedly.
Introduction (cont.)

Our solution embraces one of the fastest known sequential algorithms (FPGrowth) and extends it to work in a parallel setting, utilizing all available resources efficiently.
In particular, our solution relies on an algorithmic design strategy that is cache-, memory-, disk-, and network-conscious, embodying the comprehensive notion of architecture-conscious data mining.
A highlight of our parallel solution is that we scan the data set only twice, whereas all existing parallel implementations based on the well-known Apriori algorithm require a number of data set scans proportional to the cardinality of the largest frequent itemset.
The sequential algorithm targeted in this work makes use of tree structures to efficiently search and prune the space of itemsets.
We devise a mechanism for efficient serialization and merging of local trees, called "strip-marshaling and merging", to allow each node to make global decisions as to which itemsets are frequent.
Through empirical evaluation, we illustrate the effectiveness of the proposed optimizations.
Challenges

Computational Complexity
  The complexity of itemset mining arises primarily from the size of the itemset lattice, which is exponential in the number of items.

Communication Costs
  The time spent in communication can vary significantly depending on the type of information communicated, and the degree of synchronization required can also vary greatly.
  While this approach minimizes the number of inter-machine communications, it can suffer from high communication overhead when the number of machines and the size of the input data set are large, since the data is distributed.
Challenges (cont.)

I/O Costs
  When mining very large data sets, care should also be taken to reduce I/O overheads.

Load Imbalance
  Another issue in parallel data mining is achieving good computational load balance among the machines in the system.
  Even if the input data set is evenly distributed among machines or replicated on each machine, the computational load of generating candidates may not be evenly distributed.
FPGrowth in Serial

FPGrowth can prune the search space very efficiently and can handle the problem of skewness by projecting the data set during the mining process.
FPGrowth in Serial (cont.)

Generate the header table of the FP-tree: a first scan of the data set counts item supports and orders the frequent items.
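As a minimal illustrative sketch (not the authors' code), the header table can be built with a single pass over an in-memory list of transactions, given a user-chosen minimum support; the item order it produces is the one used when inserting transactions into the FP-tree:

from collections import Counter

def build_header_table(transactions, min_support):
    # First scan: count each item's support and keep only the frequent ones,
    # ordered by descending frequency (ties broken by item label for determinism).
    counts = Counter(item for t in transactions for item in t)
    frequent = {i: c for i, c in counts.items() if c >= min_support}
    order = sorted(frequent, key=lambda i: (-frequent[i], i))
    return frequent, order

def reorder(transaction, frequent, order):
    # Drop infrequent items and sort the rest in header-table order.
    rank = {item: r for r, item in enumerate(order)}
    return sorted((i for i in transaction if i in frequent), key=rank.__getitem__)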
Frequent-Pattern (FP) Growth (cont.)

(Figure: step-by-step FP-tree construction. The ordered frequent transactions <f,c,a,m,p>, <f,c,a,b,m>, <f,b>, <c,b,p>, and <f,c,a,m,p> are read in turn and inserted under the root, incrementing counts along shared prefixes; after the last insertion the main branch holds f:4, c:3, a:3, m:2, p:2, with side branches for b and for the <c,b,p> transaction.)
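A minimal sketch of this insertion step (illustrative Python, not the authors' implementation): each frequency-ordered transaction is walked from the root, reusing an existing child when the item matches and creating a new node otherwise.

class FPNode:
    def __init__(self, item, parent=None):
        self.item = item          # item label; None for the root
        self.count = 0            # transactions passing through this node
        self.parent = parent
        self.children = {}        # item -> FPNode

def insert_transaction(root, ordered_items, header_links):
    # Insert one frequency-ordered transaction into the FP-tree.
    node = root
    for item in ordered_items:
        child = node.children.get(item)
        if child is None:
            child = FPNode(item, parent=node)
            node.children[item] = child
            header_links.setdefault(item, []).append(child)  # header-table node links
        child.count += 1
        node = child

# Building the tree from the five ordered transactions in the figure:
root, links = FPNode(None), {}
for t in [list("fcamp"), list("fcabm"), list("fb"), list("cbp"), list("fcamp")]:
    insert_transaction(root, t, links)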
FP-Growth (cont.)

Item | Prefix paths (conditional pattern base) | Frequent itemsets
p    | <f:2,c:2,a:2,m:2>, <c:1,b:1>            | ({p}:3) ({c,p}:3)
m    | <f:2,c:2,a:2>, <f:1,c:1,a:1,b:1>        | ({m}:3) ({a,m}:3) ({c,m}:3) ({f,m}:3) ({c,a,m}:3) ({f,a,m}:3) ({f,c,m}:3) ({f,c,a,m}:3)
b    | <f:1,c:1,a:1>, <f:1>, <c:1>             | ({b}:3)
a    | <f:3,c:3>                               | ({a}:3) ({c,a}:3) ({f,a}:3) ({f,c,a}:3)
c    | <f:3>                                   | ({c}:4) ({f,c}:3)
f    | (none)                                  | ({f}:4)
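The prefix paths in this table can be read off the tree by following each item's node links up toward the root. A small illustrative sketch, continuing the FPNode class and the 'links' map from the previous sketch:

def conditional_pattern_base(item, header_links):
    # Collect each prefix path above the item's nodes, weighted by that node's count.
    base = []
    for node in header_links.get(item, []):
        path, parent = [], node.parent
        while parent is not None and parent.item is not None:
            path.append(parent.item)
            parent = parent.parent
        if path:
            base.append((list(reversed(path)), node.count))
    return base

# conditional_pattern_base('p', links) yields
# [(['f', 'c', 'a', 'm'], 2), (['c', 'b'], 1)], matching the first row of the table.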
Parallel Optimizations

We target a distributed-memory parallel architecture, where each node has one or more local disks and data exchange among the nodes is done through message passing.

Minimizing Communication Costs
  One of the fundamental challenges in parallel frequent itemset mining is that global knowledge is required to prove an itemset is not frequent.
  Our solution is for each machine to communicate its local FP-tree to all other machines using a ring broadcast protocol.
  To further reduce the volume of communication, instead of communicating the original tree, each processor transfers a concise encoding, a process we term strip marshaling: the tree is represented as an array of integers.
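The exact integer layout used by the paper is not reproduced on this slide; as a hedged illustration, a pre-order traversal that emits (item, count) pairs plus a sentinel when backing up is one encoding that lets the receiver rebuild and merge the tree. This sketch assumes items have been mapped to non-negative integer ids and reuses the FPNode class from above:

END = -1  # sentinel marking the end of a node's children

def strip_marshal(node, out):
    # Pre-order serialization of an FP-tree into a flat integer array.
    for child in node.children.values():
        out.extend((child.item, child.count))
        strip_marshal(child, out)
        out.append(END)
    return out

def merge_strip(root, buf):
    # Merge a received encoding into the local tree by replaying the traversal.
    stack, i = [root], 0
    while i < len(buf):
        if buf[i] == END:
            stack.pop(); i += 1
            continue
        item, count = buf[i], buf[i + 1]
        parent = stack[-1]
        child = parent.children.get(item)
        if child is None:
            child = FPNode(item, parent=parent)
            parent.children[item] = child
        child.count += count
        stack.append(child)
        i += 2

In a ring broadcast, each machine would pass such an array to its neighbor and merge every array it receives into its own tree.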
Parallel Optimizations (cont.)

Pruning Redundant Data
  Prune the local prefix tree of unnecessary data and construct a new FP-tree: to correctly mine an item i, only the nodes on the paths from occurrences of i up to the root are required.
Parallel Optimizations (cont.)

Let I be the set of items assigned to machine M. Consider any i ∈ I, and let ni be a node in the FP-tree whose label is i.
Suppose a particular path P in the tree, of length k, is
  P = (n1, n2, ..., ni, ni+1, ..., nk).
To correctly mine item i, only the portion of P from n1 to ni is required.
Therefore, for every path P in the FP-tree, M may delete all nodes nj occurring after ni.
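A minimal illustrative sketch of this pruning rule (reusing the FPNode class from the earlier sketches): a node is kept only if it lies at or above some occurrence of an item assigned to this machine.

def prune_for_items(node, assigned):
    # Remove every subtree that contains no assigned item.
    # Returns True if this node or some descendant carries an assigned item.
    kept = {}
    for item, child in node.children.items():
        if prune_for_items(child, assigned):
            kept[item] = child
    node.children = kept
    return node.item in assigned or bool(kept)

# Keep only what machine M needs in order to mine, say, items 'm' and 'p':
prune_for_items(root, assigned={'m', 'p'})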
Parallel Optimizations (cont.)

Partitioning the Mining Process
  Tasks are mapped to machines before any mining occurs.
  Intelligent allocation in this fashion can lead to efficient load balancing with low process idle times, in particular with FPGrowth (see the sketch after this slide).

Memory Hierarchy-conscious Mining
  Local tree pruning lowers the main memory requirements. However, the tree may still exceed main memory, in which case mining it can incur page faults. To remedy this issue, we incorporate several optimizations to improve performance:
  Improving the Tree Building Process
  Improving the Mining Process
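The slide does not spell out the allocation policy; as a hedged illustration, one common heuristic is to estimate each item's mining cost (for example by the size of its conditional pattern base) and greedily give the next-heaviest item to the machine with the least load so far:

import heapq

def assign_items(est_cost, num_machines):
    # Longest-processing-time style greedy allocation.
    # est_cost maps each item to an estimated amount of mining work.
    heap = [(0.0, m) for m in range(num_machines)]   # (current load, machine id)
    heapq.heapify(heap)
    assignment = {m: [] for m in range(num_machines)}
    for item in sorted(est_cost, key=est_cost.get, reverse=True):
        load, m = heapq.heappop(heap)
        assignment[m].append(item)
        heapq.heappush(heap, (load + est_cost[item], m))
    return assignment

# e.g. assign_items({'f': 1, 'c': 2, 'a': 4, 'b': 3, 'm': 8, 'p': 5}, num_machines=2)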

Putting It All Together: Architecture-conscious Data Mining
Experiments

Evaluating Storage Reduction (data sets: Kosarak (8), Mushroom (23), Chess (37))
Experiments (cont.)

Evaluating Communication Reduction
  Global T' is the size of the local tree if we maintain full global knowledge at each node.
  Local T is the average size of the local tree if we maintain only local data at each node.
  Local T' is the average size of the local tree if we maintain partial global knowledge at each node.
Experiments (cont.)

Evaluating Weak Scalability
  DD = Data Distribution, IDD = Intelligent Data Distribution, CD = Count Distribution, DFP = Distributed FPGrowth
Experiments (cont.)

Evaluating Strong Scalability
  DD = Data Distribution, DFP = Distributed FPGrowth
  DD and IDD do not scale well on large data sets.
Related Work

Count Distribution (CD)
  In CD, each node scans its local data set to obtain the frequencies of the candidates.
  The frequency information is then shared among the processors to generate a global frequency array.
  Once the global frequencies are obtained, each processor generates a new set of candidates for the next level.
  The algorithm then repeats, as each node again scans the data set to obtain frequencies for the new candidate list (a sketch of one such level follows below).
  CD scales poorly as the number of candidates increases, since the full list is generated by every node.
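A hedged sketch of one CD level in plain Python (a simple in-process sum over per-node count dictionaries stands in for the message-passing exchange; Apriori candidate generation itself is not shown):

from collections import Counter

def local_counts(local_transactions, candidates):
    # Each node scans only its own partition but counts every global candidate.
    counts = Counter()
    for t in local_transactions:
        items = set(t)
        for cand in candidates:             # candidates are tuples of items
            if set(cand) <= items:
                counts[cand] += 1
    return counts

def count_distribution_level(partitions, candidates, min_support):
    # One CD level: local counting on every node, then a global reduction of the counts.
    global_counts = Counter()
    for part in partitions:                 # in CD this summation is the all-reduce step
        global_counts.update(local_counts(part, candidates))
    return {c for c in candidates if global_counts[c] >= min_support}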
Related Work (cont.)

Data Distribution (DD)
  Data Distribution distributes the data set across the nodes.
  At the start of each level, each node takes a turn broadcasting its local data set to every other node.
  Each machine generates a mutually disjoint partition of the candidates.
  At each level, frequent itemsets are communicated so that each processor can asynchronously generate its share of the candidates for the next level.
  However, DD incurs very high communication overhead when broadcasting the entire data set, which is especially costly for large data sets.
Related Work (cont.)

Intelligent Data Distribution (IDD) & Hybrid Distribution (HD)
  Intelligent Data Distribution (IDD) reduces the communication overhead by distributing the data using a ring-based all-to-all broadcast algorithm.
  IDD still suffers from the high overhead of transferring the entire data set at each level.
  To address this problem while minimizing communication and computation overheads, Hybrid Distribution (HD) combines the advantages of the CD and IDD algorithms by dynamically grouping processors and partitioning the candidate set accordingly to maintain good load balance.
  Each group is large enough to hold the candidate list in main memory and executes the IDD algorithm.
  Communication between groups is then performed as in CD.
Conclusion

This paper shows how frequent itemsets can be mined from very large data sets on a parallel architecture.
On the other hand, the paper's organization makes it somewhat difficult for the reader to follow.
Thank You