DisCo: Distributed Co-clustering with Map-Reduce
Spiros Papadimitriou, Jimeng Sun
IBM T.J. Watson Research Center, Hawthorne, NY, USA
Reporter: Nai-Hui Ku

Outline
- Introduction
- Related Work
- Distributed Mining Process
- Co-clustering Huge Datasets
- Experiments
- Conclusions

Introduction
Problems:
- Huge datasets
- Natural sources of data are in impure form

Proposed method:
- A comprehensive Distributed Co-clustering (DisCo) solution using Hadoop
- DisCo is a scalable framework under which various co-clustering algorithms can be implemented

Related Work
The Map-Reduce framework employs a distributed storage cluster that provides:
- block-addressable storage
- a centralized metadata server
- a convenient data-access storage API for Map-Reduce tasks

Co-clustering algorithms differ in:
- Cluster shapes: checkerboard partitions, single bi-cluster, exclusive row and column partitions, overlapping partitions
- Optimization criteria: e.g., code length

Distributed Mining Process
1. Identify the source and obtain the data.
2. Transform the raw data into the appropriate format for analysis.
3. Turn the results into visuals, or into the input for other data analysis applications.

Data Processing
- Pre-processing a 350 GB raw network event log needs over 5 hours just to extract source/destination IP pairs.
- A few commodity nodes running Hadoop achieve much better performance, and setting up Hadoop required minimal effort.

Specifically for co-clustering, there are two main pre-processing tasks:
- Building the graph from the raw data.
- Pre-computing the transpose: co-clustering optimization iterates over both rows and columns, so the adjacency lists of both the original graph and its transpose must be pre-computed (a sketch of such a transpose job appears after this transcript).

Co-clustering Huge Datasets
Definitions:
- Matrices are denoted by boldface capital letters; vectors by boldface lowercase letters.
- a_ij: the (i, j)-th element of matrix A.
- The co-clustering algorithms overviewed here employ a checkerboard partition: the original adjacency matrix is divided into a grid of submatrices.
- For an m x n matrix, a co-clustering is a pair of row and column labeling vectors; r(i) gives the group label of the i-th row of the matrix.
- G: the k×ℓ group matrix; g_pq gives the sufficient statistics for the (p, q) submatrix (a toy computation of G appears after this transcript).

Each optimization iteration is expressed in three parts: a map function, a reduce function, and a global sync step (a sketch of the map step appears after this transcript).

Experiments
Setup:
- 39 nodes, each with two dual-core processors and 8 GB RAM, running Linux RHEL4.
- 4 Gbps Ethernet interconnect.
- SATA disks at 65 MB/sec, or roughly 500 Mbps.
- The total capacity of our HDFS cluster was just 2.4 terabytes; the HDFS block size was set to 64 MB (the default value).
- Java: Sun JDK version 1.6.0_03.

The pre-processing step on the ISS data used these default values:
- 39 nodes, 6 concurrent maps per node, 5 reduce tasks, 256 MB input split size.

Conclusions
- Using relatively low-cost components, I/O rates exceed those of high-performance storage systems.
- Performance scales almost linearly with the number of machines/disks.
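Sketch 1: pre-computing the transpose. The transpose pre-computation maps naturally onto a single Map-Reduce pass. Below is a minimal sketch, assuming each input line holds one adjacency list in the form "src<TAB>dst1 dst2 ..."; the class names are hypothetical, and the code uses the Hadoop 0.x "mapred" API that matches the JDK 1.6-era setup described above. The map step inverts every edge, and the reduce step reassembles adjacency lists, so the transpose ends up stored in the same format as the original graph.

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class TransposeJob {

      // Map: invert every edge, emitting (dst, src).
      public static class TransposeMapper extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, Text> {
        public void map(LongWritable offset, Text line,
                        OutputCollector<Text, Text> out, Reporter reporter)
            throws IOException {
          String[] parts = line.toString().split("\t");
          if (parts.length < 2) return;            // skip isolated nodes
          Text src = new Text(parts[0]);
          for (String dst : parts[1].split(" ")) {
            out.collect(new Text(dst), src);       // edge (src, dst) becomes (dst, src)
          }
        }
      }

      // Reduce: concatenate all sources of a destination into its
      // transposed adjacency list.
      public static class TransposeReducer extends MapReduceBase
          implements Reducer<Text, Text, Text, Text> {
        public void reduce(Text dst, Iterator<Text> srcs,
                           OutputCollector<Text, Text> out, Reporter reporter)
            throws IOException {
          StringBuilder adj = new StringBuilder();
          while (srcs.hasNext()) {
            if (adj.length() > 0) adj.append(' ');
            adj.append(srcs.next().toString());
          }
          out.collect(dst, new Text(adj.toString()));
        }
      }
    }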
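Sketch 2: the group matrix G. To make the definition concrete, here is a toy, self-contained computation: g_pq accumulates a sufficient statistic of the (p, q) submatrix, taken here to be the sum of its entries (appropriate for a binary adjacency matrix; the actual statistic depends on the chosen optimization criterion, so treat this choice as an assumption). All names are illustrative.

    import java.util.Arrays;

    public class GroupMatrixDemo {

      // For an m x n matrix a, row labels r in {0..k-1} and column labels
      // c in {0..ell-1}, g[p][q] sums the entries of the (p, q) submatrix.
      static long[][] groupMatrix(long[][] a, int[] r, int[] c, int k, int ell) {
        long[][] g = new long[k][ell];
        for (int i = 0; i < a.length; i++)
          for (int j = 0; j < a[i].length; j++)
            g[r[i]][c[j]] += a[i][j];  // row i lands in group r[i], column j in c[j]
        return g;
      }

      public static void main(String[] args) {
        long[][] a = { { 1, 0, 1 },    // 2 x 3 toy adjacency matrix
                       { 0, 1, 1 } };
        int[] r = { 0, 1 };            // k = 2 row groups
        int[] c = { 0, 1, 1 };         // ell = 2 column groups
        System.out.println(Arrays.deepToString(groupMatrix(a, r, c, 2, 2)));
        // prints [[1, 1], [0, 2]]
      }
    }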
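Sketch 3: the map step of one row iteration. Each mapper holds the current column labels c and group matrix G (e.g., shipped to every node before the job starts), computes a row's per-column-group count vector s, assigns the row to the cheapest row group, and emits the result keyed by the new label so that the reducers and the global sync step can rebuild G and the row labeling r. DisCo deliberately leaves the optimization criterion pluggable; the L1 distance to each group's mean profile used below is only a placeholder assumption, as are all class and field names.

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class RowLabelMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, IntWritable, Text> {
      private int[] c;             // current column labels
      private long[][] g;          // current k x ell group matrix
      private long[] rowsPerGroup; // rows currently in each row group

      public void configure(JobConf job) {
        // Placeholder state so the sketch runs stand-alone; a real job would
        // read c, g, and rowsPerGroup from files shipped alongside the job.
        c = new int[] { 0, 0, 1, 1 };
        g = new long[][] { { 3, 1 }, { 0, 2 } };
        rowsPerGroup = new long[] { 2, 2 };
      }

      public void map(LongWritable offset, Text line,
                      OutputCollector<IntWritable, Text> out, Reporter reporter)
          throws IOException {
        String[] parts = line.toString().split("\t");
        if (parts.length < 2) return;
        long[] s = new long[g[0].length];
        for (String col : parts[1].split(" "))
          s[c[Integer.parseInt(col)]]++;      // nonzero count per column group

        int best = 0;
        double bestCost = Double.MAX_VALUE;
        for (int p = 0; p < g.length; p++) {
          double cost = 0;                    // placeholder criterion (assumption):
          for (int q = 0; q < s.length; q++)  // L1 distance to group p's mean profile
            cost += Math.abs(s[q] - (double) g[p][q] / Math.max(1, rowsPerGroup[p]));
          if (cost < bestCost) { bestCost = cost; best = p; }
        }

        StringBuilder val = new StringBuilder(parts[0]);
        for (long sq : s) val.append(' ').append(sq);
        out.collect(new IntWritable(best), new Text(val.toString()));
      }
    }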