I2.2 Large-Scale Information Network Processing
INARC I2.2 Mid-Year Report
Charu Aggarwal (IBM), Christos Faloutsos (CMU), Ambuj Singh (UCSB), Xifeng Yan (UCSB)
March 2011

Task Setting: Indexing, Partitioning, and Distributed Processing on Time-Varying Networks

Objectives
– Novel graph index model and advanced distributed graph computing theory to facilitate the processing of (military) linked data, which has become a bottleneck for many research tasks in network science

Key Technical Innovations
– Dynamic graph indexing models and structures
– Scalable graph processing
– Graph partition overlapping and re-balancing theory

Primary Members
– Xifeng Yan (UCSB), Ambuj Singh (UCSB), Charu Aggarwal (IBM), Christos Faloutsos (CMU)

Collaborative Members
– Z. Wen (IBM/SCNARC), J. Bao (RPI/IRC), J. Han (UIUC/INARC), M. Srivatsa (IBM/INARC), V. Kawadia (BBN/IRC), S. Desai (Army)

I2.2: Large-Scale Information Network Processing

Key Objective
– Novel graph index model and advanced distributed graph computing theory to facilitate the processing of (military) linked data, which has become a bottleneck for many research tasks in network science

Deliverables
– Q1: Data collection and cleaning for graph indexing and distributed graph computing
– Q2: Design graph indices with time-varying concerns
– Q3: Design and test distributed graph computing strategies
– Q4: Hypothesis validation and research paper submission

Impact
– Provide fast, scalable, and linked information access to soldiers and commanders

Key Technical Innovations
– Dynamic graph indexing models and structures to resolve graph queries in time-varying information networks
– Query cost models for distributed graph processing
– Graph partition overlapping and re-balancing theory to (1) improve data locality for parallel computing, and (2) accommodate dynamic network data updates and query workload changes
– Self-evolving distributed graph processing environment that adjusts graph partitions dynamically

Role      Researchers
Lead      Xifeng Yan, UCSB
Primary   Ambuj Singh, UCSB
Primary   Charu Aggarwal, IBM
Primary   Christos Faloutsos, CMU
Collab    Z. Wen, IBM, SCNARC
Collab    G. Cao, PSU, CNARC
Collab    J. Han, UIUC, INARC
Total     $326.7K

Advance State-of-the-Art Network Science

Large-Scale Information Network Processing
– Invent scalable information network infrastructure
– Facilitate the processing of (military) linked data, which has become a bottleneck for many research tasks in network science
– Advance our understanding of scalability challenges, not only for information networks but also for other genres of complex networks

The models and the proposed experimental systems provide a fundamental analysis of
– How indexing of dynamic network data affects query performance
– How graph partitioning schemes affect distributed query processing
– How the models and laws of real networks affect the design of graph indexing and partitioning strategies

Military Relevance

Subtask 1: Graph Index and Search (UCSB, IBM)
– Fast access to and processing of time-varying information networks is key for tasks such as intelligence services and query processing. Simply put, we cannot access networks node by node!

Subtask 2: Graph over MapReduce (CMU)
– To process the overwhelming amount of data on the Web and in social networks, emails, and telecommunications, to distill important information such as people's opinions about extremists, to find potential radical groups, and to identify influential nodes, we need powerful graph processing methods.
– Needed by any large-scale network data processing, including information, social, and communication networks

Subtask 3: Graph Partitioning/Distributed Graph Processing (UCSB, CMU)
– Military information is often distributed across many devices; distributed graph processing runs graph algorithms without gathering all the data on a single machine

Subtask 1: Graph Index and Search

Indexing Methods for Large-Scale Static and Dynamic Networks
– Methods for Indexing Massive Disk-Resident Graphs (Aggarwal (IBM), Zhao (UIUC), and Han (UIUC))
– Methods for Indexing Dynamic Network Streams (Aggarwal (IBM), Khan (UCSB), Yan (UCSB))
– Dynamic structural index for label-based queries (Aggarwal (IBM) and Li (UCSB)): accepted at SDM 2011
– Analysis of significant substructures in time-varying networks (Singh (UCSB) et al.): find the highest-scoring substructures, combining structure and time

gDensity: Model-Based Indexing

Problem definition (labeled proximity search)
– Label-based graph proximity search seeks the top-k vertex subsets with the smallest diameters for a given query of distinct labels. Each subset must cover all the labels specified in the query.
– Example queries: Q = ("reconnaissance", "biometric matching", "failure modeling"); Q = (a, b, c)

[Figure: example answers for Q = (a, b, c) with diameters d = 2 and d = 3]

Nan Li et al., Density Index and Proximity Search on Large Graphs, to be submitted to VLDB Journal

gDensity: Ideas and Results
– Can we do better? Which one is more promising?
– Result: 10–300 times faster

[Figure: density distributions of vertices u and v]

Nan Li et al., Density Index and Proximity Search on Large Graphs, to be submitted to VLDB Journal

Graph Search: a Model-Based Approach

Goal: align two networks

[Figure: (a) LinkedIn, (b) Facebook]

Ideas
– Use an information propagation model to propagate labels in information networks
– Convert vertices to vectors
– Align sets of vectors

Query speed: 0.1 sec on WebGraph (10M vertices, 213M edges)

[Figure: information propagation model]

A. Khan et al., Neighborhood Based Fast Graph Search in Large Networks, SIGMOD'11

Search Algorithm
– Step 1: Match a node u of the target graph G with some node v of the query graph Q if L(v) ⊆ L(u) and cost(u, v) is less than a predefined cost threshold ε.
– Step 2: Discard the labels of the unmatched nodes in the target graph.
– Step 3: Propagate the labels only among the matched nodes from the previous step.
– Repeat steps 1 and 2 until no further node can be discarded.

[Figure: target graph G (nodes u1–u6) and query graph Q (nodes v1–v4)]

Dynamic Updates

[Figure: dynamic update in index vs. re-indexing (DBLP). Indexing is performed for h = 2 hops.]

Subtask 2: Graph over MapReduce

Investigate graph properties and graph algorithms using MapReduce
– Spectral Analysis of Billion-Scale Graphs
– Patterns on the Connected Components of Terabyte-Scale Graphs

Study the limitations of the MapReduce architecture for processing network-centric data
– Using the discovered patterns of terabyte-scale real-life graphs

Spectral Analysis of Billion-Scale Graphs

Billion-scale eigensolver
– Computes the top-k eigenvalues and eigenvectors
– Finds anomalies in large graphs
– Many applications: SVD, triangle counting, spectral clustering, …
– A careful implementation of Lanczos on Hadoop can give excellent accuracy as well as scalability

Contribution
– HEigen: a billion-scale eigensolver that can handle 1000x larger matrices than previous methods
– Applying the eigensolver to the Twitter graph helps us spot abnormal users (adult advertisers)

U Kang, et al., Spectral Analysis of Billion-Scale Graphs: Discoveries and Implementation, PAKDD'11

Patterns on the Connected Components of Terabyte-Scale Graphs

A large graph is composed of many connected components
– Q1: static patterns?
– Q2: evolution patterns?
– Q3: model?

Metric: Graph Fractal Dimension(G) = log |E| / log |V|

YahooWeb graph: |V| = 1.4 billion, |E| = 6.7 billion, 120 GBytes

[Figure: connected component count vs. size]

U Kang, et al., Patterns on the Connected Components of Terabyte-Scale Graphs, ICDM 2010

Subtask 3: Graph Partition for Distributed Graph Computing

Are typical techniques efficient for graph queries?

Graph partitioning and distribution techniques (e.g., Pregel)

Limitations
– Unavailable to the public
– Unbalanced workload due to a skewed distribution of graph queries
– Communication overhead due to inter-machine (cross-partition) communication

Sedge: distributed graph processing
– Model-based graph partitioning techniques
– First-of-its-kind distributed graph computing platform for information, social, and communication networks

Shengqi Yang, et al., Managing Large-Scale Graphs for Efficient Distributed Processing, submitted to VLDB 2011

Graph Partition Models
– Complementary Partitions: generate partition sets that are complementary to each other
– Dynamic Workload, Replicate Partitions: replicate partitions that are intensively accessed by many queries
– Dynamic Workload, New Partitions: generate new partitions for regions that are intensively accessed by many cross-partition queries

Shengqi Yang, et al., Managing Large-Scale Graphs for Efficient Distributed Processing, submitted to VLDB 2011

Graph Partitioning with Region Constraint
– Optimal solution: [formula missing in source]
– NP-hard

Global Optimization
– Iteratively repartition the graph
– Before each iteration, increase the weight of the edges in each region with respect to its priority

Graph Partition for Distributed Graph Computing

[Figure: number of machines vs. throughput improvement ratio. 10,000 random queries. The partition number is increased by adding more machines.]

Collaborations and Path Ahead

Collaborations within I2
– Monthly meeting
– Strong connection between I2.1 and I2.2: one problem, two sides (information network processing on DTNs and clusters)
– (I2.1) Work with Arun Iyengar and Mudhakar Srivatsa (IBM), who have done much work on DTN and storage. Shengqi Yang will intern at IBM this summer.
Collaborations with researchers in other networks
– (S1.1) Work with Zhen Wen (IBM) on the social network application of graph density indexing. U Kang was a summer intern at IBM.
– (E1.1, R2.3) Work with Jie Bao (RPI) on RDF queries using neighborhood-based graph search.
– (T2.3) Work with Vikas Kawadia (BBN) on using graph query processing for distributed trust computing. Ziyu Guan is collaborating with Vikas.
– Graph search is connected to (T2.4) M. Goldberg's work on trust structure.
– Work with Sachi Desai (Army) on a graph query language/system.

Next Six Months and Path Ahead to 2012

Continue research on large-scale information network processing (more specific)
(1) Graph indexing on multiple time-varying graph snapshots
(2) Compression-based, model-based information network processing
(3) Edge layout on the Hadoop file system for better compression and better performance
(4) Complementary graph partitioning theories

Other research topics planned
– Models and methods for building complex graph queries
– Models and methods for routing complex graph queries to data sources (for both I2.1 and I2.2)
– Tensor analysis on Hadoop

Research Papers (Accepted/Published)
– A. Khan, N. Li, Z. Guan, X. Yan, S. Chakraborty, and S. Tao, Neighborhood Based Fast Graph Search in Large Networks, Proc. 2011 Int. Conf. on Management of Data (SIGMOD'11), 2011.
– Nicholas D. Larusso and Ambuj K. Singh, Synopses for Probabilistic Data over Large Domains, EDBT'11.
– C. C. Aggarwal and N. Li, On Dynamic Node-Classification in Content-based Networks, SIAM International Conference on Data Mining (SDM) 2011.
– U Kang, Mary McGlohon, Leman Akoglu, and Christos Faloutsos, Patterns on the Connected Components of Terabyte-Scale Graphs, IEEE International Conference on Data Mining (ICDM) 2010, Sydney, Australia.
– U Kang, Brendan Meeder, and Christos Faloutsos, Spectral Analysis of Billion-Scale Graphs: Discoveries and Implementation, PAKDD'11.
– U Kang, Duen Horng Chau, and Christos Faloutsos, Mining Large Graphs: Algorithms, Inference, and Discoveries, IEEE International Conference on Data Engineering (ICDE) 2011, Hannover, Germany.

Research Papers
– Shengqi Yang, Bo Zong, Arijit Khan, Ben Zhao, and Xifeng Yan, Managing Large-Scale Graphs for Efficient Distributed Processing, submitted to VLDB 2011.
– Nan Li, Arijit Khan, Xifeng Yan, and Zhen Wen, Density Index and Proximity Search on Large Graphs, to be submitted to VLDB Journal.
– Petko Bogdanov, Misael Mongiovì, and Ambuj Singh, Mining Heavy-Edges Subnetworks in Time, to be submitted to VLDB Journal.
– C. C. Aggarwal, P. Zhao, and J. Han, On Shortest-Path Indexing of Massive Disk Resident Graphs, Research Report, to be submitted to VLDB Journal.
– C. C. Aggarwal, A. Khan, and X. Yan, A Probabilistic Index for Massive and Dynamic Graph Streams, Research Report, to be submitted to VLDB Journal.

Big Picture
– Stage 1: How to distribute graphs (we are here)
– Stage 2: How to construct queries
– Stage 3: How to execute/route queries

Make information networks accessible to soldiers and commanders

Questions?
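
Illustration: labeled proximity search

The labeled proximity search problem defined on the gDensity slide (find the top-k vertex subsets with the smallest diameters that cover every label in the query) can be made concrete with a small brute-force sketch. This is not the gDensity index and makes no claim about how the authors implement it; it simply enumerates every covering subset on a toy graph. The function names, the toy graph, and the assumption that each vertex carries exactly one label are all hypothetical and chosen only for illustration.

```python
from collections import deque
from itertools import combinations, product
import heapq
import math


def bfs_distances(adj, source):
    """Hop distances from `source` to every reachable vertex (unweighted graph)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def top_k_proximity(adj, labels, query, k):
    """Brute-force labeled proximity search (illustrative assumption, not gDensity).

    Picks one vertex per query label, scores the subset by its diameter
    (largest pairwise hop distance), and returns the k smallest as
    (diameter, subset) pairs, with subset vertices in query-label order.
    """
    # Candidate vertices for each query label.
    candidates = [[v for v in adj if labels[v] == lab] for lab in query]
    # Distances are only needed from candidate vertices.
    dist = {v: bfs_distances(adj, v) for group in candidates for v in group}
    scored = []
    for subset in product(*candidates):
        pairwise = [dist[a].get(b, math.inf) for a, b in combinations(subset, 2)]
        diameter = max(pairwise, default=0)
        if diameter != math.inf:  # skip subsets spanning disconnected vertices
            scored.append((diameter, subset))
    return heapq.nsmallest(k, scored)


# Hypothetical toy graph: undirected adjacency lists and one label per vertex.
adj = {1: [2], 2: [1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
labels = {1: "a", 2: "b", 3: "c", 4: "a", 5: "b"}

if __name__ == "__main__":
    for diameter, subset in top_k_proximity(adj, labels, ("a", "b", "c"), k=2):
        print(diameter, subset)
```

The enumeration is exponential in the number of query labels, which is the cost that an index over vertex neighborhoods is meant to avoid on large graphs.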