29th IEEE International Conference on Data Engineering
Brisbane, Australia | 8 - 11 April 2013
www.icde2013.org

IEEE Technical Committee on Data Engineering - tab.computer.org/tcde

What is the Technical Committee on Data Engineering?
The Technical Committee on Data Engineering (TCDE) of the IEEE Computer Society is concerned with the role of data in the design, development, management and utilization of information systems. The TCDE sponsors the International Conference on Data Engineering (ICDE) and publishes a quarterly newsletter, the Data Engineering Bulletin. There are approximately 1500 members of the TCDE.

How to join the Technical Committee?
If you are a member of the IEEE Computer Society, you may join the TCDE and receive copies of the Data Engineering Bulletin without cost. To become a member, follow the membership form link from this page: tab.computer.org/tcde

IEEE Computer Society member benefits and services: www.computer.org/portal/web/membership/join
Stay ahead of the technology curve with easy access to the most up-to-date and advanced information in the computing world. Advance your career with access to top e-learning courses, online books and leading publications in your area of expertise. Network with the world's foremost technology professionals. Lead the community with volunteering and mentoring opportunities that enable you to both gain exposure and contribute to the field as an author and reviewer.

Table of Contents
Message from the ICDE 2013 Program Committee & General Chairs  2
Conference Venue  5
Conference at a Glance  7
Tuesday 9 April at a Glance  8
Wednesday 10 April at a Glance  12
Thursday 11 April at a Glance  16
Monday 8 April Detailed Program  20
Tuesday 9 April Detailed Program  31
Wednesday 10 April Detailed Program  65
Thursday 11 April Detailed Program  82
Transport Information  109
Social Program  110
Registration & Information Desk  111
Volunteers  112
ICDE 2013 Committees  113

Message from the ICDE 2013 Program Committee and General Chairs
Established in 1984, ICDE has become a premier forum for the dissemination of data management research results among researchers, users, practitioners, and developers. The 29th IEEE International Conference on Data Engineering takes place in Brisbane, QLD, Australia, from April 8 to 11, 2013. We are proud to present its proceedings.
Each of the main conference days features a keynote by a distinguished scientist: Vishal Sikka from SAP (April 9), Alon Halevy from Google (April 10) and Gustavo Alonso from ETH Zurich (April 11).
We thank all authors who submitted their work to ICDE for making the conference happen. We received 443 paper submissions for the research track, 20 submissions for the industrial track, and 69 demo proposals. The program committee was organized into 16 topic-based areas, each headed by an area chair who was in charge of overseeing the evaluation of submissions assigned to that area. The area chairs were: Magda Balazinska, Panos Chrysanthis, Amol Deshpande, Xin (Luna) Dong, Alan Fekete, Vagelis Hristidis, Paul Larson, Wolfgang Lehner, Xuemin Lin, Srinivasan Parthasarathy, Jian Pei, Simonas Saltenis, Pierangela Samarati, Nesime Tatbul, Xifeng Yan and Jeffrey Xu Yu. Each submission was assigned to three reviewers drawn from the research program committee (144 members), the industrial program committee (12 members), or the demo program committee (28 members).
The evaluation process had several phases: assignment of papers to reviewers, reviewing, discussions among reviewers, decision making by area chairs, consolidation of decisions, and handling of papers assigned for shepherding. As a result of these efforts, the research program features 95 papers, the industrial program 8 papers, and the demonstration program 27 demos. The conference program also includes 9 seminar tutorials and one panel. As has become a feature of ICDE conferences in recent years, all papers are also presented at a poster session. Accompanying the main conference are 8 workshops.
The success of ICDE 2013 is in large measure the result of the work of many committed scientists, all with many demands on their time, who contributed their expertise generously. We thank the area chairs, mentioned above, and the many program committee members for all their essential efforts. Next, we thank Sang Cha and Haixun Wang, who served as industrial program chairs; Yoshiharu Ishikawa, Yanchun Zhang and Rui Zhang, who served as demo chairs; Alexandros Labrinidis, who served as seminar chair; Dimitrios Georgakopoulos and Jun Yang, who served as panel chairs; Chee Yong Chan and Kjetil Nørvåg, who served as workshops chairs; and the organizers of the accompanying workshops, including Gottfried Vossen and Min Wang, who chaired the Ph.D. Symposium.
We also express our deep appreciation of the outstanding work put in over many months by the organization team: Shazia Sadiq and Heng Tao Shen served as local organization chairs, Marta Indulska served as finance chair, Mohamed Sharaf served as web and publicity chair, and Jiaheng Lu and Egemen Tanin served as proceedings chairs. We thank Kathleen Williamson from The University of Queensland for assisting with coordination and a wide range of local organization tasks. Carmen Saliba and Alkenia Winston from the IEEE Computer Society's Conference Support Services helped secure the various necessary contracts in a timely manner. We are also thankful to the many student volunteers from The University of Queensland, and we wish to acknowledge the CMT team at Microsoft for their assistance. It has been a pleasure to work with such a committed and insightful group of people who really care about ICDE and the data management research community. Without the contributions of all of these people, the conference would not have been a success.
Two committees consisting of distinguished and trusted members of the data management community are in charge of identifying this year's best ICDE paper, namely Divyakant Agrawal, Dan Suciu, Yufei Tao and Gerhard Weikum (chair), and the most influential paper published at ICDE 10+ years earlier, namely AnHai Doan, Christos Faloutsos, Donald Kossmann, Samuel Madden and Krithi Ramamritham, with Paul Larson serving as coordinator.
We also gratefully acknowledge the financial support of our sponsors: SAP as Diamond Sponsor; Microsoft and Tourism and Events Queensland as Platinum Sponsors; HP and The University of Queensland as Gold Sponsors; CSIRO, RMIT University, Oracle Labs, the SA Center for Big Data Research at Renmin University, The University of Melbourne, and NICTA as Silver Sponsors; and Google, NEC, Facebook and The University of New South Wales as Bronze Sponsors.
Finally, we thank all presenters and conference participants. We hope you all enjoy the conference!

ICDE 2013 PC Chairs
Christian S. Jensen, Aarhus University, Denmark
Chris Jermaine, Rice University, USA
Xiaofang Zhou, The University of Queensland, Australia

ICDE 2013 General Chairs
Rao Kotagiri, The University of Melbourne, Australia
Beng Chin Ooi, National University of Singapore, Singapore

Message from the Minister for Tourism, Major Events, Small Business and the Commonwealth Games, the Honourable Jann Stuckey MP
Welcome to Brisbane as Queensland's capital city plays host to the IEEE International Conference on Data Engineering (ICDE) for the first time in the event's history, as it celebrates its 29th year in 2013. The Newman Government is proud to host delegates from interstate and overseas, as ICDE provides a platform to address issues in designing, building, managing and evaluating advanced data-intensive systems and applications.
Tourism and events are intrinsically linked. This is why our Government has merged Events Queensland and Tourism Queensland into a single entity, Tourism and Events Queensland, to deliver the best outcomes for the State. Business events contribute almost $700 million annually to our economy, and events like ICDE will raise Queensland's profile and increase visitation to the State. The 29th IEEE International Conference on Data Engineering joins a growing calendar of events for Queensland which, for Brisbane, includes the QPAC International Series featuring the Bolshoi Ballet, the British & Irish Lions Tour, the Brisbane Festival and many more.
I wish you a successful conference and hope you enjoy your stay in Queensland's capital city.
The Honourable Jann Stuckey
Queensland Minister for Tourism, Major Events, Small Business and the Commonwealth Games

Conference Venue
ICDE 2013 will be held at the Sofitel Brisbane Central Hotel, whose main entrance at 249 Turbot Street, Brisbane CBD, is conveniently located above Central Railway Station. The Sofitel also has a secondary entrance off Ann Street. Below is a map of the Sofitel Hotel's conference centre. All conference sessions and catering will be held in this area. Please note that the only room not located on this floor is the Odeon room, which is located down the escalator on the ground floor.
[Floor plan: the Sofitel conference centre comprises Bastille 1 and 2, Concorde, St Germain, Ballroom 1-3, Ballroom Le Grande and Trocadero, reached from the Ann Street lobby; the Odeon room is on the ground floor via the escalators.]

ICDE 2013 Conference at a Glance

Monday 8 April
9AM - 5PM  Workshops (full day unless noted):
  W1 DESWEB - Bastille 2
  W2 SMDB - St Germaine
  W3 PrivDB - Concorde
  W4 MoDA (half day) - Ballroom 3
  W5 DGSS - Bastille 1
  W6 GDM - Ballroom 1
  W7 DMC - Ballroom 2
  PhD Symposium - Odeon
10:30 - 11AM  Break
11AM - 12:30PM  All workshops continue
12:30 - 1:30PM  Lunch
1:30 - 3PM  Full-day workshops continue
3 - 3:30PM  Break
3:30 - 5PM  Full-day workshops continue
7 - 9PM  IEEE TCDE Member Reception - Summit Restaurant, Mt Coot-Tha

Tuesday 9 April
9 - 9:30AM  Opening - Ballroom 1
9:30 - 10:30AM  Keynote: Vishal Sikka (SAP AG) - Ballroom 1
10:30 - 11AM  Break
11AM - 12:30PM  R1 Main Memory Databases - Ballroom 1; R2 MapReduce Algorithms - St Germaine; R3 Data History - Bastille 1; R4 Top-k Query in Uncertain Data - Bastille 2; Industry 1 - Concorde; Seminar 1 - Ballroom 2; Seminar 2 - Odeon
12:30 - 2PM  Lunch
2 - 3:30PM  Demo Groups 1 & 2 - Ballroom 2; R5 Uncertainty in Spatial Data - Ballroom 1; R6 Data Extraction - St Germaine; R7 Trajectory Databases - Bastille 1; R8 Social Networks - Bastille 2; Industry 2 - Concorde; Seminar 3 - Odeon
3:30 - 4PM  Break
4 - 5:30PM  Demo Groups 1 & 2 - Ballroom 2; R9 Indexing Structures - Ballroom 1; R10 Main Memory Query Processing - St Germaine; R11 Data Mining I - Bastille 1; R12 Moving Objects - Bastille 2; Industry 3 - Concorde; Seminar 4 - Odeon
5:30 - 7PM  Welcome Reception - Lobby

Wednesday 10 April
9 - 10AM  Keynote: Alon Halevy (Google Inc) - Ballroom Le Grand
10 - 10:30AM  Break
10:30AM - 12PM  R13 Data Cleaning - St Germaine; R14 Social Media I - Bastille 1; R15 Data Trust - Bastille 2; R16 Data on the Cloud - Concorde; Seminar 5 - Odeon
12 - 1:30PM  SAP Business Lunch - Ballroom Le Grand
1:30 - 2PM  ICDE Award Presentations - Ballroom Le Grand
2 - 3PM  Keynote: 10 Year Most Influential Papers - Ballroom Le Grand
3 - 3:30PM  Break
3:30 - 5PM  R17 Similarity Ranking - St Germaine; R18 Spatial Databases - Bastille 1; R19 Social Media II - Bastille 2; R20 Trees & XML - Concorde; Seminar 6 - Odeon
5 - 6PM  Poster session commences - Ballroom 1 & 2
6:30 - 10PM  Banquet - Brisbane City Hall

Thursday 11 April
9 - 10AM  Keynote: Gustavo Alonso (ETH Zurich) - Ballroom 1 & 2
10 - 10:30AM  Break
10:30AM - 12PM  Panel: Big Data for the Public - Ballroom 1 & 2; R21 Security and Privacy - St Germaine; R22 Randomized Algorithms for Graphs - Bastille 1; R23 Distributed Data Processing - Bastille 2; R24 Data Mining II - Concorde; Seminar 7 - Odeon; Demo Groups 3 & 4 - Ballroom 3
12 - 1:30PM  Lunch
1:30 - 3PM  Demo Groups 3 & 4 - Ballroom 3; R25 Lineage & Provenance - St Germaine; R26 Similarity Search - Bastille 1; R27 Shortest & Direct Query - Bastille 2; R28 Skyline & Snapshot Query - Concorde; Seminar 8 - Odeon
3 - 3:30PM  Break
3:30 - 5PM  R29 Large Graph Indexing - St Germaine; R30 Web Data - Bastille 1; R31 Query Optimization - Bastille 2; R32 Data Storage - Concorde; Seminar 9 - Odeon
3:30 - 6PM  Posters & Drinks - Ballroom 1 & 2

Tuesday 9 April at a Glance: Keynote, Seminar, Industry & Demo Sessions
9 - 9:30AM  Opening - Ballroom 1
9:30 - 10:30AM  Keynote: Vishal Sikka (SAP AG) (p. 31) - Ballroom 1
10:30 - 11AM  Break
11AM - 12:30PM  Seminar 1: Machine Learning on Big Data (p. 38) - Ballroom 2
For details of the research sessions for Tuesday 9 April, see the next page.
12:30 - 2PM  Lunch
2 - 3:30PM  Demo Groups 1 & 2 (p. 49) - Ballroom 2:
  Twitter+: Build Personalized Newspaper For Twitter
  A Generic Database Benchmarking Service
  Aeolus: An Optimizer for Distributed Intra-Node-Parallel Streaming Systems
  Crowd-Answering System via Microblogging With a Little Help from My Friends
  Peeking into the Optimization of Data Flow Programs with MapReduce-style UDFs
  Very Fast Estimation for Result and Accuracy of Big Data Analytics: the EARL System
3:30 - 4PM  Break
4 - 5:30PM  Demo Groups 1 & 2 (p. 49) - Ballroom 2:
  Road Network Mix-zones for Anonymous Location Based Services
  Query Time Scaling of Attribute Values in Interval Timestamped Databases
  Extracting Interesting Related Context-dependent Concepts from Social Media Streams using Temporal Distributions
  VERDICT: Privacy-Preserving Authentication of Range Queries in Location-based Services
  Real-time Abnormality Detection System for Intensive Care Management
  ExpFinder: Finding Experts by Graph Pattern Matching

Tuesday 9 April at a Glance: Keynote, Seminar, Industry & Demo Sessions (continued)
10:30 - 11AM  Break
11AM - 12:30PM  Seminar 2: Big Data Integration (p. 38) - Odeon
  Industry 1 (p. 39) - Concorde:
  Invited Talk: Big Data Analytics at Facebook
  Invited Paper: Data Services for E-tailers Leveraging Search Engine Assets
  Invited Paper: SAP HANA Distributed In-Memory Database System: Transaction, Session, and Metadata Management
12:30 - 2PM  Lunch
2 - 3:30PM  Seminar 3: Workload Management for Big Data Analytics (p. 47) - Odeon
  Industry 2 (p. 47) - Concorde:
  Invited Paper: HFMS: Managing the Lifecycle and Complexity of Hybrid Analytic Data Flows
  Invited Paper: KuaFu: Closing the Parallelism Gap in Database Replication
  Materialization Strategies in the Vertica Analytic Database: Lessons Learned
3:30 - 4PM  Break
4 - 5:30PM  Seminar 4: Knowledge Harvesting from Text and Web Sources (p. 62) - Odeon
  Industry 3 (p. 62) - Concorde:
  Pipe Break Prediction: A Data Mining Method
  SASH: Enabling Continuous Incremental Analytic Workflows on Hadoop
  Automating Pattern Discovery for Rule Based Data Standardization Systems
5:30 - 7PM  Welcome Reception - Lobby

Tuesday 9 April at a Glance: Research Sessions
11AM - 12:30PM
R1 Main Memory Databases (p. 32) - Ballroom 1:
  CPU and Cache Efficient Management of Memory-Resident Databases
  Identifying Hot and Cold Data in Main-Memory Databases
  The Adaptive Radix Tree: ARTful Indexing for Main-Memory Databases
R2 MapReduce Algorithms (p. 33) - St Germaine:
  Finding Connected Components on Map-Reduce in Logarithmic Rounds
  Enumerating Subgraph Instances Using Map-Reduce
  Scalable Maximum Clique Computation Using MapReduce
2 - 3:30PM
R5 Uncertainty in Spatial Data (p. 40) - Ballroom 1:
  Voronoi-based Nearest Neighbor Search for Multi-Dimensional Uncertain Databases
  Interval Reverse Nearest Neighbor Queries on Uncertain Data with Markov Correlations
  Efficient Tracking and Querying for Coordinated Uncertain Mobile Objects
R6 Data Extraction (p. 42) - St Germaine:
  Attribute Extraction and Scoring: A Probabilistic Approach
  TYPifier: Inferring the Type Semantics of Structured Data
  SUSIE: Search Using Services and Information Extraction
4 - 5:30PM
R9 Indexing Structures (p. 55) - Ballroom 1:
  The Bw-Tree: A B-tree for New Hardware Platforms
  Secure and Efficient Range Queries on Outsourced Databases Using R̂-trees
  An Efficient and Compact Indexing Scheme for Large-scale Data Store
R10 Main Memory Query Processing (p. 57) - St Germaine:
  Recycling in Pipelined Query Evaluation
  Efficient Many-Core Query Execution in Main Memory Column-Stores
  Main-Memory Hash Joins on Multi-Core CPUs: Tuning to the Underlying Hardware
11AM - 12:30PM
R3 Data History (p. 34) - Bastille 1:
  Time Travel in a Scientific Array Database
  Ficklebase: Looking into the Future to Erase the Past
  Time Travel in Column Stores
R4 Top-k Query in Uncertain Data (p. 36) - Bastille 2:
  Top-k Query Processing in Probabilistic Databases with Non-Materialized Views
  Cleaning Uncertain Data for Top-k Queries
  Top-K Oracle: A New Way to Present Top-K Tuples for Uncertain Data
2 - 3:30PM
R7 Trajectory Databases (p. 43) - Bastille 1:
  Towards Efficient Search for Activity Trajectories
  On Discovery of Gathering Patterns from Trajectories
  Destination Prediction by Sub-Trajectory Synthesis and Privacy Protection Against Such Prediction
R8 Social Networks (p. 45) - Bastille 2:
  Scalable and Parallelizable Processing of Influence Maximization for Large-Scale Social Networks
  SociaLite: Datalog Extensions for Efficient Social Network Analysis
  LinkProbe: Probabilistic Inference on Large-Scale Social Networks
4 - 5:30PM
R11 Data Mining I (p. 58) - Bastille 1:
  Coupled Clustering Ensemble: Incorporating Coupling Relationships Both between Base Clusterings and Objects
  Focused Matrix Factorization For Audience Selection in Display Advertising
  Graph Stream Classification using Labeled and Unlabeled Graphs
R12 Moving Objects (p. 60) - Bastille 2:
  Large-Scale Dynamic Taxi Ridesharing Service
  Efficient Notification of Meeting Points for Moving Groups via Independent Safe Regions
  Efficient Distance-Aware Query Evaluation on Indoor Moving Objects

Wednesday 10 April at a Glance: Keynote, Seminar, Industry & Demo Sessions
9 - 10AM  Keynote: Alon Halevy (Google Inc.) (p. 65) - Ballroom Le Grand
10 - 10:30AM  Break
10:30AM - 12PM  Seminar 5: Sorting in Space: Multidimensional, Spatial, and Metric Data Structures for Applications in Spatial Databases, Geographic Information Systems (GIS), and Location-based Services (p. 72) - Odeon
For details of the research sessions for Wednesday 10 April, see the next page.
12 - 1:30PM  SAP Business Lunch - Ballroom Le Grand
1:30 - 2PM  ICDE Award Presentations (p. 72) - Ballroom Le Grand
2 - 3PM  Keynote: 10 Year Most Influential Papers (p. 72) - Ballroom Le Grand
3 - 3:30PM  Break
3:30 - 5PM  Seminar 6: Triples in the Clouds (p. 81) - Odeon
6:30 - 10PM  Banquet - Brisbane City Hall, Auditorium, King George Square

Wednesday 10 April at a Glance: Research Sessions
10:30AM - 12PM
R13 Data Cleaning (p. 66) - St Germaine:
  HANDS: A Heuristically Arranged Non-Backup In-line Deduplication System
  Holistic Data Cleaning: Putting Violations Into Context
  Inferring Data Currency and Consistency for Conflict Resolution
R14 Social Media I (p. 67) - Bastille 1:
  LSII: An Indexing Structure for Exact Real-Time Search on Microblogs
  Utilizing Social Pressure in Recommender Systems
  Presenting Diverse Location Views with Real-time Near-duplicate Photo Elimination
12 - 1:30PM  SAP Business Lunch
1:30 - 2PM  ICDE Award Presentations (p. 72) - Ballroom Le Grand
2 - 3PM  Keynote: 10 Year Most Influential Papers (p. 72) - Ballroom Le Grand
3 - 3:30PM  Break
3:30 - 5PM
R17 Similarity Ranking (p. 73) - St Germaine:
  Efficient Search Algorithm for SimRank
  Towards Efficient SimRank Computation on Large Graphs
  RoundTripRank: Graph-based Proximity with Importance and Specificity
R18 Spatial Databases (p. 76) - Bastille 1:
  Finding Distance-Preserving Subgraphs in Large Road Networks
  Maximum Visibility Queries in Spatial Databases
  Memory-Efficient Algorithms for Spatial Network Queries
10:30AM - 12PM
R15 Data Trust (p. 69) - Bastille 2:
  Publicly Verifiable Grouped Aggregation Queries on Outsourced Data Streams
  Trustworthy Data from Untrusted Databases
  On the Relative Trust between Inconsistent Data and Inaccurate Constraints
R16 Data on the Cloud (p. 70) - Concorde:
  Catch the Wind: Graph Workload Balancing on Cloud
  EAGRE: Towards Scalable I/O Efficient SPARQL Query Evaluation on the Cloud
  C-Cube: Elastic Continuous Clustering in the Cloud
3:30 - 5PM
R19 Social Media II (p. 77) - Bastille 2:
  A Unified Model for Stable and Temporal Topic Detection from Social Media Data
  Crowdsourced Enumeration Queries
  On Incentive-based Tagging
R20 Trees & XML (p. 79) - Concorde:
  Ontology-based Subgraph Querying
  Stratification Driven Placement of Complex Data: A Framework for Distributed Data Analytics
  Optimizing Approximations of Query Lineage in Probabilistic XML

Thursday 11 April at a Glance: Keynote, Seminar, Industry & Demo Sessions
9 - 10AM  Keynote: Gustavo Alonso (ETH Zurich) (p. 82) - Ballroom 1 & 2
10 - 11:30AM  Seminar 7: Querying Encrypted Data (p. 89) - Odeon
  Demo Groups 3 & 4 (p. 90) - Ballroom 3:
  Pigora: An Integration System for Probabilistic Data
  Complex Pattern Matching in Complex Structures: the XSeq Approach
  T-Music: A Melody Composer based on Frequent Pattern Mining
  SHARE: Secure information sHaring frAmework for emeRgency management
  KORS: Keyword-aware Optimal Route Search System
  CrowdPlanr: Planning Made Easy with Crowd
  ASVTDECTOR: A Practical Near Duplicate Video Retrieval System
11:30AM - 12PM  Panel: Big Data for the Public (p. 89) - Ballroom 1 & 2
For details of the research sessions for Thursday 11 April, see the next page.
12 - 1:30PM  Lunch
1:30 - 3PM  Seminar 8: Shallow Information Extraction for the Knowledge Web (p. 102) - Odeon
  Demo Groups 3 & 4 (p. 90) - Ballroom 3:
  YumiInt - A Deep Web Integration System for Local Search Engines for Geo-referenced Objects
  A Demonstration of the G* Graph Database System
  RECODS: Replica Consistency-On-Demand Store
  SODIT: An Innovative System for Outlier Detection using Multiple Localized Thresholding and Interactive Feedback
  COLA: A Cloud-based System for Online Aggregation
  Tajo: A Distributed Data Warehouse System on Large Clusters
  RoadAlarm: a Spatial Alarm System on Road Networks
3 - 3:30PM  Break
3:30 - 5PM  Seminar 9: Secure and Privacy-Preserving Database Services in the Cloud (p. 108) - Odeon
3:30 - 6PM  Posters & Drinks - Ballroom 1 & 2

Thursday 11 April at a Glance: Research Sessions
10 - 11:30AM
R21 Security & Privacy (p. 83) - St Germaine:
  Secure Nearest Neighbor Revisited
  Accurate and Efficient Private Release of Datacubes and Contingency Tables
  Differentially Private Grids for Geospatial Data
R22 Randomized Algorithms for Graphs (p. 85) - Bastille 1:
  Faster Random Walks By Rewiring Online Social Networks On-The-Fly
  Sampling Node Pairs Over Graphs
  Link Prediction across Networks by Biased Cross-Network Sampling
R23 Distributed Data Processing (p. 86) - Bastille 2:
  Interval Indexing and Querying on Key-Value Cloud Stores
  Robust Distributed Stream Processing
R24 Data Mining II (p. 87) - Concorde:
  Learning to Rank from Distant Supervision: Exploiting Noisy Redundancy for Relational Entity Search
  AFFINITY: Efficiently Querying Statistical Measures on Time-Series Data
  Forecasting the Data Cube: A Model Configuration Advisor for Multi-Dimensional Data Sets
1:30 - 3PM
R25 Lineage & Provenance (p. 96) - St Germaine:
  SubZero: a Fine-Grained Lineage System for Scientific Databases
  Logical Provenance in Data-Oriented Workflows
  Revision Provenance in Text Documents of Asynchronous Collaboration
R26 Similarity Search (p. 97) - Bastille 1:
  Inverted Linear Quadtree: Efficient Top K Spatial Keyword Search
  Similarity Query Processing for Probabilistic Sets
  Top-k String Similarity Search with Edit-Distance Constraints
R27 Shortest & Direct Query (p. 98) - Bastille 2:
  On Shortest Unique Substring Queries
  Engineering Generalized Shortest Path Queries
  Efficient Direct Search on Compressed Genomic Data
R28 Skyline & Snapshot Query (p. 100) - Concorde:
  On Answering Why-not Questions in Reverse Skyline Queries
  Layered Processing of Skyline-Window-Join (SWJ) Queries using Iteration-Fabric
  Efficient Snapshot Retrieval over Historical Graph Data
3:30 - 5PM
R29 Large Graph Indexing (p. 102) - St Germaine:
  FERRARI: Flexible and Efficient Reachability Range Assignment for Graph Indexing
  gIceberg: Towards Iceberg Analysis in Large Graphs
  Top-k Graph Pattern Matching over Large Graphs
R30 Web Data (p. 103) - Bastille 1:
  Breaking the Top-k Barrier of Hidden Web Databases
  Automatic Extraction of Top-k Lists from the Web
  Finding Interesting Correlations with Conditional Heavy Hitters
R31 Query Optimisation (p. 105) - Bastille 2:
  Predicting Query Execution Time: Are Optimizer Cost Models Really Unusable?
  Query Optimization for Differentially Private Data Management Systems
  Top Down Plan Generation: From Theory to Practice
R32 Data Storage (p. 107) - Concorde:
  TBF: A Memory-Efficient Replacement Policy for Flash-based Caches
  Fast Peak-to-Peak Behavior with SSD Buffer Pool
  SELECT Triggers for Data Auditing

DETAILED PROGRAM FOR MONDAY 8 APRIL
Monday 8 April: Workshops co-located with ICDE 2013

Workshop 1: Data Engineering Meets the Semantic Web - DESWEB
9.00am - 5.00pm, Bastille 2
Keynote: Truth Finding on the Deep Web
  Xin Luna Dong (Google Inc.)
Regular Papers
  WARP: Workload-Aware Replication and Partitioning for RDF
  Katja Hose (Aalborg University), Ralf Schenkel (Max Planck Institute for Informatics)
  SESM: Semantic Enrichment of Schema Mappings
  Yoones A. Sekhavat, Jeffrey Parsons (Memorial University of Newfoundland)
  Introducing Shadows: Flexible Document Representation and Annotation on the Web
  Matheus Silva Mota, Claudia Bauzer Medeiros (University of Campinas)
Keynote: Learning to Predict Missing Edges in Real-World Interest Graphs: An Infinitely Scalable Cloud Approach
  Ralf Herbrich (Amazon)
Late-breaking Results, Visions and Challenges
  Automated Educated Guessing
  Aleksandar Stupar, Sebastian Michel (Saarland University)
  Eight Fallacies when Querying the Web of Data
  Jürgen Umbrich (National University of Ireland), Claudio Gutierrez (Universidad de Chile), Aidan Hogan, Marcel Karnstedt, Josiane Xavier Parreira (National University of Ireland)
  Hybrid Graph and Relational Query Processing in Main Memory
  Martin Grund, Philippe Cudré-Mauroux (University of Fribourg), Jens Krüger, Hasso Plattner (Hasso Plattner Institute)
  A Vision for SPARQL Multi-Query Optimization on MapReduce
  Kemafor Anyanwu (North Carolina State University)
  Recommending Environmental Knowledge As Linked Open Data Cloud Using Semantic Machine Learning
  Ahsan Morshed, Ritaban Dutta (CSIRO), Jagannath Aryal (University of Tasmania)

Workshop 2: Self-Managing Database Systems - SMDB
9.00am - 5.00pm, St Germaine
Keynote
  Timos Sellis (RMIT University)
Applications of Self-Management
  Realistic Tenant Traces for Enterprise DBaaS
  Jan Schaffner (Hasso Plattner Institute), Tim Januschowski (SAP Innovation Center)
  Model Ensemble Tools for Self-Management in Data Centers
  Jin Chen (University of Toronto), Gokul Soundararajan (NetApp / University of Toronto), Saeed Ghanbari, Cristiana Amza (University of Toronto)
  Total Operator State Recall - Cost-effective Reuse of Results in Greenplum Database
  George C. Caragea, Carlos Garcia-Alvarado, Michalis Petropoulos, Florian M. Waas (Greenplum, A Division of EMC)
Foundations of Self-Management
  INUM+: A Leaner, More Accurate and More Efficient Fast What-if Optimizer
  Rui Wang, Quoc Trung Tran, Ivo Jimenez, Neoklis Polyzotis (University of California Santa Cruz)
  Automatic Schema Design for Co-Clustered Tables
  Stephan Baumann (Ilmenau University of Technology), Peter Boncz (Centrum Wiskunde & Informatica), Kai-Uwe Sattler (Ilmenau University of Technology)
  Performance Optimization for Distributed Intra-Node-Parallel Streaming Systems
  Matthias J. Sax (Humboldt-Universität zu Berlin), Malu Castellanos, Qiming Chen, Meichun Hsu (Hewlett-Packard Laboratories)
  Self-managing Load Shedding for Data Stream Management Systems
  Thao N. Pham, Panos K. Chrysanthis, Alexandros Labrinidis (University of Pittsburgh)
Panel: Self-management and Big Data

Workshop 3: Privacy-Preserving Data Publication and Analysis - PrivDB
9.00am - 5.00pm, Concorde
Keynote: Challenges to De-anonymization and Privacy Protection in Online Advertising
  Peng Liu (Sohu Inc.)
Research Session 1
  Empirical Privacy and Empirical Utility of Anonymized Data
  Graham Cormode, Cecilia M. Procopiuc (AT&T Labs-Research), Entong Shen (North Carolina State University), Divesh Srivastava (AT&T Labs-Research), Ting Yu (North Carolina State University)
  Privacy-Protecting Index for Outsourced Databases
  Chung-Min Chen, Andrzej Cichocki, Allen McIntosh, Euthimios Panagos (Applied Communication Sciences)
  On Syntactic Anonymity and Differential Privacy
  Chris Clifton (Purdue University), Tamir Tassa (The Open University, Israel)
Tutorial: Building Blocks of Privacy: Differentially Private Mechanisms
  Graham Cormode (AT&T Labs-Research)
Invited Talk: Accurate Analysis of Large Private Datasets
  Vibhor Rastogi (Google Inc.)
Research Session 2
  On Information Leakage by Indexes over Data Fragments
  Sabrina De Capitani di Vimercati, Sara Foresti (Università degli Studi di Milano), Sushil Jajodia (George Mason University), Stefano Paraboschi (Università degli Studi di Bergamo), Pierangela Samarati (Università degli Studi di Milano)
  Privacy against Aggregate Knowledge Attacks
  Olga Gkountouna, Katerina Lepenioti (National Technical University of Athens), Manolis Terrovitis (Institute for the Management of Information Systems)

Workshop 4: Mobile Data Analytics - MoDA
9.00am - 12.30pm, Ballroom 3
Session 1 - Research Papers
  Client-Centric OLAP on Mobile Devices
  Zheng Xu, Wo-Shun Luk (Simon Fraser University), Stephen Petchulat (SAP Research Canada)
  Signature Generation for Sensitive Information Leakage in Android Applications
  Hiroki Kuzuno, Satoshi Tonami (SECOM)
  RFID Based Vehicular Networks for Smart Cities
  Joydeep Paul, Baljeet Malhotra, Simon Dale (SAP Next Business and Technology), Meng Qiang (National University of Singapore)
Session 2 - Invited Papers
  ShareLikesCrowd: Mobile Analytics for Participatory Sensing and Crowd-sourcing Applications
  Arkady Zaslavsky, Prem Prakash Jayaraman (ICT Centre, CSIRO), Shonali Krishnaswamy (I2R Singapore)
  Strong Location Privacy: A Case Study on Shortest Path Queries
  Kyriakos Mouratidis (Singapore Management University)
  On the Link(s) Between "D" and "A" in Mobile Data Analytics
  Goce Trajcevski (Northwestern University)

Workshop 5: Data-Driven Decision Guidance and Support Systems - DGSS
9.00am - 5.00pm, Bastille 1
Keynote: Will Internet of Things Flood DGSS with Data?
  Arkady Zaslavsky (ICT Centre, CSIRO)
  Using Military Operational Planning System Data to Drive Reserve Stocking Decisions
  Rajesh Thiagarajan, Mirza Arif Mekhtiev, Greg Calbert, Nikifor Jeremic, Don Gossink (Defence Science and Technology Organisation)
  SmartCart: A Consolidated Shopping Cart for Pareto-Optimal Sourcing and Fair Discount Distribution
  Brian Goodhart, Venkata Yerneni, Alex Brodsky, Venkata Rudraraju, Nathan Egge (George Mason University)
  Water Desalination Supply Chain Modelling and Optimization
  Malak T. Al-Nory, Stephen C. Graves (Massachusetts Institute of Technology)
  Using Graphical Models and Multi-attribute Utility Theory for Probabilistic Uncertainty Handling in Large Systems, with Application to the Nuclear Emergency Management
  Manuele Leonelli, James Q. Smith (The University of Warwick)
  Multivariate Data-Driven Decision Guidance for Clinical Scientists
  Frada Burstein, Daswin De Silva (Monash University), Herbert F. Jelinek (Charles Sturt University), Andrew Stranieri (University of Ballarat)
  ODSS: A Decision Support System for Ocean Exploration
  Kevin Gomes, Danelle Cline, Duane Edgington, Michael Godin, Thom Maughan, Mike McCann, Tom O'Reilly, Fred Bahr, Francisco Chavez, Monique Messié (Monterey Bay Aquarium Research Institute), Jnaneshwar Das (University of Southern California), Kanna Rajan (Monterey Bay Aquarium Research Institute)
Wrap-Up Discussion on the future and organization of DGSS

Workshop 6: Graph Data Management: Techniques and Applications - GDM
9.00am - 5.00pm, Ballroom 1
Keynote: Efficient Processing of Complex Join Queries over Graphs on the Cloud
  Lei Chen (Hong Kong University of Science and Technology)
Research Session 1
  SuReQL: A Subgraph Match Based Relational Model for Large Graphs (Short Paper)
  Shijie Zhang, Jiong Yang, Boya Sun (Case Western Reserve University)
  Ranking Outlier Nodes in Subspaces of Attributed Graphs
  Emmanuel Müller (Karlsruhe Institute of Technology / University of Antwerp), Patricia Iglesias Sánchez, Yvonne Mülle, Klemens Böhm (Karlsruhe Institute of Technology)
  Chordless Cycles in Networks
  John Pfaltz (University of Virginia)
Keynote
  Haixun Wang (Microsoft Research)
Research Session 2
  PSOGD: A New Method for Graph Drawing
  Jianhua Qu (Shandong Normal University), Yi Song, Stéphane Bressan (National University of Singapore)
  Clustering Remote RDF Data Using SPARQL Update Queries
  Letao Qi, Harris Lin, Vasant Honavar (Iowa State University)

Workshop 7: Data Management in the Cloud - DMC
9.00am - 5.00pm, Ballroom 2
Keynote
  Amr El Abbadi (University of California - Santa Barbara)
Paper Session 1
  HotROD: Managing Grid Storage with On-Demand Replication
  Sriram Rao (Microsoft Research), Benjamin Reed (Osmeta Inc.), Adam Silberstein (Trifacta Inc.)
  Materialized Views for Eventually Consistent Record Stores
  Changjiu Jin, Rui Liu, Kenneth Salem (University of Waterloo)
  Packing Light: Portable Workload Performance Prediction for the Cloud
  Jennie Duggan (Brown University), Yun Chi, Hakan Hacıgümüş, Shenghuo Zhu (NEC Laboratories America), Ugur Çetintemel (Brown University)
Paper Session 2
  P-Mine: Parallel Itemset Mining on Large Datasets
  Elena Baralis, Tania Cerquitelli, Silvia Chiusano, Alberto Grand (Politecnico di Torino)
  Towards Dynamic Pricing-Based Collaborative Optimizations for Green Data Centers
  Yang Li (University of Pennsylvania), David Chiu (Washington State University), Changbin Liu (AT&T Labs-Research), Linh T.X. Phan, Tanveer Gill, Sanchit Aggarwal, Zhuoyao Zhang, Boon Thau Loo (University of Pennsylvania), David Maier (Portland State University), Bart McManus (Bonneville Power Administration - TOT/DITT2)
  ISP Business Models in Caching
  Jörn Künsemöller (University Paderborn), Nan Zhang (Aalto University), João Soares (Portugal Telecom Inovação)
Panel Discussion

ICDE 2013 PhD Symposium
9.00am - 5.00pm, Odeon

It's All About Data

Taming the Metadata Mess
V.M. Megler (Portland State University)
The rapid growth of scientific data shows no sign of abating. This growth has led to a new problem: with so much scientific data at hand, stored in thousands of datasets, how can scientists find the datasets most relevant to their research interests? We have addressed this problem by adapting Information Retrieval techniques, developed for searching text documents, to the world of (primarily numeric) scientific data.
We propose an approach that uses a blend of automated and "semi-curated" methods to extract metadata from large archives of scientific data, then evaluates ranked searches over this metadata. We describe a challenge identified during an implementation of our approach: the large and expanding list of environmental variables captured by the archive does not match the list of environmental variables in the minds of the scientists. We briefly characterize the problem and describe our initial thoughts on resolving it.

High Quality Information Provisioning and Data Pricing
Florian Stahl (University of Münster)
This paper presents ideas on how to advance research on high quality information provisioning and information pricing. To this end, the current state of the art in combining data curation and information provisioning, as well as in data pricing, is reviewed. Based on that, open issues, such as tailoring data to a user's needs and determining the market value of data, are identified. As a preliminary solution, it is proposed to investigate the identified problems in an integrated manner.

A Framework of Ontology Guided Data Linkage for Evidence Based Knowledge Extraction and Information Sharing
Mohammed Gollapalli (The University of Queensland)
There has been a surge of interest in developing probabilistic techniques for linking semantically equivalent datasets. The key objective is to transform the structure of the induced data into a concise synopsis. Current techniques primarily focus on performing pair-wise attribute matching and pay little attention to discovering direct and weighted correlations among ontological clusters through multi-faceted classification. In this paper, we introduce a novel Ontology Guided Data Linkage (OGDL) framework for self-organising and discovering schema structures through constructing hierarchical cluster mapping trees. Furthermore, we extend our OGDL framework by introducing a novel faceted search engine for semantic interoperability of data and subsequent decision support analysis, and use it to support fast cluster browsing, user-friendly querying and semantic reasoning needs.

It's Now About Database (and Other) Systems

On Answering Why and Why-not Questions in Databases
Md. Saiful Islam (Swinburne University of Technology)
There is a growing interest in allowing users to ask questions about received results in the hope of improving the usability of database systems. This research aims at answering the so-called why and why-not questions on received results w.r.t. different query settings in databases. The main goals of this research are: (i) studying the problem of answering why and why-not questions in databases; (ii) finding efficient strategies for answering these questions under different query settings; and (iii) finally, developing a framework that can take advantage of existing data indexing and query evaluation techniques to answer such questions in databases. We believe that this research can contribute towards improving the usability of traditional database systems.

Towards Elastic Key-value Stores on IaaS
Han Li (University of New South Wales)
Key-value stores such as Cassandra and HBase have gained popularity for their scalability and high availability in the face of heavy workloads and hardware failure.
Many enterprises are deploying applications backed by key-value stores on resources leased from Infrastructure as a Service (IaaS) providers. However, current key-value stores are unable to take full advantage of the resource elasticity provided by IaaS providers due to several challenges: i) achieving high performance of data access in virtualised environments; ii) load-rebalancing as the system scales up and down; and iii) the lack of autoscaling controllers. In this paper I present my research efforts on addressing these issues to provide an elastic key-value store deployed in IaaS environments.

User-Oriented Modelling of Scientific Workflows for High Frequency Event Data Analysis
Aarthi Natarajan (University of New South Wales)
Whether it is research scientists in computational physics, astronomy, environmental science, genomics or financial services, all these varying disciplines have been challenged by the analysis of Big Data. They are all required to perform multi-step analysis tasks to turn this data into actionable insight, from which critical decisions can be made. Two data processing models that have rapidly evolved in the past decade to support data analysts are Complex Event Processing and Scientific Workflows. Our research adds a new dimension to scientific workflows by extending them to incorporate the handling of event streams, and aims to provide a more efficient and faster approach to analyse vast amounts of data. Our model also aims at facilitating conceptual modelling of analytical processes, enabling domain experts to build abstract, exploratory analysis processes in a user-friendly manner without the concerns of the underlying technology, and transparently maps them to concrete implementations at run-time.

Short Presentations

Self-organizing Structured RDF in MonetDB
Minh-Duc Pham (Centrum Wiskunde & Informatica)
The semantic web uses RDF as its data model, providing ultimate flexibility for users to represent and evolve data without need of a schema. Yet, this flexibility poses challenges in implementing efficient RDF stores, leading to query plans with very many self-joins over a triple table, difficulties in optimizing these plans, and a lack of data locality, since without a notion of multi-attribute data structure, clustered indexing opportunities are lost. Apart from performance issues, users of huge RDF graphs often have problems formulating queries as they lack any system-supported notion of the structure in the data. In this research, we exploit the observation that real RDF data, while not as regularly structured as relational data, still has the great majority of triples conforming to regular patterns. We conjecture that a system that recognized this structure automatically would both allow RDF stores to become more efficient and make them easier to use. Concretely, we propose to derive self-organizing RDF that stores data in PSO format in such a way that the regular parts of the data physically correspond to relational columnar storage, and propose RDFscan/RDFjoin algorithms that compute star-patterns over these without wasting effort in self-joins. These regular parts, i.e. tables, are identified on ingestion by a schema discovery algorithm; as such, users gain an SQL view of the regular part of the RDF data. This research aims to produce a state-of-the-art SPARQL frontend for MonetDB as a by-product, and we already present some preliminary results on this platform.
E-Research Event Data Quality
Weisi Chen (University of New South Wales)
One of the most important data types e-Researchers use to conduct analysis processes is "event data", which records information about timed events in a particular domain. However, real-world event data is usually of poor quality, and large amounts of money and labour go into tackling the ensuing problems. Existing solutions to event data quality are very limited, mostly supporting only data quality in general without facilitating event pattern detection; existing event processing systems, on the other hand, are very inefficient in dealing with data quality issues. In this research, we have summarised the criteria for addressing event data quality issues and compared possible solutions, including knowledge-based systems and event processing systems. We conclude by proposing an approach that combines a rule-based system with an event processing system in a novel way.

Indexing and Querying Moving Objects in Indoor Spaces
Sultan Alamri (Monash University)
Spatial database indexes are basically designed to speed up retrievals where it is usually assumed that the objects of interest are static unless conspicuously updated. Capturing continuously moving objects in traditional spatial indexes therefore requires frequent updates of the locations of these objects. This paper outlines a PhD thesis that addresses the challenges of indexing moving objects in indoor spaces. The main goal of this thesis is to develop new indoor index structures for moving objects, focusing on the following four challenges: (1) introducing a query taxonomy for moving objects to illustrate the query types for moving-object databases; (2) introducing an adjacency index structure for moving objects in indoor spaces; (3) capturing both spatial and temporal properties in an indoor data structure; and (4) introducing an index structure for moving objects in indoor spaces that is based on a specific type of movement pattern.

Stock Prediction by Searching Similar Candlestick Charts
Zen-Yu Quan (National Central University, Taiwan)
This research applies the content-based image retrieval (CBIR) technique to stock prediction. In particular, low-level image features, including wavelet texture and Canny edges, are extracted from candlestick charts. Then, historical candlestick charts similar to the query chart under the low-level features are retrieved, and the 'future' stock movements of the retrieved charts are used to predict the stock price of the query chart.

News Recommendation Based on Web Usage and Web Content Mining
Husna Sarirah Husin (RMIT University)
In the last decade, online newspapers have become a viable alternative to conventional hardcopy papers. Many studies have shown that digital media have increased their share of the Internet audience. In this study, we use Web usage and Web content mining techniques to recommend news articles to users. We use Web server logs from a Malaysian newspaper, Berita Harian, combined with the Web content pages, to discover web users' navigational patterns. We plan to improve existing Web usage mining techniques for deriving user profiles and to find a novel way to combine the user profiles with Web content pages.
Making the H-index More Relevant: A Step Towards Standard Classes for Citation Classification
Mohammad Abdullatif (The University of Auckland)
The H-index is gaining popularity as a way of measuring the research impact of an academic's papers. However, it has been criticized because it gives all citations equal weight. Citation classification can address this criticism by categorising citations based on the purpose or function of the citation. An important element for performing citation classification is the presence of a standard set of classes (known as a classification scheme) to enable comparison between the accuracies of the different techniques currently used to perform citation classification. Such a standard scheme is not available, and we therefore aim to fill this gap by generating a citation classification scheme automatically. The scheme is generated by clustering 4 large datasets of sentences containing citations using X-means. The main contribution of this research is adapting the similarity distance between verbs extracted from the citation sentences using WordNet.

Moderated Discussion
How to Survive as a Ph.D. Student - The Do's and Don'ts
Contributions from Alan Fekete (University of Sydney), Johann Christoph Freytag (Humboldt University), Gottfried Vossen (University of Münster) and others.

DETAILED PROGRAM FOR TUESDAY 9 APRIL
Tuesday 9 April

Conference Opening
9 - 9:30AM, Ballroom 1 & 2. Chair: Beng Chin Ooi (National University of Singapore)

Keynote 1
9:30 - 10:30AM, Ballroom 1 & 2. Chair: Xiaofang Zhou (The University of Queensland)
Re-thinking the Performance of Information Processing Systems
Vishal Sikka (SAP AG)
Abstract: Recent advances in hardware and software technologies have enabled us to re-think how we architect databases to meet the demands of today's information systems. However, this makes existing performance evaluation metrics obsolete. In this paper, I describe SAP HANA, a novel, powerful database platform that leverages the availability of large main memory and massively parallel processors. Based on this, I propose a new, multi-dimensional performance metric that better reflects the value expected from today's complex information systems.
Bio: Dr. Vishal Sikka is a member of the Executive Board of SAP AG, heading technology and innovation for the company. Sikka has responsibility for technology and platform products, including database, especially the industry breakthrough in-memory database SAP HANA, as well as analytics, mobile, application platform, and middleware. He drives emerging technologies and advanced development for the next-generation technology platform, applications, and tools. He also oversees key technology partnerships, customer co-innovation, and incubation of emerging businesses. He has global responsibility for SAP Research, as well as academic and government relations. Sikka has been Chief Technology Officer of SAP since 2007, responsible for the overall technology, architecture, and product standards across the entire SAP product portfolio. He is the creator of the concept of "timeless software," which underpins SAP architecture and innovation strategy.
Sikka holds a Doctorate in Computer Science from Stanford University in California, and his experience includes research in Artificial Intelligence, Programming Models and Automatic Programming, as well as Information Management and Integration, at Stanford, at Xerox Palo Alto Labs, and as founder of two startup companies.

Research 1: Main Memory Databases
11AM - 12:30PM, Ballroom 1. Chair: Philippe Cudre-Mauroux (MIT)

CPU and Cache Efficient Management of Memory-Resident Databases
Holger Pirk (Centrum Wiskunde & Informatica), Florian Funke (Technische Universität München), Martin Grund (Hasso Plattner Institute), Thomas Neumann (Technische Universität München), Ulf Leser (Humboldt Universität zu Berlin), Stefan Manegold (Centrum Wiskunde & Informatica), Alfons Kemper (Technische Universität München), Martin Kersten (Centrum Wiskunde & Informatica)
Memory-Resident Database Management Systems (MRDBMS) have to be optimized for two resources: CPU cycles and memory bandwidth. To optimize for bandwidth in mixed OLTP/OLAP scenarios, the hybrid or Partially Decomposed Storage Model (PDSM) has been proposed. However, in current implementations, bandwidth savings achieved by partial decomposition come at increased CPU costs. To achieve the aspired bandwidth savings without sacrificing CPU efficiency, we combine partially decomposed storage with Just-in-Time (JiT) compilation of queries, thus eliminating CPU-inefficient function calls. Since existing cost-based optimization components are not designed for JiT-compiled query execution, we also develop a novel approach to cost modeling and subsequent storage layout optimization. Our evaluation shows that the JiT-based processor maintains the bandwidth savings of previously presented hybrid query processors but outperforms them by two orders of magnitude due to increased CPU efficiency.

Identifying Hot and Cold Data in Main-Memory Databases
Justin Levandoski (Microsoft Research), Per-Åke Larson (Microsoft Research), Radu Stoica (École Polytechnique Fédérale de Lausanne)
Main memories are becoming sufficiently large that most OLTP databases could be stored entirely in main memory, but this may not be the best solution. OLTP workloads typically exhibit skewed access patterns where some records are hot (frequently accessed) but many records are cold (infrequently or never accessed). It is more economical to store the coldest records on secondary storage such as flash. As a first step towards managing cold data in main-memory databases we investigate how to efficiently identify hot and cold data. We propose to log record accesses, possibly only a sample to reduce overhead, and perform offline analysis to estimate record access frequencies. We present four estimation algorithms based on exponential smoothing and experimentally evaluate their efficiency and accuracy. We find that exponential smoothing provides very accurate estimates and close-to-perfect classification. Our most efficient algorithm is able to analyze a log of 1B accesses in sub-second time on a workstation-class machine.
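To make the idea in the abstract above concrete, here is a minimal sketch of exponentially smoothed access-frequency estimation over a (possibly sampled) access log, followed by a simple hot/cold split. It illustrates the general technique only, not the paper's four algorithms; the slot-based log format, the smoothing constant alpha, and the hot-set fraction are all illustrative assumptions.

```python
from collections import defaultdict

def estimate_access_frequencies(access_log, alpha=0.05):
    """Exponentially smoothed per-record access-frequency estimates.

    access_log: iterable of (time_slot, record_id) pairs in time order,
    where an entry counts as one observed access in that integer slot.
    """
    est = defaultdict(float)   # record_id -> smoothed estimate
    last_slot = {}             # record_id -> slot of previous access
    for t, rid in access_log:
        prev = last_slot.get(rid)
        if prev is not None:
            idle = max(0, t - prev - 1)       # slots with no access: decay
            est[rid] *= (1 - alpha) ** idle
        est[rid] = alpha + (1 - alpha) * est[rid]  # blend in this access
        last_slot[rid] = t
    return dict(est)

def split_hot_cold(estimates, hot_fraction=0.1):
    """Label the top hot_fraction of records (by estimate) as hot."""
    ranked = sorted(estimates, key=estimates.get, reverse=True)
    cutoff = max(1, int(len(ranked) * hot_fraction))
    return set(ranked[:cutoff]), set(ranked[cutoff:])
```

For example, on the log [(0, 'a'), (1, 'b'), (2, 'a')] record 'a' ends with a higher estimate (about 0.095) than 'b' (0.05), since its access history is both denser and more recent.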
The Adaptive Radix Tree: ARTful Indexing for Main-Memory Databases
Viktor Leis, Alfons Kemper, Thomas Neumann (Technische Universität München)
Main memory capacities have grown up to a point where most databases fit into RAM. For main-memory database systems, index structure performance is a critical bottleneck. Traditional in-memory data structures like balanced binary search trees are not efficient on modern hardware, because they do not optimally utilize on-CPU caches. Hash tables, also often used for main-memory indexes, are fast but only support point queries. To overcome these shortcomings, we present ART, an adaptive radix tree (trie) for efficient indexing in main memory. Its lookup performance surpasses highly tuned, read-only search trees, while supporting very efficient insertions and deletions as well. At the same time, ART is very space efficient and solves the problem of excessive worst-case space consumption, which plagues most radix trees, by adaptively choosing compact and efficient data structures for internal nodes. Even though ART's performance is comparable to hash tables, it maintains the data in sorted order, which enables additional operations like range scan and prefix lookup.

Research 2: MapReduce Algorithms
11AM - 12:30PM, St Germaine. Chair: Shivnath Babu (Duke University)

Finding Connected Components in Map-Reduce in Logarithmic Rounds
Vibhor Rastogi (Google Inc.), Ashwin Machanavajjhala (Duke University), Laukik Chitnis, Anish Das Sarma (Google Inc.)
Given a large graph G = (V, E) with millions of nodes and edges, how do we compute its connected components efficiently? Recent work addresses this problem in map-reduce, where a fundamental trade-off exists between the number of map-reduce rounds and the communication of each round. Denoting by d the diameter of the graph and by n the number of nodes in the largest component, all prior techniques for map-reduce either require a linear, Θ(d), number of rounds, or a quadratic, Θ(n|V| + |E|), communication per round. We propose here two efficient map-reduce algorithms: (i) Hash-Greater-to-Min, a randomized algorithm based on PRAM techniques, requiring O(log n) rounds and O(|V| + |E|) communication per round, and (ii) Hash-to-Min, a novel algorithm, provably finishing in O(log n) iterations for path graphs. The proof technique used for Hash-to-Min is novel, but not tight, and it is actually faster than Hash-Greater-to-Min in practice. We conjecture that it requires 2 log d rounds and 3(|V| + |E|) communication per round, as demonstrated in our experiments. Using secondary sorting, a standard map-reduce feature, we scale Hash-to-Min to graphs with very large connected components. Our techniques for connected components can be applied to clustering as well. We propose a novel algorithm for agglomerative single linkage clustering in map-reduce. This is the first map-reduce algorithm for clustering in at most O(log n) rounds, where n is the size of the largest cluster. We show the effectiveness of all our algorithms through detailed experiments on large synthetic as well as real-world datasets.
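As a rough, single-machine illustration of the Hash-to-Min idea sketched in the abstract (each node repeatedly sends its whole known cluster to the smallest member it knows, and that smallest member's id to everyone else), here is a Python simulation of the rounds. The map/reduce structure is only mimicked with dictionaries; the function names and fixpoint test are illustrative, not the authors' implementation.

```python
from collections import defaultdict

def hash_to_min_round(clusters):
    """One simulated map-reduce round of Hash-to-Min.

    clusters: dict node -> set of nodes it currently knows about.
    'Map': each node sends its whole cluster to the minimum member,
    and the minimum member's id to every member.
    'Reduce': everything keyed to the same node is unioned.
    """
    emitted = defaultdict(set)
    for v, cluster in clusters.items():
        m = min(cluster)
        emitted[m] |= cluster
        for u in cluster:
            emitted[u].add(m)
    return dict(emitted)

def connected_components(adjacency):
    """adjacency: dict node -> set of neighbours.
    Returns dict node -> smallest node id in its component."""
    clusters = {v: {v} | set(ns) for v, ns in adjacency.items()}
    while True:
        nxt = hash_to_min_round(clusters)
        if nxt == clusters:  # fixpoint: labels have stabilized
            return {v: min(c) for v, c in clusters.items()}
        clusters = nxt

# Example: a path 1-2-3 plus an isolated node 4.
print(connected_components({1: {2}, 2: {1, 3}, 3: {2}, 4: set()}))
# -> {1: 1, 2: 1, 3: 1, 4: 4}
```

At the fixpoint, the minimum node of each component holds the full component while every other node holds only that minimum's id, which is exactly the small-state property that lets the distributed version keep per-round communication low.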
Enumerating Subgraph Instances Using Map-Reduce
Foto N. Afrati, Dimitris Fotakis (National Technical University of Athens), Jeffrey D. Ullman (Stanford University)
The theme of this paper is how to find all instances of a given "sample" graph in a larger "data graph", using a single round of map-reduce. For the simplest sample graph, the triangle, we improve upon the best known such algorithm. We then examine the general case, considering both the communication cost between mappers and reducers and the total computation cost at the reducers. To minimize communication cost, we exploit the techniques of Afrati and Ullman (TKDE 2011) for computing multiway joins (evaluating conjunctive queries) in a single map-reduce round. Several methods are shown for translating sample graphs into a union of conjunctive queries with as few queries as possible. We also address the matter of optimizing computation cost. Many serial algorithms are shown to be "convertible", in the sense that it is possible to partition the data graph, explore each partition in a separate reducer, and have the total computation cost at the reducers be of the same order as the computation cost of the serial algorithm.

Scalable Maximum Clique Computation Using MapReduce
Jingen Xiang, Cong Guo, Ashraf Aboulnaga (University of Waterloo)
We present a scalable and fault-tolerant solution for the maximum clique problem based on the MapReduce framework. The key contribution that enables us to effectively use MapReduce is a recursive partitioning method that partitions the graph into several subgraphs of similar size. After partitioning, the maximum cliques of the different partitions can be computed independently, and the computation is sped up using a branch and bound method. Our experiments show that our approach leads to good scalability, which is unachievable by other partitioning methods since they result in partitions of different sizes and hence lead to load imbalance. Our method is more scalable than an MPI algorithm, and is simpler and more fault tolerant.

Research 3: Time Travel in Databases
11AM - 12:30PM, Bastille 1. Chair: Robert Ikeda (Stanford University)

Ficklebase: Looking into the Future to Erase the Past
Sumeet Bajaj, Radu Sion (Stony Brook University)
It has become apparent that in the digital world data once stored is never truly deleted, even when such an expunction is desired either as a normal system function or for regulatory compliance purposes. Forensic analysis techniques are often successful at recovering information said to have been deleted in the past. Efforts aimed at thwarting such forensic analysis have either focused on (i) identifying the system components where deleted data lingers and performing a secure delete operation over these remnants, or (ii) designing history-independent data structures that hide information about the past operations which result in the current system state. Yet, new data is constantly derived by processing existing (input) data, which makes it increasingly difficult to remove all traces of this existing data, e.g., for regulatory compliance purposes. Even after deletion, significant information can linger in, and be recoverable from, the side effects the deleted data records left on the currently available state. In this paper we address this aspect in the context of a relational database, such that when combined with (i) and (ii), complete erasure of data and its effects can be achieved ("untraceable deletion"). We introduce Ficklebase, a relational database wherein once a tuple has been "expired", any and all of its side effects are removed, thereby eliminating all its traces, rendering it unrecoverable, and also guaranteeing that the deletion itself is undetectable. We present the design and evaluation of Ficklebase, and then discuss several of the fundamental functional implications of untraceable deletion.
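The core problem Ficklebase targets, that a deleted tuple's influence survives in derived state, is easy to demonstrate. The toy sketch below is entirely illustrative (it shows the leak, not Ficklebase's remedy): a running aggregate derived from base records continues to betray a record after the record itself is deleted.

```python
# Toy illustration of deletion side effects (not Ficklebase's design):
# state derived from a tuple can reveal that tuple after it is deleted,
# unless derived state is rewound as if the tuple had never existed.

deposits = [("alice", 100), ("alice", 250)]          # base relation
audit_total = sum(amount for _, amount in deposits)  # derived state: 350

# "Securely delete" the second deposit from the base relation only:
deposits.pop()

remaining = sum(amount for _, amount in deposits)    # 100
leaked = audit_total - remaining                     # 250: the erased tuple
print(f"recoverable trace of the deleted deposit: {leaked}")

# Untraceable deletion also requires re-deriving dependent state:
audit_total = sum(amount for _, amount in deposits)  # now 100, trace gone
```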
Time Travel in a Scientific Array Database
Emad Soroush, Magdalena Balazinska (University of Washington)

In this paper, we present TimeArr, a new storage manager for an array database. TimeArr supports the creation of a sequence of versions of each stored array and their exploration through two types of time-travel operations: selection of a specific version of a (sub)array, and a more general extraction of a (sub)array history in the form of a series of (sub)array versions. TimeArr contributes a combination of array-specific storage techniques to efficiently support these operations. To speed up array exploration, TimeArr further introduces two additional techniques. The first is the notion of approximate time travel, with two types of operations: approximate version selection and approximate history. For these operations, users can tune the degree of approximation tolerable and thus trade off accuracy and performance in a principled manner. The second is to lazily create short connections, called skip links, between the same (sub)arrays at different versions with similar data patterns, to speed up the selection of a specific version. We implement TimeArr within the SciDB array processing engine and demonstrate its performance through experiments on two real datasets from the astronomy and earth sciences domains.

Time Travel in Column Stores
Martin Kaufmann, Amin A. Manjili (ETH Zürich / SAP AG), Stefan Hildenbrand, Donald Kossmann (ETH Zürich), Andreas Tonder (SAP AG)

Recent studies have shown that column stores can outperform row stores significantly. This paper explores alternative approaches to extending column stores with versioning, i.e., time travel queries and the maintenance of historic data. On the one hand, adding versioning can actually simplify the design of a column store, because it provides a solution for the implementation of updates, traditionally a weak point in the design of column stores. On the other hand, implementing a versioned column store is challenging because it imposes a two-dimensional clustering problem: should the data be clustered by row or by version? This paper devises the details of three memory layouts: clustering by row, clustering by version, and hybrid clustering. Performance experiments demonstrate that all three approaches outperform a (traditional) versioned row store. The efficiency of the three memory layouts depends on the query and update workload. Furthermore, the performance experiments analyze the time-space tradeoff that can be made in the implementation of versioned column stores.
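To make the versioning problem concrete, here is a toy time-travel lookup over a versioned column, keeping one version-ordered update list per row (roughly in the spirit of clustering by row). It illustrates the query a versioned store must answer; it is not one of the paper's three layouts, and the class and method names are invented for this sketch.

```python
# A toy versioned column: "value of row r as of version v" by binary search.
import bisect

class VersionedColumn:
    def __init__(self):
        self.versions = {}   # row id -> ascending list of update versions
        self.values = {}     # row id -> values parallel to self.versions

    def write(self, row, version, value):
        # Assumes writes arrive in increasing version order per row.
        self.versions.setdefault(row, []).append(version)
        self.values.setdefault(row, []).append(value)

    def read_as_of(self, row, version):
        vs = self.versions.get(row, [])
        i = bisect.bisect_right(vs, version) - 1   # last update <= version
        return self.values[row][i] if i >= 0 else None

col = VersionedColumn()
col.write(row=1, version=10, value="a")
col.write(row=1, version=20, value="b")
print(col.read_as_of(row=1, version=15))   # -> "a"
```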
Research 4: Top-k Queries in Uncertain Data
11AM - 12:30PM, Chair: Wenjie Zhang (University of New South Wales), Bastille 2

Top-k Query Processing in Probabilistic Databases with Non-Materialized Views
Maximilian Dylla, Iris Miliaraki (Max Planck Institute for Informatics), Martin Theobald (University of Antwerp)

We investigate a novel approach to computing confidence bounds for top-k ranking queries in probabilistic databases with non-materialized views. Unlike related approaches, we present an exact pruning algorithm for finding the top-ranked query answers according to their marginal probabilities, without the need to first materialize all answer candidates via the views. Specifically, we consider conjunctive queries over multiple levels of select-project-join views, the latter of which are cast into Datalog rules that we ground in a top-down fashion directly at query processing time. To our knowledge, this work is the first to address integrated data and confidence computations for intensional query evaluations in the context of probabilistic databases by considering confidence bounds over first-order lineage formulas. We extend our query processing techniques with a tool suite of scheduling strategies based on selectivity estimation and the expected impact on confidence bounds. Further extensions include improved top-k bounds when sorted relations are available as input, as well as the consideration of recursive rules. Experiments with large datasets demonstrate significant runtime improvements of our approach compared to both exact and sampling-based top-k methods over probabilistic data.

Cleaning Uncertain Data for Top-k Queries
Luyi Mo, Reynold Cheng, Xiang Li, David W. Cheung, Xuan S. Yang (The University of Hong Kong)

The information managed in emerging applications, such as sensor networks, location-based services, and data integration, is inherently imprecise. To handle data uncertainty, probabilistic databases have recently been developed. In this paper, we study how to quantify the ambiguity of answers returned by a probabilistic top-k query. We develop efficient algorithms to compute the quality of this query under the possible-world semantics. We further address the cleaning of a probabilistic database, in order to improve top-k query quality. Cleaning involves the reduction of ambiguity associated with the database entities. For example, the uncertainty of a temperature value acquired from a sensor can be reduced, or cleaned, by requesting its newest value from the sensor. While this cleaning operation may produce a better query result, it may involve a cost and may fail. We investigate the problem of selecting entities to be cleaned under a limited budget. In particular, we propose an optimal solution and several heuristics. Experiments show that the greedy algorithm is efficient and close to optimal.

Top-K Oracle: A New Way to Present Top-K Tuples for Uncertain Data
Chunyao Song, Zheng Li, Tingjian Ge (University of Massachusetts, Lowell)

Managing noisy and uncertain data is required in a great number of modern applications. A major difficulty in managing such data is the sheer number of query result tuples with diverse probabilities. In many cases, users have a preference over the tuples in a deterministic world, determined by a scoring function. Yet it has been a challenging problem to return the top-k for uncertain data. Various semantics have been proposed, and they have been shown to give wildly different tuple rankings. In this paper, we propose a completely different approach. Instead of returning k tuples to users, which are merely one point in the complex distribution of top-k tuple vectors, we provide a so-called top-k oracle that users can query arbitrarily. Intuitively, an oracle is a black box that, whenever given an SQL query, returns its result. Any information we give is based on faithful, best-effort estimates of the ground-truth top-k tuples. This is especially critical in emergency response applications and in top-k monitoring applications. Furthermore, we are the first to provide nested query capability with the uncertain top-k result as a subquery. We devise various query processing algorithms for top-k oracles, and verify their efficiency and accuracy through a systematic evaluation over real-world and synthetic datasets.
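The papers in this session reason under the possible-world semantics, which a few lines of code can make tangible. Assuming mutually independent tuples with fixed scores, a tuple's top-k probability is the fraction of possible worlds in which it ranks among the k highest-scoring surviving tuples; the naive Monte Carlo estimator below is for intuition only, since the papers above compute or bound such probabilities without brute-force sampling.

```python
# Naive Monte Carlo estimate of per-tuple top-k probabilities.
import random

def topk_probabilities(tuples, k, trials=10_000, seed=0):
    """tuples: list of (name, score, existence probability)."""
    rng = random.Random(seed)
    hits = {name: 0 for name, _, _ in tuples}
    for _ in range(trials):
        # Sample one possible world: each tuple exists independently.
        world = [(score, name) for name, score, p in tuples
                 if rng.random() < p]
        for _, name in sorted(world, reverse=True)[:k]:
            hits[name] += 1
    return {name: count / trials for name, count in hits.items()}

data = [("t1", 0.9, 0.5), ("t2", 0.8, 0.9), ("t3", 0.7, 0.8)]
print(topk_probabilities(data, k=2))
```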
Seminar 1: Machine Learning on Big Data
11AM - 12:30PM, Ballroom 2
Tyson Condie, Paul Mineiro (Microsoft Research, USA), Neoklis Polyzotis (University of California, Santa Cruz), Markus Weimer (Microsoft Research, USA)

Statistical machine learning has undergone a phase transition from a purely academic endeavor to being one of the main drivers of modern commerce and science. Moreover, recent results, such as those on tera-scale learning and on very large neural networks, suggest that scale is an important ingredient in model quality. This tutorial introduces current applications, techniques and systems with the aim of cross-fertilizing research between the database and machine learning communities. The tutorial covers current large-scale applications of machine learning, their computational model, and the workflow behind building them. Based on this foundation, we present the current state of the art in systems support in the bulk of the tutorial. We also identify critical gaps in the state of the art. This leads to the closing of the seminar, where we introduce two sets of open research questions: better systems support for the already established use cases of machine learning, and support for recent advances in machine learning research.

Seminar 2: Big Data Integration
11AM - 12:30PM, Odeon
Xin Luna Dong, Divesh Srivastava (AT&T Labs-Research)

The Big Data era is upon us: data is being generated, collected and analyzed at an unprecedented scale, and data-driven decision making is sweeping through all aspects of society. Since the value of data explodes when it can be linked and fused with other data, addressing the big data integration (BDI) challenge is critical to realizing the promise of Big Data. BDI differs from traditional data integration in many dimensions: (i) the number of data sources, even for a single domain, has grown to be in the tens of thousands; (ii) many of the data sources are very dynamic, as huge amounts of newly collected data are continuously made available; (iii) the data sources are extremely heterogeneous in their structure, with considerable variety even for substantially similar entities; and (iv) the data sources are of widely differing qualities, with significant differences in the coverage, accuracy and timeliness of the data provided. This seminar explores the progress that has been made by the data integration community on the topics of schema mapping, record linkage and data fusion in addressing these novel challenges of big data integration, and identifies a range of open problems for the community.

Industry 1
11AM - 12:30PM, Chair: Stelios Paparizos (Microsoft), Concorde

Invited Talk: Big Data Analytics at Facebook
Ravi Murthy (Facebook)

The data analytics infrastructure at Facebook has evolved rapidly over the last few years. The amount of data managed by the platform has grown dramatically, as has the need to analyze the data in different ways. Several large-scale systems have been developed to ingest and crunch petabytes of data, and turn them into insights and measurements that are used to build products and services for Facebook's 1 billion active users. This talk will cover the overall architecture and the design of specialized systems used for batch analytics, interactive real-time analysis and graph analytics.
Each of these systems is designed with unique tradeoffs, but they integrate to provide a comprehensive analytics platform. We will discuss the challenges faced and lessons learnt while growing these systems to unprecedented scale: hundreds of petabytes, thousands of machines. We will also present current challenges and opportunities that can help drive research and innovation for the next generation of big data platforms.

Invited Paper: Data Services for E-tailers Leveraging Search Engine Assets
Tao Cheng, Kaushik Chakrabarti, Surajit Chaudhuri, Vivek Narasayya, Manoj Syamala (Microsoft Research)

Retail is increasingly moving online. There are only a few big e-tailers, but there is a long tail of small e-tailers. The big e-tailers are able to collect significant data on user activities at their websites. They use these assets to derive insights about their products and to provide superior experiences for their users. Small e-tailers, on the other hand, do not possess such user data and hence cannot match the rich user experiences offered by big e-tailers. Our key insight is that web search engines possess significant data on user behavior that can be used to help smaller e-tailers mine the same signals that big e-tailers derive from their proprietary user data assets. These signals can be exposed as data services in the cloud; e-tailers can leverage them to enable user experiences similar to those of the big e-tailers. We present three such data services in the paper: an entity synonym data service, a query-to-entity data service and an entity tagging data service. The entity synonym service is an in-production data service that is currently available, while the other two are currently in development at Microsoft. Our experiments on product datasets show that (i) these data services have high quality and (ii) they have significant impact on user experiences on e-tailer websites. To the best of our knowledge, this is the first paper to explore the potential of using search engine data assets for e-tailers.

Invited Paper: SAP HANA Distributed In-Memory Database System: Transaction, Session, and Metadata Management
Juchang Lee, Yong Sik Kwon (SAP Labs, Korea), Franz Färber, Michael Muehle (SAP AG), Chulwon Lee (SAP Labs, Korea), Christian Bensberg (SAP AG), Joo Yeon Lee (SAP Labs, Korea), Arthur H. Lee (Claremont McKenna College / SAP Labs, Korea), Wolfgang Lehner (Dresden University of Technology / SAP AG)

One of the core principles of the SAP HANA database system is comprehensive support for distributed query processing. Supporting scale-out scenarios was one of the major design principles of the system from the very beginning. In this paper, we first give an overview of the overall functionality with respect to data allocation, metadata caching and query routing. We then go into some detail on specific topics and explain features and methods not common in traditional disk-based database systems. In summary, the paper provides a comprehensive overview of distributed query processing in the SAP HANA database, which achieves the scalability to handle large databases and heterogeneous types of workloads.
Research 5: Uncertainty in Spatial Data
2 - 3:30PM, Chair: Wei Wang (University of New South Wales), Ballroom 1

Voronoi-based Nearest Neighbor Search for Multi-Dimensional Uncertain Databases
Peiwu Zhang, Reynold Cheng, Nikos Mamoulis (The University of Hong Kong), Matthias Renz, Andreas Züfle (Ludwig-Maximilians-Universität München), Yu Tang (The University of Hong Kong), Tobias Emrich (Ludwig-Maximilians-Universität München)

In Voronoi-based nearest neighbor search, the Voronoi cell of every point p in a database can be used to check whether p is the closest to some query point q. We extend the notion of Voronoi cells to support uncertain objects, whose attribute values are inexact. In particular, we propose the possible Voronoi cell (or PV-cell). A PV-cell of a multi-dimensional uncertain object o is a region R such that for any point p ∈ R, o may be the nearest neighbor of p. If the PV-cells of all objects in a database S are known, they can be used to identify objects that have a chance of being the nearest neighbor of q. However, there is no efficient algorithm for computing an exact PV-cell. We hence study how to derive an axis-parallel hyper-rectangle (called the Uncertain Bounding Rectangle, or UBR) that tightly contains a PV-cell. We further develop the PV-index, a structure that stores UBRs, to evaluate probabilistic nearest neighbor queries over uncertain data. An advantage of the PV-index is that upon updates to S, it can be incrementally updated. Extensive experiments on both synthetic and real datasets are carried out to validate the performance of the PV-index.
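The pruning that bounding rectangles enable can be sketched with standard min/max distance bounds: an uncertain object, conservatively summarised by a rectangle, can be the nearest neighbour of q only if its minimum possible distance to q does not exceed the smallest maximum distance over all objects. This is a generic illustration of that filter step under 2D Euclidean distance, not the PV-index itself; the names are invented.

```python
# Candidate filtering for probabilistic NN queries via min/max distances.
import math

def min_dist(q, rect):
    (lo_x, lo_y), (hi_x, hi_y) = rect
    dx = max(lo_x - q[0], 0.0, q[0] - hi_x)
    dy = max(lo_y - q[1], 0.0, q[1] - hi_y)
    return math.hypot(dx, dy)

def max_dist(q, rect):
    (lo_x, lo_y), (hi_x, hi_y) = rect
    dx = max(abs(q[0] - lo_x), abs(q[0] - hi_x))
    dy = max(abs(q[1] - lo_y), abs(q[1] - hi_y))
    return math.hypot(dx, dy)

def nn_candidates(q, rects):
    # No object can be the NN if even its best case is worse than some
    # other object's worst case.
    bound = min(max_dist(q, r) for r in rects.values())
    return [o for o, r in rects.items() if min_dist(q, r) <= bound]

objs = {"a": ((0, 0), (1, 1)), "b": ((5, 5), (6, 6)), "c": ((1, 1), (2, 2))}
print(nn_candidates((0.5, 0.5), objs))   # "b" is pruned
```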
Interval Reverse Nearest Neighbor Queries on Uncertain Data with Markov Correlations
Chuanfei Xu, Yu Gu (Northeastern University), Lei Chen (Hong Kong University of Science and Technology), Jianzhong Qiao, Ge Yu (Northeastern University)

Nowadays, many applications return to the user a set of results that take the query as their nearest neighbor; such results are commonly expressed through reverse nearest neighbor (RNN) queries. When considering moving objects, users would like to find objects that appear in the RNN result set for a period of time, in real-world applications such as collaboration recommendation and anti-tracking. In this work, we formally define the problem of interval reverse nearest neighbor (IRNN) queries over moving objects, which return the objects that maintain nearest-neighbor relations to the moving query object for the longest time in a given interval. Location uncertainty of moving data objects and moving query objects is inherent in various domains, and we investigate objects that exhibit Markov correlations, that is, each object's location is correlated only with its own location at the previous timestamp, while being independent of other objects. Answering IRNN queries on uncertain moving objects with Markov correlations poses an efficiency challenge, since we have to retrieve not only all the possible locations of each object at the current time but also its possible historical locations. To speed up query processing, we present a general two-phase framework for answering IRNN queries on uncertain moving objects with Markov correlations. In the first phase, we apply space-pruning and probability-pruning techniques, which reduce the search space significantly. In the second phase, we verify whether each unpruned object is an IRNN of the query object. During this phase, we propose the Probability Decomposition Verification (PDV) algorithm, which avoids computing exactly the probability of any object being an RNN of the query object and thus improves the efficiency of verification. The performance of the proposed algorithm is demonstrated by extensive experiments on synthetic and real datasets; the results show that our algorithm is more efficient than a Monte Carlo-based approximate algorithm.

Efficient Tracking and Querying for Coordinated Uncertain Mobile Objects
Nicholas D. Larusso, Ambuj Singh (University of California Santa Barbara)

Accurately estimating the current positions of moving objects is a challenging task due to the various forms of data uncertainty (e.g., limited sensor precision, periodic updates from continuously moving objects). However, in many cases, groups of objects tend to exhibit similarities in their movement behavior. For example, vehicles in a convoy and animals in a herd both exhibit tightly coupled movement within the group. While such statistical dependencies often increase the computational complexity necessary for capturing this additional structure, they also provide useful information which can be utilized to provide more accurate location estimates. In this paper, we propose a novel model for accurately tracking coordinated groups of mobile uncertain objects. We introduce an exact and a more efficient approximate inference algorithm for updating the current location of each object upon the arrival of new (uncertain) location observations. Additionally, we derive probability bounds over the groups in order to process probabilistic threshold range queries more efficiently. Our experimental evaluation shows that our proposed model can provide 4x improvements in tracking accuracy over competing models which do not consider group behavior. We also show that our bounds enable us to prune up to 50% of the database, resulting in more efficient processing than a linear scan.

Research 6: Data Extraction
2 - 3:30PM, Chair: Luna Dong (Google Inc.), St Germaine

Attribute Extraction and Scoring: A Probabilistic Approach
Taesung Lee (Pohang University of Science and Technology / Microsoft Research Asia), Zhongyuan Wang (Renmin University of China / Microsoft Research Asia), Haixun Wang (Microsoft Research Asia), Seung-won Hwang (Pohang University of Science and Technology)

Knowledge bases, which consist of concepts, entities, attributes and relations, are increasingly important in a wide range of applications. We argue that knowledge about attributes (of concepts or entities) plays a critical role in inferencing. In this paper, we propose methods to derive attributes for millions of concepts, and we quantify the typicality of the attributes with regard to their corresponding concepts. We employ multiple data sources, such as web documents, search logs, and existing knowledge bases, and we derive typicality scores for attributes by aggregating different distributions derived from different sources using different methods. To the best of our knowledge, ours is the first approach to integrate concept- and instance-based patterns into probabilistic typicality scores that scale to a broad concept space. We have conducted extensive experiments to show the effectiveness of our approach.
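The aggregation idea is easy to illustrate. A hedged sketch, not the paper's actual estimator: each source contributes a distribution over attributes for a concept, and a typicality score is a weighted combination of those distributions. The sources, weights and numbers below are made up.

```python
# Combine per-source attribute distributions into typicality scores.
def typicality(distributions, weights):
    """distributions: source -> {attribute: P(attribute | concept)}."""
    score = {}
    for source, dist in distributions.items():
        for attr, p in dist.items():
            score[attr] = score.get(attr, 0.0) + weights[source] * p
    return dict(sorted(score.items(), key=lambda kv: -kv[1]))

sources = {
    "web_docs":   {"population": 0.5, "area": 0.3, "mayor": 0.2},
    "search_log": {"population": 0.6, "weather": 0.4},
}
print(typicality(sources, {"web_docs": 0.5, "search_log": 0.5}))
```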
TYPifier: Inferring the Type Semantics of Structured Data
Yongtao Ma, Thanh Tran (Karlsruhe Institute of Technology), Veli Bicer (IBM Research Ireland)

Structured data representing entity descriptions often lacks precise type information. That is, it is not known to which type an entity belongs, or the type is too general to be useful. In this work, we propose to deal with this novel problem of inferring the type semantics of structured data, called typification. We formulate it as a clustering problem and discuss the features needed to obtain several solutions based on existing clustering techniques. Because schema features perform best but are not abundantly available, we propose an approach to automatically derive them from data. Optimized for the use of schema features, we present TYPifier, a novel clustering algorithm that, in experiments, yields better typification results than the baseline clustering solutions. For entity resolution, one of the possible use cases, we show that the inferred type information helps to produce better results.

SUSIE: Search Using Services and Information Extraction
Nicoleta Preda (University of Versailles), Fabian Suchanek (Max Planck Institute for Informatics), Wenjun Yuan (University of Versailles / The University of Hong Kong), Gerhard Weikum (Max Planck Institute for Informatics)

The API of a Web service restricts the types of queries that the service can answer. For example, a Web service might provide a method that returns the songs of a given singer, but it might not provide a method that returns the singers of a given song. If the user asks for the singer of some specific song, then the Web service cannot be called, even though the underlying database might have the desired piece of information. This asymmetry is an inherent limitation for systems that aim to use Web services for service orchestration, query answering, or ontology extension. In this paper, we propose to use on-the-fly information extraction to collect values that can be used as parameter bindings for the Web service. As a result, a Web service that returns songs can be used "backwards" to find the singers. Our approach is fully implemented in a prototype called SUSIE. We present experiments with real-life data and services to demonstrate the practical viability and good performance of our approach.
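SUSIE's "backwards" use of a service can be sketched in a few lines. Assuming the only available call returns the songs of a given singer, an inverse query is answered by extracting candidate singers from text and validating each candidate through the service; the function names, the toy extractor and the toy service below are all hypothetical stand-ins.

```python
# Answer "who sings X?" with only a songs_by_singer() call available.
def singers_of(song, extract_candidates, songs_by_singer):
    """extract_candidates: information extraction supplying guesses;
    songs_by_singer: the one method the Web service actually offers."""
    return [singer for singer in extract_candidates(song)
            if song in songs_by_singer(singer)]   # validate each guess

corpus = {"Imagine": ["John Lennon", "Elton John"]}   # toy extractor data
service = {"John Lennon": ["Imagine", "Mother"],      # toy service data
           "Elton John": ["Rocket Man"]}
print(singers_of("Imagine",
                 lambda s: corpus.get(s, []),
                 lambda a: service.get(a, [])))       # -> ['John Lennon']
```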
Research 7: Trajectory Databases
2 - 3:30PM, Chair: Yin Yang (Advanced Digital Sciences Center), Bastille 1

Towards Efficient Search for Activity Trajectories
Kai Zheng (The University of Queensland), Shuo Shang (Aalborg University), Nicholas Jing Yuan (Microsoft Research Asia), Yi Yang (Carnegie Mellon University)

The advances in location positioning and wireless communication technologies have led to a myriad of spatial trajectories representing the mobility of a variety of moving objects. While processing trajectory data with a focus on spatio-temporal features has been widely studied in the last decade, the recent proliferation of location-based web applications (e.g., Foursquare, Facebook) has given rise to large amounts of trajectories associated with activity information, called activity trajectories. In this paper, we study the problem of efficient similarity search on activity trajectory databases. Given a sequence of query locations, each associated with a set of desired activities, an activity trajectory similarity query (ATSQ) returns the k trajectories that cover the query activities and yield the shortest minimum match distance. An order-sensitive activity trajectory similarity query (OATSQ) is also proposed to take into account the order of the query locations. To process the queries efficiently, we first develop a novel hybrid grid index, GAT, to organize the trajectory segments and activities hierarchically, which enables us to prune the search space by location proximity and activity containment simultaneously. In addition, we propose algorithms for efficient computation of the minimum match distance and the minimum order-sensitive match distance, respectively. The results of our extensive empirical studies based on real online check-in datasets demonstrate that our proposed index and methods achieve superior performance and good scalability.

On Discovery of Gathering Patterns from Trajectories
Kai Zheng (The University of Queensland), Yu Zheng, Nicholas Jing Yuan (Microsoft Research Asia), Shuo Shang (Aalborg University)

The increasing pervasiveness of location-acquisition technologies has enabled the collection of huge amounts of trajectories for almost any kind of moving object. Discovering useful patterns from their movement behaviours can convey valuable knowledge to a variety of critical applications. In this light, we propose a novel concept, called gathering, a trajectory pattern modelling various group incidents such as celebrations, parades, protests, traffic jams and so on. A key observation is that these incidents typically involve large congregations of individuals, which form durable and stable areas with high density. Since the process of discovering gathering patterns over large-scale trajectory databases can be quite lengthy, we further develop a set of well-thought-out techniques to improve the performance. These techniques, including effective indexing structures, fast pattern detection algorithms implemented with bit vectors, and incremental algorithms for handling new trajectory arrivals, collectively constitute an efficient solution for this challenging task. Finally, the effectiveness of the proposed concepts and the efficiency of the approaches are validated by extensive experiments based on a real taxicab trajectory dataset.

Destination Prediction by Sub-Trajectory Synthesis and Privacy Protection Against Such Prediction
Andy Yuan Xue, Rui Zhang (The University of Melbourne), Yu Zheng, Xing Xie (Microsoft Research Asia), Jin Huang, Zhenghua Xu (The University of Melbourne)

Destination prediction is an essential task for many emerging location-based applications, such as recommending sightseeing places and targeted advertising based on destination. A common approach to destination prediction is to derive the probability of a location being the destination based on historical trajectories. However, existing techniques using this approach suffer from the "data sparsity problem": the available historical trajectories are far from able to cover all possible trajectories. This problem considerably limits the number of query trajectories that can obtain predicted destinations. We propose a novel method named the Sub-Trajectory Synthesis (SubSyn) algorithm to address the data sparsity problem.
The SubSyn algorithm first decomposes historical trajectories into sub-trajectories comprising two neighbouring locations, and then connects the sub-trajectories into "synthesised" trajectories. The number of query trajectories that can have predicted destinations is exponentially increased by this means. Experiments based on real datasets show that the SubSyn algorithm can predict destinations for up to ten times more query trajectories than a baseline algorithm, while the SubSyn prediction algorithm runs over two orders of magnitude faster than the baseline algorithm. In this paper, we also consider the privacy protection issue in case an adversary uses the SubSyn algorithm to derive sensitive location information about users. We propose an efficient algorithm to select the minimum number of locations a user has to hide on her trajectory in order to avoid privacy leaks. Experiments also validate the high efficiency of the privacy protection algorithm.

Research 8: Social Networks
2 - 3:30PM, Chair: Yuanyuan Tian (IBM Almaden), Bastille 2

Scalable and Parallelizable Processing of Influence Maximization for Large-Scale Social Networks
Jinha Kim, Seung-Keol Kim, Hwanjo Yu (Pohang University of Science and Technology)

As social network services connect people across the world, influence maximization, i.e., finding the most influential nodes (or individuals) in the network, is being actively researched, with applications to viral marketing. One crucial challenge in scalable influence maximization processing is evaluating influence, which is #P-hard and thus hard to solve in polynomial time. We propose a scalable influence approximation algorithm, the Independent Path Algorithm (IPA), for the Independent Cascade (IC) diffusion model. IPA efficiently approximates influence by considering an independent influence path as an influence evaluation unit. IPA is also easily parallelized, by simply adding a few lines of OpenMP meta-programming expressions. Moreover, the overhead of maintaining influence paths in memory is relieved by safely throwing away insignificant influence paths. Extensive experiments conducted on large-scale real social networks show that IPA is an order of magnitude faster and uses less memory than the state-of-the-art algorithms. Our experimental results also show that the parallel versions of IPA speed up further as the number of CPU cores increases, and that more speed-up is achieved for larger datasets. The algorithms have been implemented in our demo application for influence maximization (available at http://dm.postech.ac.kr/ipa demo), which efficiently finds the most influential nodes in a social network.
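The influence-path idea can be sketched compactly. In this hedged, sequential illustration (not the authors' parallel implementation), a seed's influence on a node is approximated by summing, over simple paths, the product of edge propagation probabilities, and paths whose probability falls below a threshold are discarded, mirroring the paper's pruning of insignificant paths. The graph and threshold are illustrative.

```python
# Path-based influence approximation under the IC model, with pruning.
def influence(graph, seed, theta=0.01):
    """graph: node -> {neighbor: propagation probability}."""
    total = {}
    def walk(node, prob, visited):
        for nxt, p in graph.get(node, {}).items():
            q = prob * p
            if nxt in visited or q < theta:
                continue               # drop insignificant influence paths
            total[nxt] = total.get(nxt, 0.0) + q
            walk(nxt, q, visited | {nxt})
    walk(seed, 1.0, {seed})
    return total

g = {"a": {"b": 0.5, "c": 0.2}, "b": {"c": 0.4}}
print(influence(g, "a"))   # b: 0.5, c: 0.2 + 0.5 * 0.4 = 0.4
```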
SociaLite: Datalog Extensions for Efficient Social Network Analysis
Jiwon Seo, Stephen Guo, Monica S. Lam (Stanford University)

With the rise of social networks, large-scale graph analysis becomes increasingly important. Because SQL lacks the expressiveness and performance needed for graph algorithms, lower-level, general-purpose languages are often used instead. For greater ease of use and efficiency, we propose SociaLite, a high-level graph query language based on Datalog. As a logic programming language, Datalog allows many graph algorithms to be expressed succinctly. However, its performance has not been competitive when compared to low-level languages. With SociaLite, users can provide high-level hints on the data layout and evaluation order; they can also define recursive aggregate functions which, as long as they are meet operations, can be evaluated incrementally and efficiently (see the sketch after this session's listings). We evaluated SociaLite by running eight graph algorithms (shortest paths, PageRank, hubs and authorities, mutual neighbors, connected components, triangles, clustering coefficients, and betweenness centrality) on two real-life social graphs, LiveJournal and Last.fm. The optimizations proposed in this paper speed up almost all the algorithms by 3 to 22 times. SociaLite even outperforms typical Java implementations by an average of 50% for the graph algorithms tested. When compared to highly optimized Java implementations, SociaLite programs are an order of magnitude more succinct and easier to write. Its performance is competitive, giving up only 16% for the largest benchmark. Most importantly, being a query language, SociaLite enables many more users who are not proficient in software engineering to make social network queries easily and efficiently.

LinkProbe: Probabilistic Inference on Large-Scale Social Networks
Haiquan Chen (Valdosta State University), Wei-Shinn Ku (Auburn University), Haixun Wang (Microsoft Research Asia), Liang Tang (Auburn University), Min-Te Sun (National Central University, Taiwan)

As one of the most important Semantic Web applications, social network analysis has attracted increasing interest from researchers due to the rapidly growing availability of massive social network data. A desirable solution for social network analysis should address the following issues. First, in many real-world applications, inference rules are partially correct; an ideal solution should be able to handle partially correct rules. Second, applications in practice often involve large amounts of data; the inference mechanism should scale up to large-scale data. Third, the inference method should take into account probabilistic evidence data, because these domains abound with uncertainty. Various solutions for social network analysis have been around for quite a few years; however, none of them supports all the aforementioned features. In this paper, we design and implement LinkProbe, a prototype to quantitatively predict the existence of links among nodes in large-scale social networks, which is empowered by Markov Logic Networks (MLNs). MLNs have been proven to be an effective inference model that can handle complex dependencies and partially correct rules. More importantly, although MLNs have shown acceptable performance in prior work, they are also reported to be impractical in handling large-scale data due to their highly demanding nature in terms of inference time and memory consumption. In order to overcome these limitations, LinkProbe retrieves the k-backbone graphs and conducts the MLN inference on both the most globally influencing nodes and the most locally related nodes. Our extensive experiments show that LinkProbe manages to provide a tunable balance between MLN inference accuracy and inference efficiency.
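Returning to SociaLite's recursive aggregates: a meet operation such as min can be evaluated incrementally because a newly derived fact can only improve the aggregate, never invalidate earlier results. The plain-Python sketch below (not SociaLite's Datalog syntax) evaluates shortest paths semi-naively, revisiting only nodes whose distance just improved; it assumes the graph has no negative cycles.

```python
# Semi-naive evaluation of a recursive min (meet) aggregate: shortest paths.
def shortest_paths(edges, source):
    """edges: list of (u, v, weight); returns node -> distance from source."""
    dist = {source: 0}
    frontier = {source}
    while frontier:                    # only re-derive from changed facts
        nxt = set()
        for u, v, w in edges:
            if u in frontier:
                d = dist[u] + w
                if d < dist.get(v, float("inf")):
                    dist[v] = d        # the meet (min) only ever improves
                    nxt.add(v)
        frontier = nxt
    return dist

print(shortest_paths([("a", "b", 1), ("b", "c", 2), ("a", "c", 9)], "a"))
# -> {'a': 0, 'b': 1, 'c': 3}
```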
Seminar 3: Workload Management for Big Data Analytics
2 - 3:30PM, Odeon
Ashraf Aboulnaga (University of Waterloo), Shivnath Babu (Duke University)

Parallel database systems and MapReduce systems (most notably Hadoop) are essential components of today's infrastructure for Big Data analytics. These systems process multiple concurrent workloads consisting of complex user requests, where each request is associated with an (explicit or implicit) service-level objective. For example, the workload of a particular user or application may have a higher priority than other workloads, or a particular workload may have strict deadlines for the completion of its requests. The research area of workload management focuses on ensuring that the system meets the service-level objectives of the various requests while minimizing the resources required to achieve this goal. At a high level, workload management can be viewed as looking beyond the performance of an individual request to the performance of an entire workload consisting of multiple requests. This tutorial will discuss the fundamentals of workload management, and present tools and techniques for workload management in parallel databases and MapReduce.

Industry 2
2 - 3:30PM, Chair: Hakan Hacıgümüş (NEC Laboratories America), Concorde

Invited Paper: HFMS: Managing the Lifecycle and Complexity of Hybrid Analytic Data Flows
Alkis Simitsis, Kevin Wilkinson, Umeshwar Dayal, Meichun Hsu (Hewlett-Packard Laboratories)

To remain competitive, enterprises are evolving their business intelligence systems to provide dynamic, near real-time views of business activities. To enable this, they deploy complex workflows of analytic data flows that access multiple storage repositories and execution engines and that span the enterprise and even beyond it. We call these multi-engine flows hybrid flows. Designing and optimizing hybrid flows is a challenging task. Managing a workload of hybrid flows is even more challenging, since their execution engines are likely under different administrative domains and there is no single point of control. To address these needs, we present a Hybrid Flow Management System (HFMS). It is an independent software layer over a number of independent execution engines and storage repositories. It simplifies the design of analytic data flows and includes optimization and executor modules to produce optimized executable flows that can run across multiple execution engines. HFMS dispatches flows for execution and monitors their progress. To meet service-level objectives for a workload, it may dynamically change a flow's execution plan to avoid processing bottlenecks in the computing infrastructure. We present the architecture of HFMS and describe its components. To demonstrate its potential benefit, we describe performance results for running sample batch workloads with and without HFMS. The ability to monitor multiple execution engines and to dynamically adjust plans enables HFMS to provide better service guarantees and better system utilization.

Invited Paper: KuaFu: Closing the Parallelism Gap in Database Replication
Chuntao Hong (Microsoft Research Asia), Dong Zhou (Tsinghua University), Mao Yang (Microsoft Research Asia), Carbo Kuo (Tsinghua University), Lintao Zhang, Lidong Zhou (Microsoft Research Asia)

Database systems are nowadays increasingly deployed on multi-core commodity servers, with replication to guard against failures. On the one hand, a database engine is best designed to scale with the number of cores and to offer a high degree of parallelism on a modern multi-core architecture. On the other hand, replication traditionally resorts to a certain form of serialization for data consistency among replicas.
In the widely used primary/backup replication with log shipping, concurrent execution on the primary and serialized log replay on a backup create a serious parallelism gap. Our experiments with MySQL, a popular open-source database system, show that on a 16-core configuration the serial replay on a backup can sustain less than one third of the throughput achievable on the primary under a TPC-C-like OLTP workload. This paper proposes KuaFu to close the parallelism gap in replicated database systems by enabling concurrent replay of transactions on a backup. KuaFu maintains write consistency on backups by tracking transaction dependencies. Concurrent replay on a backup does introduce read inconsistency between the primary and a backup; KuaFu further leverages multi-version concurrency control to produce snapshots in order to restore the consistency semantics. We have implemented KuaFu with MySQL; our evaluations show that KuaFu allows a backup to keep up with the primary while preserving replication consistency.

Materialization Strategies in the Vertica Analytic Database: Lessons Learned
Lakshmikant Shrinivas, Sreenath Bodagala, Ramakrishna Varadarajan, Ariel Cary, Vivek Bharathan, Chuck Bear (Vertica Systems, an HP Company)

Column-store databases allow for various tuple reconstruction strategies (also called materialization strategies). Early materialization is easy to implement but generally performs worse than late materialization. Late materialization is more complex to implement, and usually performs much better than early materialization, although there are situations where it is worse. We identify these situations, which essentially revolve around joins where neither input fits in memory (also called spilling joins). Sideways information passing techniques provide a viable solution for getting the best of both worlds. We demonstrate how early materialization combined with sideways information passing gives us the benefits of late materialization without the bookkeeping complexity or the worse performance for spilling joins. It also provides some other benefits to query processing in Vertica, due to positive interaction with compression and sort orders of the data. In this paper, we report our experiences with late and early materialization, highlight their strengths and weaknesses, and present the details of our sideways information passing implementation. We show experimental results comparing these materialization strategies, which highlight the significant performance improvements provided by our implementation of sideways information passing (up to 72% on some TPC-H queries).
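Sideways information passing is easy to illustrate. In this hedged sketch, a compact summary of one join input's keys (a plain Python set standing in for the Bloom-filter-like structure an engine such as Vertica would push into a scan) filters the other input before any expensive work; the table shapes and names are invented.

```python
# Early materialization plus sideways information passing, in miniature.
def sip_join(build_rows, probe_rows, build_key, probe_key):
    keys = {build_key(r) for r in build_rows}   # summary passed "sideways"
    ht = {}
    for r in build_rows:
        ht.setdefault(build_key(r), []).append(r)
    for r in probe_rows:
        k = probe_key(r)
        if k not in keys:        # cheap skip before any join work;
            continue             # a real engine pushes this into the scan
        for b in ht.get(k, []):
            yield (b, r)

orders = [(1, "2013-01-01"), (2, "2013-02-02")]
lineitems = [(1, 9.5), (3, 4.0)]
print(list(sip_join(orders, lineitems, lambda o: o[0], lambda l: l[0])))
```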
Demo Groups 1 & 2
2 - 3:30PM, Ballroom 2

Twitter+: Build Personalized Newspaper For Twitter
Chen Liu, Anthony K. H. Tung (National University of Singapore)

Nowadays, microblogging services such as Twitter play important roles in people's everyday lives. They enable users to publish and read text-based posts, known as "tweets", and to interact with each other through re-tweeting or commenting. In the literature, many efforts have been devoted to exploiting the social property of Twitter. However, beyond the social component, Twitter itself has become an indispensable source for users to acquire useful information. To maximize its value, we expect to pay more attention to the media property of Twitter. To be good media, Twitter should first provide an effective presentation of its news, so that users can read it easily. Currently, all tweets from followings are presented to users, usually organized by their publication timelines or sources. Presenting tweets along so few dimensions hinders users from conveniently finding the information that interests them. In this demo, we present "Twitter+", which aims to enrich users' reading experiences on Twitter by providing multiple ways to explore tweets, such as keyword presentation and topic finding. It presents users an alternative interface to browse tweets more effectively.

A Generic Database Benchmarking Service
Martin Kaufmann (ETH Zürich / SAP AG), Peter M. Fischer (Albert-Ludwigs-Universität Freiburg), Donald Kossmann (ETH Zürich), Norman May (SAP AG)

Benchmarks are widely applied for the development and optimization of database systems. Standard benchmarks such as TPC-C and TPC-H provide a way of comparing the performance of different systems. In addition, micro benchmarks can be exploited to test a specific behavior of a system. Yet, despite all the benefits that can be derived from benchmark results, the effort of implementing and executing benchmarks remains prohibitive: database systems need to be set up; a large number of artifacts such as data generators and queries need to be managed; and complex, time-consuming operations have to be orchestrated. In this demo, we introduce a generic benchmarking service that combines a rich meta model, low marginal cost and ease of use, and thus drastically reduces the time and cost to define, adapt and run a benchmark.

Aeolus: An Optimizer for Distributed Intra-Node-Parallel Streaming Systems
Matthias J. Sax (Humboldt-Universität zu Berlin), Malu Castellanos, Qiming Chen, Meichun Hsu (Hewlett-Packard Laboratories)

Aeolus is a prototype implementation of a topology optimizer on top of the distributed streaming system Storm. Aeolus extends Storm with a batching layer which can increase a topology's throughput by more than an order of magnitude. Furthermore, Aeolus implements an optimization algorithm that automatically computes the optimal batch size and degree of parallelism for each node in the topology. Although Aeolus is built on top of Storm, the developed concepts are not limited to Storm and can be applied to any distributed intra-node-parallel streaming system. We propose to demo Aeolus using an interactive Web UI. One part of the Web UI is a topology builder that allows the user to interact with the system. Topologies can be created from scratch, and their structure and/or parameters can be modified. Furthermore, the user can observe the impact of the changes on the optimization decisions and runtime behavior. Additionally, the Web UI gives deep insight into the optimization process by visualizing it. The user can interactively step through the optimization process while the UI shows the optimizer's state, computations, and decisions. The Web UI can also monitor the execution of a non-optimized and an optimized topology simultaneously, showing the advantage of using Aeolus.
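The batching layer's effect is easy to picture: buffering tuples and emitting them in groups trades a little latency for far fewer per-message overheads downstream. A toy version follows; the emit callback and fixed batch size are illustrative and are not Storm's or Aeolus's API, which additionally chooses the batch size per node.

```python
# Micro-batching: emit one downstream message per batch of tuples.
class BatchingEmitter:
    def __init__(self, emit, batch_size):
        self.emit, self.batch_size, self.buffer = emit, batch_size, []

    def send(self, tup):
        self.buffer.append(tup)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.emit(self.buffer)    # one message carries a whole batch
            self.buffer = []

out = BatchingEmitter(print, batch_size=3)
for i in range(7):
    out.send(i)
out.flush()   # prints [0, 1, 2], [3, 4, 5], then [6]
```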
Crowd-Answering System via Microblogging
Xianke Zhou, Ke Chen, Sai Wu, Bingbing Zhang (Zhejiang University)

Most crowdsourcing systems leverage public platforms, such as Amazon Mechanical Turk (AMT), to publish their jobs and collect the results. They are charged for using the platform's service and are also required to pay the workers for each successful job. Although the average wage of an online human worker is not high, for a 24x7 running service the crowdsourcing system becomes very expensive to maintain. We observe that there are, in fact, many sources of free online human volunteers, and microblogging systems are among the most promising. In this paper, we present our CrowdAnswer system, which is built on top of Weibo, the largest microblogging system in China. CrowdAnswer is a question-answering system which adaptively distributes various questions to different groups of microblogging users. The answers are then collected from those users' tweets and visualized for the question originator. CrowdAnswer maintains a virtual credit system: users need credits to publish questions, and they can gain credits by answering questions. A novel algorithm is proposed to route questions to interested users, which tries to maximize the probability of a question being answered successfully.

With a Little Help from My Friends
Arnab Nandi (The Ohio State University), Stelios Paparizos, John C. Shafer, Rakesh Agrawal (Microsoft Research)

A typical person has numerous online friends that, according to studies, the person often consults for opinions and advice. However, publicly broadcasting a question to all friends risks social capital when repeated too often, is not tolerant of topic sensitivity, and can result in no response, as the message is lost in a myriad of status updates. Direct messaging is more personal and avoids these pitfalls, but requires manual selection of friends to contact, which can be time-consuming and challenging. A user may have difficulty guessing which of their numerous online friends can provide a high-quality and timely response. We demonstrate a working system that addresses these issues by returning an ordered subset of friends, predicting (a) near-term availability, (b) willingness to respond and (c) topical knowledge, given a query. The combination of these three aspects is unique to our solution, and all are critical to the problem of obtaining timely and relevant responses. Our system acts as a decision aid: we give insight into why each friend was recommended and let the user decide whom to contact.
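The three-factor ranking can be sketched as a scoring function. The multiplicative combination and all the numbers below are illustrative guesses, not the demonstrated system's actual predictor, which learns availability, willingness and topical knowledge from observed behavior.

```python
# Rank friends by predicted availability, willingness, and topic knowledge.
def rank_friends(friends, query_topic, k=3):
    def score(f):
        return (f["availability"] * f["willingness"]
                * f["knowledge"].get(query_topic, 0.0))
    return sorted(friends, key=score, reverse=True)[:k]

friends = [
    {"name": "ann", "availability": 0.9, "willingness": 0.4,
     "knowledge": {"databases": 0.8}},
    {"name": "bob", "availability": 0.3, "willingness": 0.9,
     "knowledge": {"databases": 0.9}},
]
print([f["name"] for f in rank_friends(friends, "databases")])  # ann first
```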
Peeking into the Optimization of Data Flow Programs with MapReduce-style UDFs
Fabian Hueske (Technische Universität Berlin), Mathias Peters (Humboldt-Universität zu Berlin), Aljoscha Krettek, Matthias Ringwald, Kostas Tzoumas, Volker Markl (Technische Universität Berlin), Johann-Christoph Freytag (Humboldt-Universität zu Berlin)

Data flows are a popular abstraction for defining data-intensive processing tasks. In order to support a wide range of use cases, many data processing systems feature MapReduce-style user-defined functions (UDFs). In contrast to UDFs as known from relational DBMSs, MapReduce-style UDFs have less strict templates. These templates alone do not provide all the information needed to decide whether UDFs can be reordered with relational operators and other UDFs. However, it is well known that reordering operators such as filters, joins, and aggregations can yield runtime improvements of orders of magnitude. We demonstrate an optimizer for data flows that is able to reorder operators with MapReduce-style UDFs written in an imperative language. Our approach leverages static code analysis to extract information from UDFs, which is used to reason about the reorderability of UDF operators. This information is sufficient to enumerate a large fraction of the search space covered by conventional RDBMS optimizers, including filter and aggregation push-down, bushy join orders, and the choice of physical execution strategies based on interesting properties. We demonstrate our optimizer and a job submission client that allows users to peek step by step into each phase of the optimization process: the static code analysis of UDFs, the enumeration of reordered candidate data flows, the generation of physical execution plans, and their parallel execution. For the demonstration, we provide a selection of relational and non-relational data flow programs which highlight the salient features of our approach.

Very Fast Estimation for Result and Accuracy of Big Data Analytics: the EARL System
Nikolay Laptev, Kai Zeng, Carlo Zaniolo (University of California, Los Angeles)

Approximate results based on samples often provide the only way in which advanced analytical applications on very massive data sets (a.k.a. "big data") can satisfy their time and resource constraints. Unfortunately, methods and tools for the computation of accurate early results are currently not supported by big data systems (e.g., Hadoop). Therefore, we propose a nonparametric accuracy estimation method and system to speed up big data analytics. Our framework is called EARL (Early Accurate Result Library), and it works by predicting the learning curve and choosing the appropriate sample size for achieving the desired error bound specified by the user. The error estimates are based on a technique called bootstrapping that has been widely used and validated by statisticians, and it can be applied to arbitrary functions and data distributions. This demo will elucidate (a) the functionality of EARL and its intuitive GUI, whereby first-time users can appreciate the accuracy obtainable from increasing sample sizes by simply viewing the learning curve displayed by EARL, and (b) the usability of EARL, whereby conference participants can interact with the system to quickly estimate the sample sizes needed to obtain the desired accuracies or response times, and then compare them against the accuracies and response times obtained in the actual computations.
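Bootstrapping-based accuracy estimation, the statistical core of EARL, fits in a few lines. This hedged sketch (mean aggregate, doubling schedule, all parameters invented) resamples the current sample with replacement, reports the spread of the aggregate as the error estimate, and enlarges the sample until the estimate meets the user's bound.

```python
# Grow a sample until a bootstrap error estimate meets the target bound.
import random
import statistics

def bootstrap_error(sample, agg, resamples=200, seed=0):
    rng = random.Random(seed)
    stats = [agg([rng.choice(sample) for _ in sample])
             for _ in range(resamples)]           # resample with replacement
    return statistics.stdev(stats)

def early_result(data, agg, error_bound, start=100, seed=0):
    rng = random.Random(seed)
    n = start
    while True:
        sample = rng.sample(data, min(n, len(data)))
        if bootstrap_error(sample, agg) <= error_bound or n >= len(data):
            return agg(sample), n                 # estimate and sample size
        n *= 2                                    # climb the learning curve

gen = random.Random(1)
data = [gen.gauss(50, 10) for _ in range(100_000)]
print(early_result(data, statistics.fmean, error_bound=0.5))
```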
Road Network Mix-zones for Anonymous Location-Based Services
Balaji Palanisamy, Sindhuja Ravichandran, Ling Liu, Binh Han, Kisung Lee, Calton Pu (Georgia Institute of Technology)

We present MobiMix, a road-network-based mix-zone framework to protect the location privacy of mobile users traveling on road networks. An alternative and complementary approach to spatial-cloaking-based location privacy protection is to break the continuity of location exposure by introducing techniques, such as mix-zones, where no application can trace user movements. However, existing mix-zone proposals fail to provide effective mix-zone construction and placement algorithms that are resilient to timing and transition attacks. In MobiMix, mix-zones are constructed and placed by carefully taking into consideration multiple factors, such as the geometry of the zones, the statistical behavior of the user population, the spatial constraints on the movement patterns of the users, and the temporal and spatial resolution of the location exposure. In this demonstration, we first introduce a visualization of the location privacy risks of mobile users traveling on road networks, and show how mix-zone-based anonymization breaks the continuity of location exposure to protect user location privacy. We demonstrate a suite of road network mix-zone construction and placement methods that provide a higher level of resilience to timing and transition attacks on road networks. We show the effectiveness of the MobiMix approach through detailed visualization using traces produced by GTMobiSim on different scales of geographic maps.

Query Time Scaling of Attribute Values in Interval Timestamped Databases
Anton Dignös, Michael Böhlen (University of Zurich), Johann Gamper (Free University of Bozen-Bolzano)

In valid-time databases with interval timestamping, each tuple is associated with a time interval over which the recorded fact is true in the modeled reality. The adjustment of these intervals is an essential part of processing interval timestamped data. Some attribute values remain valid if the associated interval changes, whereas others have to be scaled along with the time interval. For example, attributes that record total (cumulative) quantities over time, such as project budgets, total sales or total costs, often must be scaled if the timestamp is adjusted. The goal of this demo is to show how to support the scaling of attribute values in SQL at query time.

Extracting Interesting Related Context-dependent Concepts from Social Media Streams using Temporal Distributions
Craig P. Sayers, Meichun Hsu (Hewlett-Packard Labs)

To enable the interactive exploration of large social media datasets, we exploit the temporal distributions of word n-grams within the message stream to discover "interesting" concepts, determine "relatedness" between concepts, and find representative examples for display. We present a new algorithm for context-dependent "interestingness" using the coefficient of variation of the temporal distribution, apply the well-known technique of Pearson's correlation to tweets using equi-height histogramming to determine correlation, and employ an asymmetric variant for computing "relatedness" to encourage exploration. We further introduce techniques using interestingness, correlation, and relatedness to automatically discover concepts and select preferred word n-grams for display. These techniques are demonstrated on an 800,000-tweet dataset from the Academy Awards.
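The interestingness signal itself is tiny: the coefficient of variation of an n-gram's temporal distribution, so that bursty terms score high and steadily mentioned terms score low. The bucketing and counts below are made up, and the paper's context-dependent refinement is not shown.

```python
# Coefficient-of-variation "interestingness" of a temporal distribution.
import statistics

def interestingness(counts_per_bucket):
    mean = statistics.fmean(counts_per_bucket)
    if mean == 0:
        return 0.0
    return statistics.pstdev(counts_per_bucket) / mean

steady = [10, 11, 9, 10, 10, 10]   # background chatter: low score
bursty = [0, 1, 0, 55, 3, 1]       # a moment during the ceremony: high score
print(interestingness(steady), interestingness(bursty))
```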
VERDICT: Privacy-Preserving Authentication of Range Queries in Location-based Services
Haibo Hu, Qian Chen, Jianliang Xu (Hong Kong Baptist University)

We demonstrate VERDICT, a location-based range query service featuring privacy-preserving authentication. VERDICT adopts the common data-as-a-service (DaaS) model, which consists of the data owner (a location registry or a mobile operator) who provides the queried data, the service provider who executes the query, and the querying users. The system features a privacy-preserving query authentication module that enables the user to verify the correctness of results while still protecting data privacy. This feature is crucial in many location-based services where the queried data are user locations. To achieve this, VERDICT employs an MR-tree-based privacy-preserving authentication scheme proposed in our earlier work. A use-case study shows that VERDICT provides an efficient and smooth user experience for authenticating location-based range queries.

ExpFinder: Finding Experts by Graph Pattern Matching
Wenfei Fan (The University of Edinburgh / Beihang University), Xin Wang (The University of Edinburgh), Yinghui Wu (University of California Santa Barbara)

We present ExpFinder, a system for finding experts in social networks based on graph pattern matching. We demonstrate (1) how ExpFinder identifies top-k experts in a social network by supporting bounded simulation of graph patterns, and by ranking the matches based on a metric for social impact; (2) how it copes with the sheer size of real-life social graphs by supporting incremental query evaluation and query-preserving graph compression; and (3) how the GUI of ExpFinder interacts with users to help them construct queries and inspect matches.

Tajo: A Distributed Data Warehouse System on Large Clusters
Hyunsik Choi, Jihoon Son, Haemi Yang, Hyoseok Ryu, Byungnam Lim, Soohyung Kim, Yon Dohn Chung (Korea University)

The increasing volume of relational data calls for new ways of coping with it. Recently, several hybrid approaches between parallel databases and Hadoop (e.g., HadoopDB and Hive) have been introduced to the database community. Although these hybrid approaches have gained wide popularity, they cannot avoid choosing suboptimal execution strategies; we believe this problem is caused by the inherent limits of their architectures. In this demo, we present Tajo, a relational, distributed data warehouse system for shared-nothing clusters. It uses the Hadoop Distributed File System (HDFS) as its storage layer and has its own query execution engine, which we have developed instead of using the MapReduce framework. A Tajo cluster consists of one master node and a number of workers across cluster nodes. The master is mainly responsible for query planning and coordination of the workers. The master divides a query into small tasks and disseminates them to workers. Each worker has a local query engine that executes a directed acyclic graph of physical operators. A DAG of operators can take two or more input sources and be pipelined within the local query engine. In addition, Tajo can control distributed data flow more flexibly than MapReduce and supports indexing techniques. By combining these features, Tajo can employ more optimized and efficient query processing, including existing methods that have been studied in traditional database research. To give a deep understanding of the Tajo architecture and its behavior during query processing, the demonstration will allow users to submit TPC-H queries to a 32-node Tajo cluster. The web-based user interface will show (1) how the submitted queries are planned, (2) how the queries are distributed across nodes, (3) the cluster and node status, and (4) details of the relations and their physical information. We also provide a performance evaluation of Tajo compared with Hive.
VERDICT: Privacy-Preserving Authentication of Range Queries in Location-based Services
Haibo Hu, Qian Chen, Jianliang Xu (Hong Kong Baptist University)
We demonstrate VERDICT, a location-based range query service featuring privacy-preserving authentication. VERDICT adopts the common data-as-a-service (DaaS) model, which consists of the data owner (a location registry or a mobile operator) who provides the querying data, the service provider who executes the query, and the querying users. The system features a privacy-preserving query authentication module that enables the user to verify the correctness of results while still protecting data privacy. This feature is crucial in many location-based services where the querying data are user locations. To achieve this, VERDICT employs an MR-tree based privacy-preserving authentication scheme proposed in our earlier work. The use case study shows that VERDICT provides an efficient and smooth user experience for authenticating location-based range queries.

ExpFinder: Finding Experts by Graph Pattern Matching
Wenfei Fan (The University of Edinburgh / Beihang University), Xin Wang (The University of Edinburgh), Yinghui Wu (University of California Santa Barbara)
We present ExpFinder, a system for finding experts in social networks based on graph pattern matching. We demonstrate (1) how ExpFinder identifies top-K experts in a social network by supporting bounded simulation of graph patterns, and by ranking the matches based on a metric for social impact; (2) how it copes with the sheer size of real-life social graphs by supporting incremental query evaluation and query-preserving graph compression; and (3) how the GUI of ExpFinder interacts with users to help them construct queries and inspect matches.

Tajo: A Distributed Data Warehouse System on Large Clusters
Hyunsik Choi, Jihoon Son, Haemi Yang, Hyoseok Ryu, Byungnam Lim, Soohyung Kim, Yon Dohn Chung (Korea University)
The increasing volume of relational data calls for alternative ways of coping with it. Recently, several hybrid approaches (e.g., HadoopDB and Hive) between parallel databases and Hadoop have been introduced to the database community. Although these hybrid approaches have gained wide popularity, they cannot avoid the choice of suboptimal execution strategies. We believe that this problem is caused by the inherent limits of their architectures. In this demo, we present Tajo, a relational, distributed data warehouse system for shared-nothing clusters. It uses the Hadoop Distributed File System (HDFS) as the storage layer and has its own query execution engine, which we have developed instead of using the MapReduce framework. A Tajo cluster consists of one master node and a number of workers across cluster nodes. The master is mainly responsible for query planning and for coordinating the workers. The master divides a query into small tasks and disseminates them to workers. Each worker has a local query engine that executes a directed acyclic graph of physical operators. A DAG of operators can take two or more input sources and be pipelined within the local query engine. In addition, Tajo can control distributed data flow more flexibly than MapReduce and supports indexing techniques. By combining these features, Tajo can employ more optimized and efficient query processing, including existing methods that have been studied in traditional database research. To give a deep understanding of the Tajo architecture and its behavior during query processing, the demonstration will allow users to submit TPC-H queries to 32 Tajo cluster nodes. The web-based user interface will show (1) how the submitted queries are planned, (2) how the queries are distributed across nodes, (3) the cluster and node status, and (4) details of relations and their physical information. We also provide a performance evaluation of Tajo compared with Hive.

Research 9: Indexing Structures, 4 - 5:30PM, Chair: Rui Zhang (University of Melbourne), Ballroom 1

The Bw-Tree: A B-tree for New Hardware Platforms
Justin Levandoski, David B. Lomet, Sudipta Sengupta (Microsoft Research)
The emergence of new hardware and platforms has led to reconsideration of how data management systems are designed. However, certain basic functions such as key-indexed access to records remain essential. While we exploit the common architectural layering of prior systems, we make radically new design decisions about each layer. Our new form of B-tree, called the Bw-tree, achieves its very high performance via a latch-free approach that effectively exploits the processor caches of modern multi-core chips. Our storage manager uses a unique form of log structuring that blurs the distinction between a page and a record store and works well with flash storage. This paper describes the architecture and algorithms for the Bw-tree, focusing on the main memory aspects. The paper includes results of our experiments that demonstrate that this fresh approach produces outstanding performance.
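The abstract does not spell out the latch-free mechanism. Below is a minimal sketch of the general style of update it alludes to, in which a thread publishes a change with a single compare-and-swap (CAS) instead of taking a latch; the delta-record flavor is an assumption for illustration, and Python's lock here merely simulates what would be one hardware atomic instruction in C/C++:

```python
import threading
from dataclasses import dataclass
from typing import Optional

@dataclass
class Delta:
    key: int
    value: str
    next: Optional["Delta"]  # chain of deltas ending at the base page

class MappingSlot:
    """One slot of a logical-page mapping table. Updates prepend a delta
    record and install it with a single CAS; no latches are taken."""
    def __init__(self, base):
        self._head = base
        self._lock = threading.Lock()  # stand-in for a hardware CAS

    def compare_and_swap(self, expected, new):
        with self._lock:  # a real implementation uses one atomic instruction
            if self._head is expected:
                self._head = new
                return True
            return False

    def insert(self, key, value):
        while True:  # retry loop: the CAS fails if another thread won the race
            head = self._head
            if self.compare_and_swap(head, Delta(key, value, head)):
                return
```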
Secure and Efficient Range Queries on Outsourced Databases Using R̂-trees
Peng Wang, Chinya V. Ravishankar (University of California-Riverside)
We show how to execute range queries securely and efficiently on encrypted databases in the cloud. Current methods provide either security or efficiency, but not both. Many schemes even reveal the ordering of encrypted tuples, which, as we show, allows adversaries to estimate plaintext values accurately. We present the R̂-tree, a hierarchical encrypted index that may be securely placed in the cloud and searched efficiently. It is based on a mechanism we design for encrypted half-space range queries in R^d, using Asymmetric Scalar-product Preserving Encryption. Data owners can tune the R̂-tree parameters to achieve desired security-efficiency tradeoffs. We also present extensive experiments to evaluate R̂-tree performance. Our results show that R̂-tree queries are efficient on encrypted databases, and reveal far less information than competing methods.

An Efficient and Compact Indexing Scheme for Large-scale Data Store
Peng Lu (National University of Singapore), Sai Wu, Lidan Shou (Zhejiang University), Kian-Lee Tan (National University of Singapore)
The amount of data managed in today's Cloud systems has reached an unprecedented scale. In order to speed up query processing, an effective mechanism is to build indexes on attributes that are used in query predicates. However, conventional indexing schemes fail to provide a scalable service: as the sizes of these indexes are proportional to the data size, it is not space efficient to build many indexes. It thus becomes crucial to develop effective indexes that can support scalable database services in the Cloud. In this paper, we propose a compact bitmap indexing scheme for a large-scale data store. The bitmap indexing scheme combines state-of-the-art bitmap compression techniques, such as WAH encoding and bit-sliced encoding. To further reduce the index cost, a novel and query-efficient partial indexing technique is adopted, which dynamically refreshes the index to handle updates and process queries. The intuition of our indexing approach is to maximize the number of indexed attributes, so that a wider range of queries, including range and join queries, can be efficiently supported. Our indexing scheme is light-weight, and its creation can be seamlessly grafted onto the MapReduce processing engine without incurring significant running cost. Moreover, the compactness allows us to maintain the bitmap indexes in memory, so that the performance overhead of index access is minimal. We implement our indexing scheme on top of the underlying Distributed File System (DFS) and evaluate its performance on an in-house cluster. We compare our index-based query processing with HadoopDB to show its superior performance. Our experimental results confirm the effectiveness, efficiency and scalability of the indexing scheme.
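For readers unfamiliar with WAH (word-aligned hybrid) encoding, the compression technique the abstract names, a simplified sketch follows. It uses 8-bit words (7 payload bits) rather than the usual 32-bit words to keep the example readable; the real encoding is bit-level and considerably more engineered:

```python
def wah_encode(bits, w=8):
    """Simplified word-aligned hybrid encoding. Groups of (w-1) bits become
    literal words (MSB = 0); maximal runs of all-0 or all-1 groups collapse
    into fill words (MSB = 1, next bit = fill value, remaining w-2 bits =
    run length in groups)."""
    g = w - 1
    groups = [bits[i:i + g].ljust(g, "0") for i in range(0, len(bits), g)]
    words, i = [], 0
    while i < len(groups):
        if groups[i] in ("0" * g, "1" * g):          # candidate fill run
            fill, run = groups[i][0], 1
            while i + run < len(groups) and groups[i + run] == groups[i]:
                run += 1
            words.append("1" + fill + format(run, f"0{w - 2}b"))
            i += run
        else:                                        # literal word
            words.append("0" + groups[i])
            i += 1
    return words

# 21 zeros then a mixed group: one fill word covering 3 groups + 1 literal.
print(wah_encode("0" * 21 + "0110100"))  # ['10000011', '00110100']
```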
Research 10: Main Memory Query Processing, 4 - 5:30PM, Chair: Paul Larson (Microsoft), St Germaine

Recycling in Pipelined Query Evaluation
Fabian Nagel (The University of Edinburgh), Peter Boncz (Centrum Wiskunde & Informatica), Stratis Viglas (The University of Edinburgh)
Database systems typically execute queries in isolation. Sharing recurring intermediate and final results between successive query invocations is ignored or only exploited by caching final query results. The DBA is kept in the loop to make explicit sharing decisions by identifying and/or defining materialized views. Decisions are thus made only after a long time, and sharing opportunities may be missed. Recycling intermediate results has been proposed as a method to let database query engines profit from opportunities to reuse fine-grained partial query results; it is fully autonomous and able to continuously adapt to changes in the workload. The technique was recently revisited in the context of MonetDB, a system that by default materializes all intermediate results. Materializing intermediate results can consume significant system resources, so most other database systems avoid this where possible, following a pipelined query architecture instead. The novelty of this paper is to show how recycling can successfully be applied in pipelined query executors, by tracking the benefit of materializing possible intermediate results and then choosing the ones that make best use of a limited intermediate result cache. We present ways to maximize the potential of recycling by leveraging subsumption and proactive query rewriting. We have implemented our approach in the Vectorwise database engine and have experimentally evaluated its potential using both synthetic and real-world datasets. Our results show that intermediate result recycling significantly improves performance.

Efficient Many-Core Query Execution in Main Memory Column-Stores
Jonathan Dees (SAP AG / Karlsruhe Institute of Technology), Peter Sanders (Karlsruhe Institute of Technology)
We use the full query set of the TPC-H benchmark as a case study for the efficient implementation of decision support queries on main memory column-store databases. Instead of splitting a query into separate independent operators, we consider the query as a whole and translate the execution plan into a single function performing the query. This allows highly efficient CPU utilization, minimal materialization, and execution in a single pass over the data for most queries. The single pass is performed in parallel and scales near-linearly with the number of cores. The resulting query plans for most of the 22 queries are remarkably simple and are suited for automatic generation and fast compilation. Using a data-parallel, NUMA-aware many-core implementation with block summaries, inverted index data structures, and efficient aggregation algorithms, we achieve one to two orders of magnitude better performance than the current record holders of the TPC-H benchmark.

Main-Memory Hash Joins on Multi-Core CPUs: Tuning to the Underlying Hardware
Cagri Balkesen, Jens Teubner, Gustavo Alonso (ETH Zürich), M. Tamer Özsu (University of Waterloo)
The architectural changes introduced with multi-core CPUs have triggered a redesign of main-memory join algorithms. In the last few years, two diverging views have appeared. One approach advocates careful tailoring of the algorithm to the architectural parameters (cache sizes, TLB, and memory bandwidth). The other approach argues that modern hardware is good enough at hiding cache and TLB miss latencies and that, consequently, the careful tailoring can be omitted without sacrificing performance. In this paper we demonstrate, through experimental analysis of different algorithms and architectures, that hardware still matters. Join algorithms that are hardware-conscious perform better than hardware-oblivious approaches. The analysis and comparisons in the paper show that many of the claims regarding the behavior of join algorithms that have appeared in the literature are due to selection effects (relative table sizes, tuple sizes, the underlying architecture, using sorted data, etc.) and are not supported by experiments run under different parameter settings. Through the analysis, we shed light on how modern hardware affects the implementation of data operators, and we provide the fastest implementation of radix join to date, reaching close to 200 million tuples per second.
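A minimal sketch of the core idea of radix join, the hardware-conscious algorithm the abstract refers to: both inputs are first partitioned on the low bits of the key's hash so that each pair of co-joining partitions fits in cache, and a simple hash join then runs per partition. Real implementations partition in multiple passes to respect TLB limits, which this sketch omits:

```python
def radix_partition(tuples, bits):
    """Split (key, payload) tuples into 2**bits partitions on the low bits
    of the key hash, so that matching partitions can be joined in cache."""
    parts = [[] for _ in range(1 << bits)]
    for t in tuples:
        parts[hash(t[0]) & ((1 << bits) - 1)].append(t)
    return parts

def radix_join(r, s, bits=4):
    out = []
    for pr, ps in zip(radix_partition(r, bits), radix_partition(s, bits)):
        table = {}
        for k, v in pr:                    # build on one side
            table.setdefault(k, []).append(v)
        for k, v in ps:                    # probe with the other
            out.extend((k, rv, v) for rv in table.get(k, ()))
    return out

print(radix_join([(1, "a"), (2, "b")], [(1, "x"), (1, "y"), (3, "z")]))
# [(1, 'a', 'x'), (1, 'a', 'y')]
```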
Research 11: Data Mining I, 4 - 5:30PM, Chair: Gautam Das (Qatar Computing Research Institute), Bastille 1

Coupled Clustering Ensemble: Incorporating Coupling Relationships Both between Base Clusterings and Objects
Can Wang, Zhong She, Longbing Cao (University of Technology Sydney)
Clustering ensemble is a powerful approach for improving the accuracy and stability of individual (base) clustering algorithms. Most existing clustering ensemble methods obtain the final solution by assuming that base clusterings perform independently of one another and that all objects are independent too. However, in real-world data sources, objects are more or less associated through certain coupling relationships. Base clusterings trained on the source data are complementary to one another, since each of them may capture only a specific rather than the full picture of the data. In this paper, we discuss the problem of explicating the dependency between base clusterings and between objects in clustering ensembles, and propose a framework for coupled clustering ensembles (CCE). CCE not only considers but also integrates the coupling relationships between base clusterings and between objects. Specifically, we involve both the intra-coupling within one base clustering (i.e., cluster label frequency distribution) and the inter-coupling between different base clusterings (i.e., cluster label co-occurrence dependency). Furthermore, we engage both the intra-coupling between two objects in terms of the base clustering aggregation and the inter-coupling among other objects in terms of neighborhood relationship. This is the first work to explicitly address the dependency between base clusterings and between objects, verified by the application of such couplings in three types of consensus functions: clustering-based, object-based and cluster-based. Substantial experiments on synthetic and UCI data sets demonstrate that the CCE framework can effectively capture the interactions embedded in base clusterings and objects, with higher clustering accuracy and stability than several state-of-the-art techniques, which is also supported by statistical analysis.

Focused Matrix Factorization For Audience Selection in Display Advertising
Bhargav Kanagal, Amr Ahmed (Google Inc.), Sandeep Pandey (Twitter Inc.), Vanja Josifovski, Lluis Garcia-Pueyo (Google Inc.), Jeff Yuan (Yahoo! Research)
Audience selection is a key problem in display advertising systems, in which we need to select a list of users who are interested in (i.e., most likely to buy from) an advertising campaign. The users' past feedback on this campaign can be leveraged to construct such a list using collaborative filtering techniques such as matrix factorization. However, the user-campaign interaction is typically extremely sparse, hence conventional matrix factorization does not perform well. Moreover, simply combining the users' feedback from all campaigns does not address this, since it dilutes the focus on the target campaign in consideration. To resolve these issues, we propose a novel focused matrix factorization model (FMF) which learns users' preferences towards the specific campaign products, while also exploiting information about related products. We exploit the product taxonomy to discover related campaigns, and design models to discriminate between the users' interest towards campaign products and non-campaign products. We develop a parallel multi-core implementation of the FMF model and evaluate its performance over a real-world advertising dataset spanning more than a million products. Our experiments demonstrate the benefits of using our models over existing approaches.
Graph Stream Classification using Labeled and Unlabeled Graphs
Shirui Pan, Xingquan Zhu, Chengqi Zhang (University of Technology Sydney), Philip S. Yu (University of Illinois at Chicago)
Graph classification is becoming increasingly popular due to the rapidly rising number of applications involving data with structural dependency. The wide spread of graph applications and the inherently complex relationships between graph objects have made the labels of graph data expensive and/or difficult to obtain, especially for applications involving dynamically changing graph records. While labeled graphs are limited, copious amounts of unlabeled graphs are often easy to obtain with trivial effort. In this paper, we propose a framework to build a stream-based graph classification model by combining both labeled and unlabeled graphs. Our method, called gSLU, employs an ensemble-based framework to partition graph streams into a number of graph chunks, each containing some labeled and unlabeled graphs. For each individual chunk, we propose a minimum-redundancy subgraph feature selection module to select a set of informative subgraph features to build a classifier. To tackle concept drifting in graph streams, an instance-level weighting mechanism is used to dynamically adjust the instance weights, through which the subgraph feature selection can emphasize difficult graph samples. The classifiers built from different graph chunks form an ensemble for graph stream classification. Experiments on real-world graph streams demonstrate clear benefits of using minimum-redundancy subgraph features to build accurate classifiers. By employing instance-level weighting, our graph ensemble model can effectively adapt to concept drifting in the graph stream for classification.

Research 12: Moving Objects, 4 - 5:30PM, Chair: Shuo Shang (Aalborg University), Bastille 2

T-Share: A Large-Scale Dynamic Taxi Ridesharing Service
Shuo Ma (University of Illinois at Chicago / Microsoft Research Asia), Yu Zheng (Microsoft Research Asia), Ouri Wolfson (University of Illinois at Chicago / Microsoft Research Asia)
Taxi ridesharing can be of significant social and environmental benefit, e.g., by saving energy consumption and satisfying more people's commute needs in peak hours. Despite this great potential, taxi ridesharing, especially with dynamic queries, is not well studied. In this paper, we formally define the dynamic ridesharing problem and propose a large-scale taxi ridesharing service. It efficiently serves real-time requests sent by taxi users and generates ridesharing schedules that significantly reduce the total travel distance. In our method, we first propose a taxi searching algorithm using a spatio-temporal index to quickly retrieve candidate taxis that are likely to satisfy a user query. A scheduling algorithm is then proposed: it checks each candidate taxi and inserts the query's trip into the schedule of the taxi that satisfies the query with the minimum additional incurred travel distance. To tackle the heavy computational load, a lazy shortest-path calculation strategy is devised to speed up the scheduling algorithm. We evaluated our service using a GPS trajectory dataset generated by over 33,000 taxis during a period of 3 months. By learning the spatio-temporal distributions and the stochastic process of real user queries from this dataset, we built an experimental platform that can simulate real-world user behavior in taking a taxi. Tested on this platform with extensive experiments, our approach demonstrated its efficiency, effectiveness, and scalability. For example, our proposed service can serve 25% additional taxi users while saving 13% travel distance compared with no ridesharing (when the ratio of the number of queries to the number of taxis is 6).
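A minimal sketch of the scheduling step described above: for each candidate taxi, try every way of inserting the new pickup and dropoff into its existing schedule and keep the insertion with the smallest added distance. Straight-line distance stands in for the paper's road-network shortest paths, and capacity and time-window checks are omitted:

```python
from math import dist

def route_length(points):
    return sum(dist(a, b) for a, b in zip(points, points[1:]))

def best_insertion(schedule, pickup, dropoff):
    """Try all pickup/dropoff insertion positions (pickup first) and
    return (added_distance, new_schedule) for the cheapest option."""
    base, best = route_length(schedule), None
    for i in range(1, len(schedule) + 1):
        for j in range(i, len(schedule) + 1):
            cand = (schedule[:i] + [pickup] +
                    schedule[i:j] + [dropoff] + schedule[j:])
            added = route_length(cand) - base
            if best is None or added < best[0]:
                best = (added, cand)
    return best

# Taxi currently at (0,0) heading to (10,0); rider wants (4,1) -> (7,1).
print(best_insertion([(0, 0), (10, 0)], (4, 1), (7, 1)))
```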
Efficient Notification of Meeting Points for Moving Groups via Independent Safe Regions
Jing Li (The University of Hong Kong), Man Lung Yiu (Hong Kong Polytechnic University), Nikos Mamoulis (The University of Hong Kong)
In applications like social networking services and online games, multiple moving users form a group and wish to be continuously notified of the best meeting point with respect to their locations. A promising technique for reducing the communication frequency of the application server is to apply safe regions, which capture the validity of query results with respect to the users' locations. Unfortunately, the safe regions in our problem exhibit characteristics such as irregular shapes and dependencies among multiple safe regions. These unique characteristics render existing safe-region methods, which focus on a single safe region, inapplicable to our problem. To tackle these challenges, we first examine the shapes of safe regions in our problem context and propose feasible approximations for them. We design efficient algorithms for computing these safe regions, and we develop compression techniques for representing safe regions in a compact manner. Experiments with both real and synthetic data demonstrate the efficiency of our proposal in terms of computation and communication costs.

Efficient Distance-Aware Query Evaluation on Indoor Moving Objects
Xike Xie, Hua Lu, Torben Bach Pedersen (Aalborg University)
Indoor spaces accommodate large parts of people's lives. The increasing availability of indoor positioning, driven by technologies like Wi-Fi, RFID, and Bluetooth, enables a variety of indoor location-based services (LBSs). Efficient distance-aware queries on indoor moving objects play an important role in supporting and boosting such LBSs. However, distance-aware query evaluation on indoor moving objects is challenging because: (1) indoor spaces are characterized by many special entities and thus render distance calculation very complex; (2) the limitations of indoor positioning technologies create inherent uncertainties in indoor moving object data. In this paper, we propose a complete set of techniques for efficient distance-aware queries on indoor moving objects. We define and categorize indoor distances in relation to indoor uncertain objects, and derive different distance bounds that can facilitate query evaluation. Existing works often assume that indoor floor plans are static, and they require extensive pre-computation on indoor topologies. In contrast, we design a composite index scheme that integrates indoor geometries, indoor topologies, and indoor uncertain objects, and thus supports indoor distance-aware queries efficiently without time-consuming and volatile distance computation. We design algorithms for range queries and k nearest neighbor queries on indoor moving objects. The results of extensive experimental studies demonstrate that our proposals are efficient and scalable in evaluating distance-aware queries over indoor moving objects.

Seminar 4: Knowledge Harvesting from Text and Web Sources, 4 - 5:30PM, Odeon
Fabian Suchanek, Gerhard Weikum (Max Planck Institute for Informatics)
The proliferation of knowledge-sharing communities such as Wikipedia and the progress in scalable information extraction from Web and text sources have enabled the automatic construction of very large knowledge bases. Recent endeavors of this kind include academic research projects such as DBpedia, KnowItAll, Probase, ReadTheWeb, and YAGO, as well as industrial ones such as Freebase and Trueknowledge. These projects provide automatically constructed knowledge bases of facts about named entities, their semantic classes, and their mutual relationships. Such world knowledge in turn enables cognitive applications and knowledge-centric services like disambiguating natural-language text, deep question answering, and semantic search for entities and relations in Web and enterprise data. Prominent examples of how knowledge bases can be harnessed include the Google Knowledge Graph and the IBM Watson question answering system. This tutorial presents state-of-the-art methods, recent advances, research opportunities, and open challenges along this avenue of knowledge harvesting and its applications.

Industry 3, 4 - 5:30PM, Chair: Vibhor Rastogi (Google Inc.), Concorde
Pipe Failure Prediction: A Data Mining Method
Rui Wang (University of Science & Technology of China), Weishan Dong, Yu Wang (IBM Research – China), Ke Tang (University of Science & Technology of China), Xin Yao (University of Science & Technology of China / The University of Birmingham)
Pipe breaks in urban water distribution networks lead to significant economic and social costs, putting the service quality as well as the profit of water utilities at risk. To cope with such a situation, scheduled preventive maintenance is desired, which aims to proactively predict and fix pipes at risk of breaking. Physical models developed for understanding and predicting the failure of pipes are usually expensive, and thus can only be used on a limited number of trunk pipes. As an alternative, statistical models that try to predict pipe breaks based on historical data are far less expensive, and have therefore attracted a lot of interest from water utilities recently. In this paper, we report a novel data mining prediction system that has been built for a water utility in a big Chinese city. Various aspects of how to build such a system are described, including problem formulation, data cleaning, model construction, and evaluating the importance of attributes according to the requirements of end users in water utilities. Satisfactory results have been achieved by our prediction system. For example, with the system trained on the available dataset at the end of 2010, the water utility could avoid 50% of pipe breaks in 2011 by examining only 6.98% of its pipes in advance. During the construction of the system, we found that the extremely skewed distribution of break and non-break pipes, interestingly, is not an obstacle. This lesson could serve as a practical reference both for academic studies on imbalanced learning and for future explorations of pipe failure prediction problems.

SASH: Enabling Continuous Incremental Analytic Workflows on Hadoop
Manish Sethi, Narendran Sachindran, Sriram Raghavan (IBM India Research Lab)
There is an emerging class of enterprise applications in areas such as log data analysis, information discovery, and social media marketing that involve analytics over large volumes of unstructured and semi-structured data. These applications leverage new analytics platforms based on the MapReduce framework and its open-source Hadoop implementation. While this trend has engendered work on high-level data analysis languages, NoSQL data stores, workflow engines, etc., there has been very little attention to the challenges of deploying analytic workflows into production for continuous operation. In this paper, we argue that an essential platform component for enabling continuous production analytic workflows is an analytics store. We highlight five key requirements that impact the design of such a store: (i) efficient incremental operations, (ii) a flexible storage model for hierarchical data, (iii) snapshot isolation, (iv) object-level incremental updates, and (v) support for handling change sets. We describe the design of SASH, a scalable analytics store that we have developed on top of HBase to address these requirements. Using the workload from a production workflow that powers search within IBM's intranet and extranet, we demonstrate orders of magnitude improvement in IO performance using SASH.
Automating Pattern Discovery for Rule-Based Data Standardization Systems
Snigdha Chaturvedi, Hima Prasad K, Tanveer A. Faruquie, Bhupesh S. Chawda, L Venkata Subramaniam, Raghuram Krishnapuram (IBM Research-India)
Data quality is a perennial problem for many enterprise data assets. To improve data quality, businesses often employ rule-based data standardization systems in which domain experts code rules for handling important and prevalent patterns. Finding these patterns is laborious and time consuming, particularly for noisy or highly specialized data sets. It is also subjective, depending on the persons determining the patterns. In this paper we present a tool to automatically mine patterns that can help improve the efficiency and effectiveness of these data standardization systems. The automatically extracted patterns are used by the domain and knowledge experts for rule writing. We use a greedy algorithm to extract patterns that result in a maximal coverage of the data. We further group the extracted patterns such that each group represents patterns that capture similar domain knowledge. We propose a similarity measure that uses input pattern semantics to group these patterns. We demonstrate the effectiveness of our method for standardization tasks on three real-world datasets.
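A minimal sketch of the greedy maximal-coverage step the abstract describes: repeatedly pick the pattern that covers the most not-yet-covered records. The coarse token-class patterns (WORD/NUM) are a hypothetical illustration; the paper's actual pattern language is not given in the abstract:

```python
import re

def tokenize(record):
    """Map a raw record to a coarse token-class pattern (illustrative)."""
    return " ".join("NUM" if t.isdigit() else "WORD"
                    for t in re.findall(r"\w+", record))

def greedy_patterns(records, k):
    """Greedily pick up to k patterns maximizing coverage of the data."""
    remaining, chosen = set(records), []
    for _ in range(k):
        best = max({tokenize(r) for r in remaining},
                   key=lambda p: sum(tokenize(r) == p for r in remaining),
                   default=None)
        if best is None:
            break
        chosen.append(best)
        remaining = {r for r in remaining if tokenize(r) != best}
    return chosen

print(greedy_patterns(["12 Main St", "99 High St", "PO Box 7"], 2))
# ['NUM WORD WORD', 'WORD WORD NUM']
```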
Demo Groups 1 & 2, 4 - 5:30PM, Ballroom 2. See Demo Groups 1 & 2 (p. 49) for demonstration details.

Wednesday 10 April

Keynote 2, 9 - 10AM, Chair: Chris Jermaine (Rice University), Ballroom Le Grand

Recent Advances on Structured Data and the Web
Alon Halevy (Google Inc.)
Abstract: The World-Wide Web contains vast quantities of structured data on a variety of domains, such as hobbies, products and reference data. Moreover, the Web provides a platform that can encourage the publishing of more data sets from governments and other public organizations, and it supports new data management opportunities, such as effective crisis response, data journalism and crowd-sourced data sets. For the first time since the emergence of the Web, structured data is being used widely by search engines and is being collected via a concerted effort. I will describe some of the efforts we are conducting at Google to collect structured data, filter the high-quality content, and serve it to our users. These efforts include providing Google Fusion Tables, a service for easily ingesting, visualizing and integrating data, mining the Web for high-quality HTML tables, and contributing these data assets to Google’s other services.
Bio: Alon Halevy heads the Structured Data Management Research group at Google. Prior to that, he was a professor of Computer Science at the University of Washington in Seattle, where he founded the database group. In 1999, Dr. Halevy co-founded Nimble Technology, one of the first companies in the Enterprise Information Integration space, and in 2004 he founded Transformic, a company that created search engines for the deep web and was acquired by Google. Dr. Halevy is a Fellow of the Association for Computing Machinery, received the Presidential Early Career Award for Scientists and Engineers (PECASE) in 2000, and was a Sloan Fellow (1999-2000). He received his Ph.D. in Computer Science from Stanford University in 1993 and his Bachelors from the Hebrew University in Jerusalem. Halevy is also a coffee culturalist; he is the author of the book “The Infinite Emotions of Coffee” (2011) and a co-author of the book “Principles of Data Integration” (2012).

Research 13: Data Cleaning, 10:30 - 12PM, Chair: Raghav Kaushik (Microsoft), St Germaine

HANDS: A Heuristically Arranged Non-Backup In-line Deduplication System
Avani Wildani, Ethan L. Miller (University of California Santa Cruz), Ohad Rodeh (IBM Almaden Research Center)
Deduplicating in-line data on primary storage is hampered by the disk bottleneck problem, an issue which results from the need to keep an index mapping portions of data to hash values in memory in order to detect duplicate data without paying the performance penalty of disk paging. The index size is proportional to the volume of unique data, so placing the entire index into RAM is not cost effective with a deduplication ratio below 45%. HANDS reduces the amount of in-memory index storage required by up to 99% while still achieving between 30% and 90% of the deduplication a full memory-resident index provides, making primary deduplication cost effective in workloads with deduplication rates as low as 8%. HANDS is a framework that dynamically pre-fetches fingerprints from disk into a memory cache according to working sets statistically derived from access patterns. We use simple neighborhood grouping as our statistical technique to demonstrate the effectiveness of our approach. HANDS is modular and requires only spatio-temporal data, making it suitable for a wide range of storage systems without the need to modify host file systems.

Holistic Data Cleaning: Putting Violations Into Context
Xu Chu (University of Waterloo), Ihab Ilyas, Paolo Papotti (Qatar Computing Research Institute)
Data cleaning is an important problem, and data quality rules are the most promising way to face it in a declarative manner. Previous work has focused on specific formalisms, such as functional dependencies (FDs), conditional functional dependencies (CFDs), and matching dependencies (MDs), and those have always been studied in isolation. Moreover, such techniques are usually applied in a pipeline or interleaved. In this work we tackle the problem in a novel, unified framework. First, we let users specify quality rules using denial constraints with ad-hoc predicates. This language subsumes existing formalisms and can express rules involving numerical values, with predicates such as “greater than” and “less than”. More importantly, we exploit the interaction of the heterogeneous constraints by encoding them in a conflict hypergraph. This holistic view of the conflicts is the starting point for a novel definition of “repair context”, which allows us to automatically compute repairs of better quality than previous approaches in the literature. Experimental results on real datasets show that the holistic approach outperforms previous algorithms in terms of quality and efficiency of the repair.
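A small sketch of what a denial constraint with a numerical predicate looks like in practice, checking one such rule over a toy relation. The salary/tax rule is a common textbook example of a denial constraint, not necessarily one from the paper:

```python
from itertools import permutations

# Denial constraint: no two tuples t1, t2 may satisfy
#   t1.salary > t2.salary AND t1.tax < t2.tax
# (a higher earner may not pay less tax).
def violations(rows):
    return [(t1, t2) for t1, t2 in permutations(rows, 2)
            if t1["salary"] > t2["salary"] and t1["tax"] < t2["tax"]]

rows = [{"name": "a", "salary": 90, "tax": 10},
        {"name": "b", "salary": 50, "tax": 12}]
print([(t1["name"], t2["name"]) for t1, t2 in violations(rows)])  # [('a', 'b')]
```

Each violating tuple pair becomes a (hyper)edge in the conflict hypergraph the abstract mentions, so that repairs can be chosen with all interacting rules in view.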
Inferring Data Currency and Consistency for Conflict Resolution
Wenfei Fan (The University of Edinburgh / Beihang University), Floris Geerts (University of Antwerp), Nan Tang (Qatar Computing Research Institute), Wenyuan Yu (The University of Edinburgh)
This paper introduces a new approach for conflict resolution: given a set of tuples pertaining to the same entity, it identifies a single tuple in which each attribute has the latest and consistent value in the set. This problem is important in data integration, data cleaning and query answering. It is, however, challenging since in practice, reliable timestamps are often absent, among other things. We propose a model for conflict resolution, by specifying data currency in terms of partial currency orders and currency constraints, and by enforcing data consistency with constant conditional functional dependencies. We show that identifying data currency orders helps us repair inconsistent data, and vice versa. We investigate a number of fundamental problems associated with conflict resolution, and establish their complexity. In addition, we introduce a framework and develop algorithms for conflict resolution, by integrating data currency and consistency inferences into a single process, and by interacting with users. We experimentally verify the accuracy and efficiency of our methods using real-life and synthetic data.

Research 14: Social Media I, 10:30 - 12PM, Chair: Kevin Chang (University of Illinois at Urbana-Champaign), Bastille 1

LSII: An Indexing Structure for Exact Real-Time Search on Microblogs
Lingkun Wu (Nanyang Technological University / A*STAR Singapore), Wenqing Lin, Xiaokui Xiao (Nanyang Technological University), Yabo Xu (Sun Yat-Sen University)
Indexing microblogs for real-time search is challenging given the efficiency issue caused by the tremendous speed at which new microblogs are created by users. Existing approaches address this efficiency issue at the cost of query accuracy, as they either (i) exclude a significant portion of microblogs from the index to reduce the update cost or (ii) rank microblogs mostly by their timestamps (without sufficient consideration of their relevance to the queries) to enable append-only index insertion. As a consequence, the search results returned by the existing approaches do not satisfy users who demand timely and high-quality search results. To remedy this deficiency, we propose the Log-Structured Inverted Indices (LSII), a structure for exact real-time search on microblogs. The core of LSII is a sequence of inverted indices with exponentially increasing sizes, such that new microblogs are (i) first inserted into the smallest index and (ii) later merged with the larger indices in a batch manner. The batch insertion mechanism ensures a small amortized update cost for each new microblog, without significantly degrading query performance. We present a comprehensive study of LSII, exploring various design options to strike a good balance between query and update performance. In addition, we propose extensions of LSII to support personalized search and to exploit multi-threading for performance improvement. Extensive experiments on real data demonstrate the efficiency of LSII.
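A minimal sketch of the log-structured insertion pattern the abstract describes, with a sequence of indices of exponentially increasing capacity: new postings go into level 0, and a full level is merged into the next one. This mirrors the classic LSM-tree discipline; the per-level structure here is just a term-to-document-id dictionary for illustration:

```python
from collections import defaultdict

class LSIndex:
    def __init__(self, base_capacity=4, fanout=2):
        self.levels = [defaultdict(list)]   # level i holds <= cap * fanout**i docs
        self.counts = [0]
        self.base, self.fanout = base_capacity, fanout

    def insert(self, doc_id, terms):
        for t in terms:
            self.levels[0][t].append(doc_id)
        self.counts[0] += 1
        i = 0
        while self.counts[i] > self.base * self.fanout ** i:  # level full: merge down
            if i + 1 == len(self.levels):
                self.levels.append(defaultdict(list))
                self.counts.append(0)
            for t, ids in self.levels[i].items():
                self.levels[i + 1][t].extend(ids)
            self.counts[i + 1] += self.counts[i]
            self.levels[i] = defaultdict(list)
            self.counts[i] = 0
            i += 1

    def search(self, term):                 # a query consults every level
        return [d for lvl in self.levels for d in lvl.get(term, [])]

idx = LSIndex()
for d, text in enumerate(["icde brisbane", "icde 2013", "data engineering"]):
    idx.insert(d, text.split())
print(idx.search("icde"))  # [0, 1]
```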
Utilizing Users’ Tipping Points in E-Commerce Recommender Systems
Kailun Hu, Wynne Hsu, Mong Li Lee (National University of Singapore)
Existing recommendation algorithms assume that users make their purchase decisions solely based on individual preferences, without regard to the purchase behavior of other users. Yet extensive studies have shown that there are two types of users: innovators and imitators. Innovators tend to make purchase decisions based solely on their own preferences, whereas imitators’ purchase decisions are often influenced by social pressure from other users. In this paper, we propose a framework that seamlessly incorporates the influence of social pressure into existing recommendation algorithms. We utilize the Bass model to classify each user as either an innovator or an imitator according to his/her previous purchase behavior. In addition, we introduce the concept of the pressure point of a user to capture the user’s reaction to varying degrees of social pressure when making a purchase decision. We then refine two widely adopted recommendation algorithms to incorporate the effect of social pressure in relation to the user’s pressure point. Experimental results on a real-world dataset obtained from an e-commerce website show that the proposed approach outperforms existing algorithms.
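For reference, the Bass model the abstract relies on describes adoption over time with an innovation coefficient p (purchases made independently of others) and an imitation coefficient q (purchases driven by social pressure); a user whose behavior is explained mostly by p acts as an innovator, mostly by q as an imitator. In its standard form, with F(t) the cumulative fraction of adopters and f(t) = dF/dt (how the paper fits p and q per user is not detailed in the abstract):

```latex
\frac{f(t)}{1 - F(t)} = p + q\,F(t),
\qquad
F(t) = \frac{1 - e^{-(p+q)t}}{1 + (q/p)\,e^{-(p+q)t}}
```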
Presenting Diverse Location Views with Real-time Near-duplicate Photo Elimination
Jiajun Liu, Zi Huang (The University of Queensland), Hong Cheng (The Chinese University of Hong Kong), Yueguo Chen (Renmin University of China), Heng Tao Shen (The University of Queensland), Yanchun Zhang (Victoria University)
Supported by the technical advances and commercial success of GPS-enabled mobile devices, geo-tagged photos have drawn considerable attention in the research community. The explosive growth of geo-tagged photos enables many large-scale applications, such as location-based photo browsing, landmark recognition, etc. Meanwhile, as the number of geo-tagged photos continues to climb, new challenges are brought to various applications. The existence of massive near-duplicate geo-tagged photos jeopardizes effective presentation for the above applications. A new dimension in the search and presentation of geo-tagged photos is urgently needed. In this paper, we devise a location visualization framework to efficiently retrieve and present diverse views captured within a local proximity. Novel photos, in terms of capture locations and visual content, are identified and returned in response to a query location for diverse visualization. For real-time response and good scalability, a new hybrid index structure which integrates an R-tree and a geographic grid is proposed to quickly identify the Maximal Near-duplicate Photo Groups (MNPG) in the query proximity. The most novel photos from different groups are then returned to generate diverse views of the location. Extensive experiments on synthetic and real-life photo datasets prove the novelty and efficiency of our methods.

Research 15: Data Trust, 10:30 - 12PM, Chair: Stefano Paraboschi (University of Bergamo), Bastille 2

Publicly Verifiable Grouped Aggregation Queries on Outsourced Data Streams
Suman Nath, Ramarathnam Venkatesan (Microsoft Research)
Outsourcing data streams and desired computations to a third party such as the cloud is a desirable option for many companies. However, data outsourcing and remote computation intrinsically raise issues of trust, making it crucial to verify results returned by third parties. In this context, we propose a novel solution to verify outsourced grouped aggregation queries (e.g., histogram or SQL group-by queries) that are common in many business applications. We consider a setting where a data owner employs an untrusted remote server to run continuous grouped aggregation queries on a data stream it forwards to the server. Untrusted clients then query the server for results and efficiently verify the correctness of the results by using a small and easy-to-compute signature provided by the data owner. Our work complements previous works on authenticating remote computation of selection and aggregation queries. The most important aspect of our solution is that it is publicly verifiable: unlike most prior works, we support untrusted clients (who can collude with other clients or with the server). Experimental results on real and synthetic data show that our solution is practical and efficient.

Trustworthy Data from Untrusted Databases
Rohit Jain, Sunil Prabhakar (Purdue University)
Ensuring the trustworthiness of data retrieved from a database is of utmost importance to users. The correctness of data stored in a database is defined by the faithful execution of only valid (authorized) transactions. In this paper we address the question of whether it is necessary to trust a database server in order to trust the data retrieved from it. The lack of trust arises naturally if the database server is owned by a third party, as in the case of cloud computing. It also arises if the server may have been compromised, or if there is a malicious insider. In particular, we reduce the level of trust necessary in order to establish the authenticity and integrity of data at an untrusted server. Earlier work on this problem is limited to situations where there are no updates to the database, or where all updates are authorized and vetted by a central trusted entity. This is an unreasonable assumption for a truly dynamic database, as would be expected in many business applications, where multiple clients can update data without having to check with a central server that approves their changes. We identify the problem of ensuring the trustworthiness of data at an untrusted server in the presence of transactional updates that run directly on the database, and we develop the first solutions to this problem. Our solutions also provide indemnity for an honest server and assured provenance for all updates to the data. We implement our solution in a prototype system built on top of Oracle with no modifications to the database internals. We also provide an empirical evaluation of the proposed solutions and establish their feasibility.
On the Relative Trust between Inconsistent Data and Inaccurate Constraints
George Beskales, Ihab Ilyas (Qatar Computing Research Institute), Lukasz Golab, Artur Galiullin (University of Waterloo)
Functional dependencies (FDs) specify the intended data semantics, while violations of FDs indicate deviation from these semantics. In this paper, we study a data cleaning problem in which the FDs may not be completely correct, e.g., due to data evolution or incomplete knowledge of the data semantics. We argue that the notion of relative trust is a crucial aspect of this problem: if the FDs are outdated, we should modify them to fit the data, but if we suspect that there are problems with the data, we should modify the data to fit the FDs. In practice, it is usually unclear how much to trust the data versus the FDs. To address this problem, we propose an algorithm for generating non-redundant solutions (i.e., simultaneous modifications of the data and the FDs) corresponding to various levels of relative trust. This can help users determine the best way to modify their data and/or FDs to achieve consistency.

Research 16: Data on the Cloud, 10:30 - 12PM, Chair: Karl Aberer (EPFL), Concorde

Catch the Wind: Graph Workload Balancing on Cloud
Zechao Shang, Jeffrey Xu Yu (The Chinese University of Hong Kong)
Graph partitioning is a key issue in graph database processing systems for achieving high efficiency on the Cloud. However, balanced graph partitioning is itself difficult because it is known to be NP-complete. In addition, a static graph partitioning cannot keep all graph algorithms efficient in parallel on the Cloud over time, because the balanced workloads in different iterations of different graph algorithms may all differ. In this paper, we investigate graph behaviors by exploring changes of the working window (we call it wind), where a working window is the set of active vertices that a graph algorithm really needs to access in parallel computing. We investigated nine classic graph algorithms using real datasets, and we propose simple yet effective policies that can achieve both high graph workload balancing and efficient partitioning on the Cloud.

EAGRE: Towards Scalable I/O Efficient SPARQL Query Evaluation on the Cloud
Xiaofei Zhang, Lei Chen, Yongxin Tong (Hong Kong University of Science and Technology), Min Wang (HP Labs China)
To benefit from the Cloud platform's unlimited resources, effective management and query evaluation over huge volumes of RDF data in a scalable manner has attracted intensive research effort. Progress has been made on evaluating SPARQL queries with either high-level declarative programming languages, like Pig and Sward, or simple MapReduce jobs, both of which tend to answer the query with multiple joins. However, due to the simplicity of Cloud storage and the coarse organization of RDF data in existing solutions, multiple join operations bring significant I/O traffic that severely degrades system performance. In this work, we first propose EAGRE, an Entity-Aware Graph compREssion technique to form a new representation of RDF data on Cloud platforms. Then, based on a novel cost model, we propose an I/O-efficient strategy to evaluate SPARQL queries as quickly as possible, especially queries with solution modifiers specified, e.g., PROJECTION, ORDER BY, etc. We implement a prototype system and conduct extensive experiments over both real and synthetic data sets on an in-house cluster. The experimental results show that our solution can achieve over an order of magnitude in time savings for SPARQL query evaluation compared to state-of-the-art MapReduce-based solutions.
C-Cube: Elastic Continuous Clustering in the Cloud
Zhenjie Zhang (Advanced Digital Sciences Center), Hu Shu, Zhihong Chong (Southeast University, China), Hua Lu (Aalborg University), Yin Yang (Advanced Digital Sciences Center)
Continuous clustering analysis over a data stream reports clustering results incrementally as updates arrive. Such analysis has a wide spectrum of applications, including traffic monitoring and topic discovery on microblogs. A common characteristic of streaming applications is that the amount of workload fluctuates, often in an unpredictable manner. On the other hand, most existing solutions for continuous clustering assume either a central server or a distributed setting with a fixed number of dedicated servers. In other words, they are not elastic, meaning that they cannot dynamically adapt the amount of computational resources to the fluctuating workload. Consequently, they incur considerable waste of resources, as the servers are under-utilized when the amount of workload is low. This paper proposes C-Cube, the first elastic approach to continuous streaming clustering. Similar to popular cloud-based paradigms such as MapReduce, C-Cube routes each new record to a processing unit, e.g., a virtual machine, based on its hash value. Each processing unit performs the required computations and sends its results to a lightweight aggregator. This design enables dynamically adding/removing processing units, as well as replacing faulty ones and re-running their tasks. In addition to elasticity, C-Cube is also effective (in that it provides quality guarantees on the clustering results), efficient (it minimizes the computational workload at all times), and generally applicable to a large class of clustering criteria. We implemented C-Cube in a real system based on Twitter Storm and evaluated it using real and synthetic datasets. Extensive experimental results confirm our performance claims.

Seminar 5: Sorting in Space: Multidimensional, Spatial, and Metric Data Structures for Applications in Spatial Databases, Geographic Information Systems (GIS), and Location-based Services, 10:30 - 12PM, Odeon
Hanan Samet (University of Maryland)
Techniques for representing multidimensional, spatial, and metric data for applications in spatial databases, geographic information systems (GIS), and location-based services are reviewed. This includes both geometric and textual representations of spatial data.

SAP Business Lunch & ICDE Award Presentations, 12 - 2PM, Chair: Rao Kotagiri (University of Melbourne), Ballroom Le Grand

Keynote 4: 10 Year Most Influential Papers, 2 - 3PM, Chair: Rao Kotagiri (University of Melbourne), Ballroom Le Grand

Schema Mediation in Peer Data Management Systems [ICDE 2003]
Alon Y. Halevy, Zachary G. Ives, Dan Suciu, Igor Tatarinov (University of Washington)
Intuitively, data management and data integration tools should be well-suited for exchanging information in a semantically meaningful way. Unfortunately, they suffer from two significant problems: they typically require a comprehensive schema design before they can be used to store or share information, and they are difficult to extend because schema evolution is heavyweight and may break backwards compatibility. As a result, many small-scale data sharing tasks are more easily facilitated by non-database-oriented tools that have little support for semantics. The goal of the peer data management system (PDMS) is to address this need: we propose the use of a decentralized, easily extensible data management architecture in which any user can contribute new data, schema information, or even mappings between other peers’ schemas. PDMSs represent a natural step beyond data integration systems, replacing their single logical schema with an interlinked collection of semantic mappings between peers’ individual schemas. This paper considers the problem of schema mediation in a PDMS. Our first contribution is a flexible language for mediating between peer schemas, which extends known data integration formalisms to our more complex architecture. We precisely characterize the complexity of query answering for our language. Next, we describe a reformulation algorithm for our language that generalizes both global-as-view and local-as-view query answering algorithms. Finally, we describe several methods for optimizing the reformulation algorithm, and an initial set of experiments studying its performance.
Similarity Flooding: A Versatile Graph Matching Algorithm and its Application to Schema Matching [ICDE 2002]
Sergey Melnik, Hector Garcia-Molina (Stanford University), Erhard Rahm (University of Leipzig)
Matching elements of two data schemas or two data instances plays a key role in data warehousing, e-business, and even biochemical applications. In this paper we present a matching algorithm based on a fixpoint computation that is usable across different scenarios. The algorithm takes two graphs (schemas, catalogs, or other data structures) as input, and produces as output a mapping between corresponding nodes of the graphs. Depending on the matching goal, a subset of the mapping is chosen using filters. After our algorithm runs, we expect a human to check and, if necessary, adjust the results. In fact, we evaluate the ‘accuracy’ of the algorithm by counting the number of needed adjustments. We conducted a user study in which our accuracy metric was used to estimate the labor savings that users could obtain by utilizing our algorithm to obtain an initial matching. Finally, we illustrate how our matching algorithm is deployed as one of several high-level operators in an implemented testbed for managing information models and mappings.

Research 17: Similarity Ranking, 3:30 - 5PM, Chair: Ihab Ilyas (Qatar Computing Research Institute), St Germaine

Efficient Search Algorithm for SimRank
Yasuhiro Fujiwara (NTT Software Innovation Center), Makoto Nakatsuji (NTT Service Evolution Laboratories), Hiroaki Shiokawa, Makoto Onizuka (NTT Software Innovation Center)
Graphs are a fundamental data structure and have been employed to model objects as well as their relationships. The similarity of objects on the web (e.g., webpages, photos, music, micro-blogs, and social networking service users) is the key to identifying relevant objects in many recent applications. SimRank, proposed by Jeh and Widom, provides a good similarity score and has been successfully used in many applications such as web spam detection, collaborative tagging analysis, link prediction, and so on. SimRank computes similarities iteratively, and it needs O(N^4 T) time and O(N^2) space for similarity computation, where N and T are the numbers of nodes and iterations, respectively. Unfortunately, this iterative approach is computationally expensive. The goal of this work is to process top-k and range searches efficiently for a given node. Our solution, SimMat, is based on two ideas: (1) it computes the approximate similarity of a selected node pair efficiently in a non-iterative style based on the Sylvester equation, and (2) it prunes unnecessary approximate similarity computations when searching for the high-similarity nodes by exploiting estimations based on the Cauchy-Schwarz inequality. These two ideas reduce the time and space complexities of the proposed approach to O(Nn), where n is the target rank of the low-rank approximation (n << N in practice). Our experiments show that our approach is much faster, by several orders of magnitude, than previous approaches in finding the high-similarity nodes.
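For reference, both the preceding and the following paper compute (or approximate) the classic SimRank score of Jeh and Widom, defined for nodes a ≠ b with in-neighbor sets I(a), I(b) by the recurrence below, with s(a, a) = 1 and s(a, b) = 0 when either node has no in-neighbors; C in (0, 1) is the damping factor mentioned in the next abstract:

```latex
s(a,b) \;=\; \frac{C}{|I(a)|\,|I(b)|}
\sum_{i=1}^{|I(a)|} \sum_{j=1}^{|I(b)|} s\bigl(I_i(a),\, I_j(b)\bigr)
```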
Towards Efficient SimRank Computation on Large Networks
Weiren Yu (University of New South Wales / NICTA), Xuemin Lin (East China Normal University / University of New South Wales), Wenjie Zhang (University of New South Wales)
SimRank has been a powerful model for assessing the similarity of pairs of vertices in a graph. It is based on the concept that two vertices are similar if they are referenced by similar vertices. Due to its self-referentiality, fast SimRank computation on large graphs poses significant challenges. The state-of-the-art work exploits partial sums memorization for computing SimRank in O(Kmn) time on a graph with n vertices and m edges, where K is the number of iterations. Partial sums memorizing can reduce repeated calculations by caching part of the similarity summations for later reuse. However, we observe that computations among different partial sums may have redundancy. Besides, for a desired accuracy ε, the existing SimRank model requires K = ⌈log_C ε⌉ iterations, where C is a damping factor. Such a geometric rate of convergence is nevertheless slow in practice if high accuracy is desired. In this paper, we address these gaps. (1) We propose an adaptive clustering strategy to eliminate partial sums redundancy (i.e., duplicated computations occurring in partial sums), and devise an efficient algorithm for speeding up the computation of SimRank to O(Kd'n^2) time, where d' is typically much smaller than the average in-degree of a graph. (2) We also present a new notion of SimRank that is based on a differential equation and can be represented as an exponential sum of transition matrices, as opposed to the geometric sum of the conventional counterpart. This leads to a further speedup in the convergence rate of SimRank iterations. (3) Using real and synthetic data, we empirically verify that our approach of partial sums sharing outperforms the best known algorithm by up to one order of magnitude, and that our revised notion of SimRank further achieves a 5X speedup on large graphs while fairly preserving the relative order of the original SimRank scores.
RoundTripRank: Graph-based Proximity with Importance and Specificity
Yuan Fang, Kevin Chen-Chuan Chang (University of Illinois at Urbana-Champaign / Advanced Digital Sciences Center), Hady W. Lauw (Singapore Management University)
Graph-based proximity has many applications with different ranking needs. However, most previous works only stress the sense of importance by finding "popular" results for a query. Often, important results are overly general without being well-tailored to the query, lacking a sense of specificity, a notion which has only emerged recently. Even then, the two senses have been treated independently and only combined empirically. In this paper, we generalize the well-studied importance-based random walk into a round trip and develop RoundTripRank, seamlessly integrating specificity and importance in one coherent process. We also recognize the need for a flexible trade-off between the two senses, and we further develop RoundTripRank+ based on a scheme of hybrid random surfers. For efficient computation, we start with a basic model that decomposes RoundTripRank into smaller units. For each unit, we apply a novel two-stage bounds-updating framework, enabling an online top-K algorithm, 2SBound. Finally, our experiments show that RoundTripRank and RoundTripRank+ are robust over various ranking tasks, and that 2SBound enables scalable online processing.

Research 18: Spatial Databases, 3:30 - 5PM, Chair: Mourad Ouzzani (Qatar Computing Research Institute), Bastille 1

Finding Distance-Preserving Subgraphs in Large Road Networks
Da Yan (Hong Kong University of Science and Technology), James Cheng (The Chinese University of Hong Kong), Wilfred Ng, Steven Liu (Hong Kong University of Science and Technology)
Given two sets of points, S and T, in a road network, G, a distance-preserving subgraph (DPS) query returns a subgraph of G that preserves the shortest path from any point in S to any point in T. DPS queries are important in many real-world applications, such as route recommendation systems, logistics planning, and all kinds of shortest-path-related applications that run on resource-limited mobile devices. In this paper, we study efficient algorithms for processing DPS queries in large road networks. Four algorithms are proposed with different tradeoffs in terms of DPS quality and query processing time; the best one is a graph-partitioning based index, called RoadPart, which finds a high-quality DPS with short response time. Extensive experiments on large road networks demonstrate the merits of our algorithms and verify the efficiency of RoadPart for finding a high-quality DPS.
As a result, we propose three approaches that incrementally consider relevant obstacles in order to determine the visibility of a target object from a given set of locations. These approaches differ in the order of obstacle retrieval, namely: query-centric distance-based, query-centric visible-region-based, and target-centric distance-based approaches. We have conducted an extensive experimental study on real 2D and 3D datasets to demonstrate the efficiency and effectiveness of our solutions.

Memory-Efficient Algorithms for Spatial Network Queries
Sarana Nutanong, Hanan Samet (University of Maryland)
Incrementally finding the k nearest neighbors (kNN) in a spatial network is an important problem in location-based services. One method (INE) simply applies Dijkstra's algorithm. Another method (IER) computes the k nearest neighbors using Euclidean distance, computes their corresponding network distances, and then incrementally finds the next nearest neighbors in order of increasing Euclidean distance until finding one whose Euclidean distance is greater than the network distance of the current kth nearest neighbor. The LBC method improves on INE by using a Euclidean heuristic estimator to avoid visiting nodes that cannot possibly lead to the k nearest neighbors, and improves on IER by performing multiple instances of heuristic search on candidate objects around the query point, thereby avoiding repeated visits to nodes that appear on the shortest paths to different members of the k nearest neighbors. LBC's drawback is that maintaining multiple instances of heuristic search (called wavefronts) requires k priority queues, and the queue operations required to maintain them incur a high in-memory processing cost. A method (SWH) is proposed that utilizes a novel heuristic function which considers the objects surrounding the query point together as a single unit, instead of one destination at a time as in LBC, thereby eliminating the need for multiple wavefronts and requiring just one priority queue. This results in a significant reduction in the in-memory processing cost while retaining the same reduced cost of access to the spatial network as LBC. SWH is also extended to support the incremental distance semi-join (IDSJ) query, a multiple-query-point generalization of the kNN query. In addition, SWH is shown to support landmark-based heuristic functions, thereby enabling it to be applied to non-spatial networks/graphs such as social networks. Experimental comparisons for kNN queries show that SWH is 2.5 times faster than INE, the best single-wavefront method, and 3.5 times faster than LBC, the best existing heuristic search method. For IDSJ queries, SWH-IDSJ is 5 times faster than INE-IDSJ, and 4 times faster than LBC-IDSJ.

Research 19: Social Media II
3:30 - 5PM
Chair: Tao Cheng (Microsoft)
Location: Bastille 2

A Unified Model for Stable and Temporal Topic Detection from Social Media Data
Hongzhi Yin, Bin Cui (Peking University)
Hua Lu (Aalborg University)
Yuxin Huang, Junjie Yao (Peking University)
Web 2.0 users generate and spread huge amounts of messages in online social media. Such user-generated content is a mixture of temporal topics (e.g., breaking events) and stable topics (e.g., user interests).
Due to their different natures, it is important and useful to distinguish temporal topics from stable topics in social media. However, such a discrimination is very challenging because the user-generated texts in social media are very short and thus lack the linguistic features needed for precise analysis using traditional approaches. In this paper, we propose a novel solution to detect both stable and temporal topics simultaneously from social media data. Specifically, a unified user-temporal mixture model is proposed to distinguish temporal topics from stable topics. To improve this model's performance, we design a regularization framework that exploits prior spatial information in a social network, as well as a burst-weighted smoothing scheme that exploits prior temporal information in the time dimension. We conduct extensive experiments to evaluate our proposals on two real data sets obtained from Del.icio.us and Twitter. The experimental results verify that our mixture model is able to distinguish temporal topics from stable topics in a single detection process. Our mixture model, enhanced with the spatial regularization and the burst-weighted smoothing scheme, significantly outperforms competitor approaches in terms of topic detection accuracy and discrimination between stable and temporal topics.

Crowdsourced Enumeration Queries
Beth Trushkowsky, Tim Kraska, Michael Franklin, Purnamrita Sarkar (University of California Berkeley)
Hybrid human/computer database systems promise to greatly expand the usefulness of query processing by incorporating the crowd for data gathering and other tasks. Such systems raise many implementation questions. Perhaps the most fundamental one is that the closed-world assumption underlying relational query semantics does not hold in such systems. As a consequence, the meaning of even simple queries can be called into question. Furthermore, query progress monitoring becomes difficult due to non-uniformities in the arrival of crowdsourced data and peculiarities of how people work in crowdsourcing systems. To address these issues, we develop statistical tools that enable users and systems developers to reason about query completeness. These tools can also help drive query execution and crowdsourcing strategies. We evaluate our techniques using experiments on a popular crowdsourcing platform.

On Incentive-based Tagging
Xuan S. Yang, Reynold Cheng, Luyi Mo, Ben Kao, David W. Cheung (The University of Hong Kong)
A social tagging system, such as del.icio.us or Flickr, allows users to annotate resources (e.g., web pages and photos) with text descriptions called tags. Tags have proven to be invaluable information for searching, mining, and recommending resources. In practice, however, not all resources receive the same attention from users. As a result, while some highly popular resources are over-tagged, most resources are under-tagged. Incomplete tagging of resources severely affects the effectiveness of all tag-based techniques and applications. We address an interesting question: if users are paid to tag specific resources, how can we allocate incentives to resources in a crowd-sourcing environment so as to maximize the tagging quality of the resources? We approach this question by observing that the tagging quality of a resource becomes stable after it has been tagged a sufficient number of times.
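To make the stability intuition concrete, here is a small Python sketch; it is our illustration only, not the paper's TQ/TS definitions, and treats a resource's tag description as stable when the k most recent tags barely change its tag distribution:

from collections import Counter

def tag_distribution(tags):
    # Normalized frequency of each tag in the list.
    counts = Counter(tags)
    total = len(tags)
    return {t: c / total for t, c in counts.items()}

def stability(tags, k=10):
    # Cosine similarity between the tag distribution with and without the
    # k most recent tags; near 1.0 means extra tags barely change it.
    old, new = tag_distribution(tags[:-k]), tag_distribution(tags)
    keys = set(old) | set(new)
    dot = sum(old.get(t, 0) * new.get(t, 0) for t in keys)
    norm = (sum(v * v for v in old.values()) ** 0.5 *
            sum(v * v for v in new.values()) ** 0.5)
    return dot / norm if norm else 0.0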
We formalize the concepts of tagging quality (TQ) and tagging stability (TS) for measuring the quality of a resource's tag description. We propose a theoretically optimal algorithm given a fixed "budget" (i.e., the amount of money paid for tagging resources). This solution decides the amount of reward that should be invested in each resource in order to maximize tagging stability. We further propose a few simple, practical, and efficient incentive allocation strategies. On a dataset from del.icio.us, our best strategy provides resources with a close-to-optimal gain in tagging stability.

Research 20: Trees and XML
3:30 - 5PM
Chair: Chengfei Liu (Swinburne University of Technology)
Location: Concorde

Ontology-based Subgraph Querying
Yinghui Wu, Shengqi Yang, Xifeng Yan (University of California Santa Barbara)
Subgraph querying has been applied in a variety of emerging applications. Traditional subgraph querying based on subgraph isomorphism requires identical label matching, which is often too restrictive to capture matches that are semantically close to the query graphs. This paper extends subgraph querying to identify semantically related matches by leveraging ontology information. (1) We introduce ontology-based subgraph querying, which revises subgraph isomorphism by mapping a query to semantically related subgraphs in terms of a given ontology graph. We introduce a metric to measure the closeness of the matches. Based on the metric, we further introduce an optimization problem to find the top-K closest matches. (2) We provide a filtering-and-verification framework to identify (top-K) matches for ontology-based subgraph queries. The framework efficiently extracts a small subgraph of the data graph from an ontology index, and then computes the matches by accessing only the extracted subgraph. (3) In addition, we show that the ontology index can be efficiently updated upon changes to the data graphs, enabling the framework to cope with dynamic data graphs. (4) We experimentally verify the effectiveness and efficiency of our framework using both synthetic and real-life graphs, comparing with traditional subgraph querying methods.

Stratification Driven Placement of Complex Data: A Framework for Distributed Data Analytics
Ye Wang, Srinivasan Parthasarathy, P Sadayappan (The Ohio State University)
With the increasing popularity of XML data stores, social networks, and Web 2.0 and 3.0 applications, complex data formats such as trees and graphs are becoming ubiquitous. Managing and processing such large and complex data stores on modern computational eco-systems to realize actionable information efficiently is an important challenge. It is our hypothesis that a critical element at the heart of this challenge relates to the placement, storage, and access of such tera- and peta-scale data. In this work we seek to develop a generic distributed framework to ease the burden on the programmer, and propose an agile and intelligent placement service layer as a flexible yet unified means to address this challenge. Central to our framework is the notion of stratification, which first attempts to identify groups of data items that are structurally (or semantically) related. Subsequently, strata are partitioned within this eco-system according to the needs of the application to maximize locality, balance load, or minimize data skew. Results on several real-world applications confirm the efficacy and efficiency of our approach.
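As a toy sketch of the placement idea, assuming the strata have already been identified, the following Python assigns whole strata to servers greedily to balance load while preserving locality; the actual service layer also weighs application needs and data skew, which this sketch omits:

def place_strata(strata, num_servers):
    # Greedy placement: assign each stratum (a group of related items)
    # to the currently least-loaded server, keeping strata intact so
    # that related items stay co-located.
    loads = [0] * num_servers
    assignment = {}
    for stratum_id, items in sorted(strata.items(), key=lambda kv: -len(kv[1])):
        server = loads.index(min(loads))
        assignment[stratum_id] = server
        loads[server] += len(items)
    return assignment

strata = {"s1": list(range(40)), "s2": list(range(25)), "s3": list(range(30))}
print(place_strata(strata, num_servers=2))  # {'s1': 0, 's3': 1, 's2': 1}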
Optimizing Approximations of DNF Query Lineage in Probabilistic XML
Asma Souihli, Pierre Senellart (Télécom ParisTech, CNRS LTCI)
Probabilistic XML is a probabilistic model for uncertain tree-structured data, with applications to data integration, information extraction, and uncertain version control. In this work we explore efficient algorithms for evaluating tree-pattern queries with joins over probabilistic XML or, more specifically, for listing the answers to a query along with their computed or approximated probability. The approach relies on, first, producing the lineage of the query by evaluating it over the probabilistic XML document and, second, looking for an optimal strategy to compute the probability of the lineage formula. The latter part relies on a query-optimizer-like approach: exploring different evaluation plans for different parts of the formula and estimating the cost of each plan, using a cost model for the various evaluation algorithms. We demonstrate the efficiency of this approach on datasets used in previous research on probabilistic XML querying, as well as on synthetic data. We also compare the performance of our query engine with EvalDP, Trio, and MayBMS/SPROUT.

Seminar 6: Triples in the Clouds
3:30 - 5PM
Location: Odeon
Zoi Kaoudi, Ioana Manolescu (Inria Saclay - Île de France / Université Paris-Sud)
The W3C's Resource Description Framework (RDF, in short) is a promising candidate that may deliver many of the original semi-structured data promises: flexible structure, optional schema, and rich, flexible URIs as a basis for information sharing. Moreover, RDF is uniquely positioned to benefit from the efforts of scientific communities studying databases, knowledge representation, and Web technologies. Many RDF data collections are being published, ranging from scientific data to general-purpose ontologies to open government data, in particular within the Linked Data movement. Managing such large volumes of RDF data is challenging due to the sheer size, the heterogeneity, and the further complexity brought by RDF reasoning. To tackle the size challenge, distributed storage architectures are required. Cloud computing is an emerging paradigm massively adopted in many applications for the scalability, fault tolerance, and elasticity features it provides. This tutorial discusses the problems involved in efficiently handling massive amounts of RDF data in a cloud environment. We provide the necessary background, analyze and classify existing solutions, and discuss open problems and perspectives.

Thursday 11 April

Keynote 3: Hardware Killed the Software Star
9 - 10AM
Chair: Christian Jensen (Aarhus University)
Location: Ballroom 1 & 2
Gustavo Alonso (ETH Zürich)
Abstract: Until relatively recently, the development of data processing applications took place largely ignoring the underlying hardware. Only in niche applications (supercomputing, embedded systems) or in special software (operating systems, database internals, language runtimes) did (some) programmers have to pay attention to the actual hardware where the software would run. In most cases, working atop the abstractions provided by the operating system or by system libraries was good enough. The constant improvements in processor speed did the rest. The new millennium has radically changed the picture.
Driven by multiple needs (e.g., scale, physical constraints, energy limitations, virtualization, business models), hardware architectures are changing at a speed and in ways that current development practices for data processing cannot accommodate. From now on, software will have to be built paying close attention to the underlying hardware and following strict performance engineering principles. In this talk, several aspects of the ongoing hardware revolution and its impact on data processing are analyzed, pointing to the need for new strategies to tackle the challenges ahead.

Bio: Gustavo Alonso is a professor at the Department of Computer Science at ETH Zurich in Switzerland, where he has been since 1995. At ETHZ, he is part of the Systems Group and the Enterprise Computing Center. Gustavo has a degree in electrical engineering from the Madrid Technical University in Spain and an M.S. and Ph.D. in Computer Science from UC Santa Barbara. Before joining ETH, he worked at the IBM Almaden Research Center. Gustavo's research interests encompass almost all aspects of systems, from design to run time. Most of his research these days is related to multi-core architectures, large clusters, FPGAs, and cloud computing, with an emphasis on adapting traditional system software (OS, database, middleware) to these new hardware platforms. Gustavo is a Fellow of the ACM and a Senior Member of the IEEE. He has been awarded the AOSD 2012 Most Influential Paper Award, the VLDB 2010 Ten Year Best Paper Award, and the ICDCS 2009 Best Paper Award for work on Remote Direct Memory Access. He has served on the VLDB Endowment and the ACM/IFIP/IEEE Middleware Steering Committee, as an associate editor of the VLDB Journal, as Chair of EuroSys, and as general chair or PC chair/vice-chair of numerous conferences (VLDB, ICDE, Middleware, BPM, ICDCS, IEEE MDM).

Research 21: Security and Privacy
10:30 - 12PM
Chair: Graham Cormode (AT&T Labs-Research)
Location: St Germaine

Secure Nearest Neighbor Revisited
Bin Yao (Shanghai Jiao Tong University)
Feifei Li (University of Utah)
Xiaokui Xiao (Nanyang Technological University)
In this paper, we investigate the secure nearest neighbor (SNN) problem, in which a client issues an encrypted query point E(q) to a server and asks for an encrypted data point in E(D) that is closest to the query point, without allowing the server to learn the plaintexts of the data or the query (and its result). We show that efficient attacks exist against existing SNN methods, even though they were claimed to be secure in standard security models (such as indistinguishability under chosen plaintext or ciphertext attacks). We also establish a relationship between the SNN problem and the order-preserving encryption (OPE) problem from the cryptography field, and we show that SNN is at least as hard as OPE. Since it is impossible to construct secure OPE schemes in standard security models, our results imply that one cannot expect to find the exact (encrypted) nearest neighbor based only on E(q) and E(D). Given this hardness result, we design new SNN methods that ask the server, given only E(q) and E(D), to return a relevant (encrypted) partition E(G) from E(D) (i.e., G ⊆ D) such that E(G) is guaranteed to contain the answer to the SNN query.
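The partition-based interaction can be sketched in Python as follows. This is a simplified illustration of the client-server flow only: how the server maps E(q) to a partition is the crux of the paper's methods and is abstracted away here as a partition id, and all names are hypothetical:

def server_lookup(encrypted_partitions, partition_id):
    # Server side: sees only ciphertexts; returns the encrypted partition
    # E(G) guaranteed by the index construction to contain the answer.
    return encrypted_partitions[partition_id]

def client_snn(q, partition_id, encrypted_partitions, decrypt):
    # Client side: decrypts the returned partition and finds the exact
    # nearest neighbor locally, so the server never sees plaintexts.
    e_partition = server_lookup(encrypted_partitions, partition_id)
    candidates = [decrypt(e_p) for e_p in e_partition]
    return min(candidates,
               key=lambda p: (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)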
Our methods provide a customizable tradeoff between efficiency and communication cost, and they are as secure as the encryption scheme E used to encrypt the query and the database, where E can be any well-established encryption scheme.

Accurate and Efficient Private Release of Datacubes and Contingency Tables
Grigory Yaroslavtsev (Pennsylvania State University)
Graham Cormode, Cecilia M. Procopiuc, Divesh Srivastava (AT&T Labs - Research)
A central problem in releasing aggregate information about sensitive data is to do so accurately while providing a privacy guarantee on the output. Recent work focuses on the class of linear queries, which include basic counting queries, data cubes, and contingency tables. The goal is to maximize the utility of the output while giving a rigorous privacy guarantee. Most results follow a common template: pick a "strategy" set of linear queries to apply to the data, then use the noisy answers to these queries to reconstruct the queries of interest. This entails either picking a strategy set that is hoped to be good for the queries, or performing a costly search over the space of all possible strategies. However, once the strategy is fixed, its evaluation can be done efficiently using standard linear-algebraic methods. In this paper, we propose a new approach that balances accuracy and efficiency: we show how to optimize the accuracy of a given strategy by answering some strategy queries more accurately than others, based on the target queries. This leads to an efficient optimal noise allocation for many popular strategies, including wavelets, hierarchies, Fourier coefficients, and more. For the important case of marginal queries (equivalently, subsets of the data cube), we show that this strictly improves on previous methods, both analytically and empirically. Our results also extend to ensuring that the returned query answers are consistent with an (unknown) data set, at minimal extra cost in terms of time and noise.

Differentially Private Grids for Geospatial Data
Wahbeh Qardaji, Weining Yang, Ninghui Li (Purdue University)
In this paper, we tackle the problem of constructing a differentially private synopsis for two-dimensional datasets such as geospatial datasets. The current state-of-the-art methods work by performing recursive binary partitioning of the data domains and constructing a hierarchy of partitions. We show that the key challenge in partition-based synopsis methods lies in choosing the right partition granularity to balance the noise error and the non-uniformity error. We study the uniform-grid approach, which applies an equi-width grid of a certain size over the data domain and then issues independent count queries on the grid cells. This method has received no attention in the literature, probably because no good method for choosing a grid size was known. Based on an analysis of the two kinds of errors, we propose a method for choosing the grid size. Experimental results validate our method, and show that this approach performs as well as, and oftentimes better than, the state-of-the-art methods. We further introduce a novel adaptive-grid method, which lays a coarse-grained grid over the dataset and then further partitions each cell according to its noisy count. Both levels of partitions are then used in answering queries over the dataset.
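For intuition, here is a minimal Python sketch of the uniform-grid approach under the standard Laplace mechanism, assuming the grid size m has already been chosen (choosing m well is precisely the paper's contribution, which this sketch does not reproduce):

import random

def dp_uniform_grid(points, m, epsilon, domain=1.0):
    # Lay an m x m equi-width grid over [0, domain)^2 and release each
    # cell count with Laplace(1/epsilon) noise. One point affects exactly
    # one cell, so the sensitivity of the count vector is 1.
    grid = [[0.0] * m for _ in range(m)]
    for x, y in points:
        i = min(int(x / domain * m), m - 1)
        j = min(int(y / domain * m), m - 1)
        grid[i][j] += 1
    # Laplace(b) sampled as the difference of two Exp(1/b) variates.
    lap = lambda b: random.expovariate(1 / b) - random.expovariate(1 / b)
    return [[c + lap(1.0 / epsilon) for c in row] for row in grid]

The adaptive-grid refinement described above would then re-partition each cell according to its noisy count before answering queries.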
The adaptive-grid method reflects the need for finer-granularity partitioning over dense regions and, at the same time, coarser partitioning over sparse regions. Through extensive experiments on real-world datasets, we show that this approach consistently and significantly outperforms the uniform-grid method and other state-of-the-art methods.

Research 22: Randomized Algorithms for Graphs
10:30 - 12PM
Chair: Yinghui Wu (University of California Santa Barbara)
Location: Bastille 1

Faster Random Walks By Rewiring Online Social Networks On-The-Fly
Zhuojie Zhou, Nan Zhang (George Washington University)
Zhiguo Gong (University of Macau)
Gautam Das (University of Texas at Arlington / Qatar Computing Research Institute)
Many online social networks feature restrictive web interfaces that only allow querying a user's local neighborhood. To enable analytics over such an online social network through its restrictive web interface, many recent efforts reuse existing Markov Chain Monte Carlo methods, such as random walks, to sample the social network and support analytics based on the samples. The problem with such an approach, however, is the large number of queries often required (i.e., a long "mixing time") for a random walk to reach the desired (stationary) sampling distribution. In this paper, we consider the novel problem of enabling a faster random walk over online social networks by "rewiring" the social network on the fly. Specifically, we develop the Modified TOpology (MTO) Sampler which, using only information exposed by the restrictive web interface, constructs a "virtual" overlay topology of the social network while performing a random walk, and ensures that the random walk follows the modified overlay topology rather than the original one. We show that MTO-Sampler not only provably enhances the efficiency of sampling, but also achieves significant savings in query cost on real-world online social networks such as Google Plus and Epinions.

Sampling Node Pairs Over Large Graphs
Pinghui Wang (The Chinese University of Hong Kong)
Junzhou Zhao (Xi'an Jiaotong University)
John C.S. Lui (The Chinese University of Hong Kong)
Don Towsley (University of Massachusetts Amherst)
Xiaohong Guan (Xi'an Jiaotong University / Tsinghua University)
Characterizing user-pair relationships is important for applications such as friend recommendation and interest targeting in online social networks (OSNs). Due to the large-scale nature of such networks, it is infeasible to enumerate all user pairs, and so sampling is used. In this paper, we show that characterizing user-pair relationships is a great challenge even for OSN service providers who possess the complete graph topology. The reason is that naively applied sampling techniques (i.e., uniform vertex sampling (UVS) and random walk (RW)) can introduce large biases, in particular when estimating the similarity distribution of user pairs under constraints such as the existence of mutual neighbors, which is important for applications such as identifying network homophily. Estimating statistics of user pairs is even more challenging in the absence of complete topology information, since an unbiased sampling technique such as UVS is usually not allowed, and exploring the OSN graph topology is expensive.
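For node-level (rather than node-pair) statistics, the classical correction for a random walk's degree bias is importance reweighting, sketched below in Python; the paper develops analogous corrections for user pairs, which are substantially harder:

import random

def random_walk(graph, start, steps):
    # Simple random walk on an undirected graph given as an adjacency
    # dict; its stationary probability at node v is deg(v) / 2|E|.
    v = start
    for _ in range(steps):
        v = random.choice(graph[v])
        yield v

def rw_estimate_mean(graph, start, steps, f):
    # Estimate the mean of f over nodes, correcting the walk's bias
    # toward high-degree nodes by weighting each sample with 1/deg(v).
    num = den = 0.0
    for v in random_walk(graph, start, steps):
        w = 1.0 / len(graph[v])
        num += w * f(v)
        den += w
    return num / den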
To address these challenges, we present asymptotically unbiased sampling methods to characterize user-pair properties, based on UVS and RW techniques respectively. We carry out an evaluation of our methods to show their accuracy and efficiency. Finally, we apply our methods to two Chinese OSNs, Douban and Xiami, and discover that significant homophily is present in these two networks.

Link Prediction across Networks by Biased Cross-Network Sampling
Guo-Jun Qi (University of Illinois at Urbana-Champaign)
Charu C. Aggarwal (IBM T.J. Watson Research Center)
Thomas Huang (University of Illinois at Urbana-Champaign)
The problem of link inference has been widely studied in a variety of social networking scenarios. In this problem, we wish to predict future links in a growing network with the use of the existing network structure. However, most of the existing methods work well only if a significant number of links are already available in the network for the inference process. In many scenarios, the existing network may be too sparse, and may have too few links to enable meaningful learning mechanisms. This paucity of linkage information can be challenging for the link inference problem. However, in many cases, other (more densely linked) networks may be available that show similar linkage structure in terms of the underlying attribute information in the nodes. The linkage information in the existing networks can be used in conjunction with the node attribute information in both networks in order to make meaningful link recommendations. Thus, this paper introduces the use of transfer learning methods for performing cross-network link inference. We present experimental results illustrating the effectiveness of the approach.

Research 23: Distributed Data Processing
10:30 - 12PM
Chair: Tyson Condie (Microsoft)
Location: Bastille 2

Interval Indexing and Querying on Key-Value Cloud Stores
George Sfakianakis, Ioannis Patlakas, Nikos Ntarmos, Peter Triantafillou (University of Patras)
Cloud key-value stores are becoming increasingly important. Challenging applications, requiring efficient and scalable access to massive data, arise every day. We focus on supporting interval queries (which are prevalent in several data-intensive applications, such as temporal querying for temporal analytics), for which an efficient solution is lacking. We contribute a compound interval index structure comprised of two tiers: (i) the MRSegmentTree (MRST), a key-value representation of the Segment Tree, and (ii) the Endpoints Index (EPI), a column-family index that stores information about interval endpoints. In addition, our contributions include: (i) algorithms for efficiently constructing and populating our indices using MapReduce jobs, (ii) techniques for efficient and scalable index maintenance, and (iii) algorithms for processing interval queries. We have implemented all algorithms using HBase and Hadoop, and conducted a detailed performance evaluation. We quantify the costs associated with the construction of the indices, and evaluate our query processing algorithms using queries on real data sets. We compare the performance of our approach to two alternatives: the native support for interval queries provided in HBase, and the execution of such queries using the Hive query execution tool. Our results show a significant speedup, far outperforming the state of the art.
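For readers unfamiliar with the segment tree underlying MRST, the following in-memory Python sketch shows the canonical-node decomposition and a stabbing query over half-open integer intervals; the MRST itself serializes such nodes as key-value pairs in HBase, a representation that is a contribution of the paper and is not reproduced here:

class SegmentTreeNode:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi
        self.intervals = []          # intervals covering this whole node
        self.left = self.right = None

def build(lo, hi):
    # Balanced tree over the integer domain [lo, hi) with unit leaves.
    node = SegmentTreeNode(lo, hi)
    if hi - lo > 1:
        mid = (lo + hi) // 2
        node.left, node.right = build(lo, mid), build(mid, hi)
    return node

def insert(node, s, e):
    # Store [s, e) at the O(log n) canonical nodes it fully covers.
    if s <= node.lo and node.hi <= e:
        node.intervals.append((s, e))
    elif node.left:
        if s < node.left.hi:
            insert(node.left, s, e)
        if e > node.right.lo:
            insert(node.right, s, e)

def stab(node, q):
    # All intervals containing point q: collect along the root-to-leaf path.
    out = list(node.intervals)
    if node.left:
        out += stab(node.left if q < node.left.hi else node.right, q)
    return out

root = build(0, 16)
insert(root, 3, 9)
print(stab(root, 5))   # [(3, 9)]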
Robust Distributed Stream Processing
Chuan Lei, Elke A. Rundensteiner, Joshua D. Guttman (Worcester Polytechnic Institute)
Distributed stream processing systems must function efficiently for data streams that fluctuate in their arrival rates and data distributions. Yet repeated and prohibitively expensive load re-allocation across machines may make these systems ineffective, potentially resulting in data loss or even system failure. To overcome this problem, we instead propose a Robust Load Distribution (RLD) strategy that is robust to data fluctuations. RLD provides ε-optimal query performance under load fluctuations without suffering the performance penalty caused by load migration. RLD is based on three key strategies. First, we model robust distributed stream processing as a parametric query optimization problem. The notions of robust logical and robust physical plans are then overlays of this parameter space. Second, our Early-terminated Robust Partitioning (ERP) finds a set of robust logical plans covering the parameter space, while minimizing the number of prohibitively expensive optimizer calls, with a probabilistic bound on the space coverage. Third, our OptPrune algorithm maps the space-covering logical solution to a single robust physical plan, tolerant to deviations in data statistics, that maximizes the parameter space coverage at runtime. Our experimental study using stock market and sensor network streams demonstrates that our RLD methodology consistently outperforms state-of-the-art solutions in terms of efficiency and effectiveness in highly fluctuating data stream environments.

Research 24: Data Mining II
10:30 - 12PM
Chair: Jian Pei (Simon Fraser University)
Location: Concorde

Learning to Rank from Distant Supervision: Exploiting Noisy Redundancy for Relational Entity Search
Mianwei Zhou, Hongning Wang, Kevin Chen-Chuan Chang (University of Illinois at Urbana-Champaign / Advanced Digital Sciences Center)
In this paper, we propose to study the task of relational entity search, which aims at automatically learning an entity ranking function for a desired relation. To rank the entities, we exploit the redundancy buried in their snippets; however, such redundancy is noisy, as not all the snippets convey information relevant to the desired relation. To extract useful information from such noisy redundancy, we abstract the task as a distantly supervised ranking problem: based on coarse entity-level annotations, deriving a relation-specific ranking function for online search purposes. The key challenge is that, without detailed snippet-level annotations, we have to filter noisy snippets in order to estimate an accurate ranking function; furthermore, the ranking function should also be executable online. We develop the Pattern-based Filter Network (PFNet), a novel probabilistic graphical model, as our solution. To balance the accuracy and efficiency requirements, PFNet selects a limited number of indicative patterns to filter noisy snippets, and an inverted index is utilized to retrieve the required features. Experiments on the large-scale ClueWeb09 data set for six different relations confirm the effectiveness of the proposed PFNet model, which outperforms five state-of-the-art relational entity ranking methods.
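The distant-supervision flavor can be illustrated with a deliberately tiny Python toy: snippet-level noise is filtered by indicative patterns, and entities are ranked by their surviving snippets. This is not PFNet's probabilistic graphical model, and the patterns and data below are made up:

import re

# Hypothetical indicative patterns for a "birthplace" relation.
INDICATIVE = [re.compile(p) for p in (r"\bborn in\b", r"\bnative of\b")]

def score_entity(snippets):
    # Rank score = number of snippets matching an indicative pattern;
    # non-matching (noisy) snippets contribute nothing.
    return sum(any(p.search(s.lower()) for p in INDICATIVE) for s in snippets)

entities = {
    "Alice": ["Alice was born in Paris.", "Alice likes cheese."],
    "Bob":   ["Bob went shopping."],
}
ranking = sorted(entities, key=lambda e: score_entity(entities[e]), reverse=True)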
AFFINITY: Efficiently Querying Statistical Measures on Time-Series Data
Saket Sathe, Karl Aberer (École Polytechnique Fédérale de Lausanne)
Computing statistical measures for large databases of time series is a fundamental primitive for querying and mining time-series data. This primitive is gaining importance with the increasing number and rapid growth of time-series databases. In this paper, we introduce a framework for the efficient computation of statistical measures by exploiting the concept of affine relationships. Affine relationships can be used to infer statistical measures for time series from other, related time series instead of computing them directly, thus reducing the overall computational cost significantly. The resulting methods exhibit at least one order of magnitude improvement over the best known methods. To the best of our knowledge, this is the first work that presents a unified approach for computing and querying several statistical measures at once. Our approach exploits affine relationships using three key components. First, the AFCLST algorithm clusters the time-series data such that high-quality affine relationships can easily be found. Second, the SYMEX algorithm uses the clustered time series and efficiently computes the desired affine relationships. Third, the SCAPE index structure produces a many-fold improvement in the performance of processing several statistical queries by seamlessly indexing the affine relationships. Finally, we establish the effectiveness of our approaches through a comprehensive experimental evaluation on real datasets.

Forecasting the Data Cube: A Model Configuration Advisor for Multi-Dimensional Data Sets
Ulrike Fischer, Christopher Schildt, Claudio Hartmann, Wolfgang Lehner (Dresden University of Technology)
Forecasting time series data is crucial in a number of domains, such as supply chain management and display advertisement. In these areas, the time series to forecast are typically organized along multiple dimensions, leading to a high number of time series that need to be forecast. Most current approaches focus only on selecting and optimizing a forecast model for a single time series. In this paper, we explore how we can utilize time series at different dimensions to increase forecast accuracy and, optionally, reduce model maintenance overhead. Solving this problem is challenging due to the large space of possibilities and potentially high model creation costs. We propose a model configuration advisor that automatically determines the best set of models, a model configuration, for a given multi-dimensional data set. Our approach is based on a general process that iteratively examines more and more models and simultaneously controls the search space depending on the data set, model type, and available hardware. The final model configuration is integrated into F2DB, an extension of PostgreSQL that processes forecast queries and maintains the configuration as new data arrives. We comprehensively evaluated our approach on real and synthetic data sets. The evaluation shows that our approach significantly increases forecast query accuracy while ensuring low model costs.

Seminar 7: Querying Encrypted Data
10:30 - 12PM
Location: Odeon
Arvind Arasu, Ken Eguro, Raghav Kaushik, Ravi Ramamurthy (Microsoft Research)
Data security is a serious concern when we migrate data to a cloud DBMS.
Database encryption, where sensitive columns are encrypted before they are stored in the cloud, has been proposed as a mechanism to address such data security concerns. The intuitive expectation is that an adversary cannot "learn" anything about the encrypted columns, since she does not have access to the encryption key. However, query processing becomes a challenge, since it needs to "look inside" the data. This tutorial explores the space of designs studied in prior work on processing queries over encrypted data. We cover approaches based on the classic client-server architecture as well as approaches involving the use of a trusted hardware module where data can be securely decrypted. We discuss the privacy challenges that arise in both approaches and how they may be addressed. Briefly, supporting the full complexity of a modern DBMS, including complex queries, transactions, and stored procedures, leads to significant challenges that we survey.

Panel: Big Data for the Public
10:30 - 12PM
Moderator: Dimitrios Georgakopoulos (CSIRO, Australia)
Panelists: Karl Aberer (École Polytechnique Fédérale de Lausanne), Ashraf Aboulnaga (University of Waterloo), Kevin Chang (University of Illinois at Urbana-Champaign), Xin Luna Dong (Google Inc.)
Location: Ballroom 1-2
While data are now being produced and collected on unprecedented scales, most of the "big data" remain inaccessible or difficult to use by the public. For example, companies fervently guard the data they collect despite the potential for greater public good. Lots of government data are supposedly public, but they are hard to access or analyze. Even if data are readily accessible (such as the Web), obtaining reliable information from noisy, high-volume, and heterogeneous data sources remains a daunting task for the majority of the public. This panel is about the challenges in fully realizing big data's potential impact on the public. Topics of interest include data quality, integration, and privacy, as well as infrastructure, platform, and application support for making access and analysis easier. Panelists will also discuss issues in the practice of big-data research: how limited access to real data and workloads affects the reproducibility and robustness of research results, and how we can measure research success and impact on the public.

Demo Groups 3 & 4
10:30 - 12PM
Location: Ballroom 3

Πgora: An Integration System for Probabilistic Data
Dan Olteanu, Lampros Papageorgiou, Sebastiaan J. van Schaik (University of Oxford)
Πgora is an integration system for probabilistic data modelled using different formalisms such as pc-tables, Bayesian networks, and stochastic automata. User queries are expressed over a global relational layer and are evaluated by Πgora using a range of strategies, including data conversion into one probabilistic formalism followed by evaluation using an engine for that formalism, and hybrid plans, where subqueries are evaluated using engines for different formalisms. This demonstration allows users to experience Πgora on real-world heterogeneous data sources from the medical domain.

Complex Pattern Matching in Complex Structures: the XSeq Approach
Kai Zeng, Mohan Yang (University of California, Los Angeles)
Barzan Mozafari (Massachusetts Institute of Technology)
Carlo Zaniolo (University of California, Los Angeles)
There is much current interest in applications of complex event processing over data streams and of complex pattern matching over stored sequences.
While some applications use streams of flat records, XML and various semi-structured information formats are preferred by many others, in particular applications that deal with domain science, social networks, RSS feeds, and finance. The XSeq language and its system improve complex pattern matching technology significantly, both in terms of expressive power and of efficient implementation. XSeq achieves higher expressiveness through an extension of XPath based on Kleene-* pattern constructs, and achieves very efficient execution, on both stored and streaming data, using Visibly Pushdown Automata (VPA). In our demo, we will (i) show examples of XSeq in different application domains, (ii) explain its compilation/query optimization techniques and show the speed-ups they deliver, and (iii) demonstrate how powerful and efficient application-specific languages can be implemented by superimposing simple 'skins' on XSeq and its system.

T-Music: A Melody Composer based on Frequent Pattern Mining
Cheng Long, Raymond Chi-Wing Wong, Raymond Ka Wai Sze (The Hong Kong University of Science and Technology)
There is a large body of work on algorithms for composing the melody of a song automatically, which is known as algorithmic composition. To the best of our knowledge, none of this work has taken the lyrics into consideration for melody composition. However, according to some recent studies, there usually exists a certain degree of correlation between a song's melody and its lyrics. In this demonstration, we propose to utilize this type of correlation for melody composition. Based on this idea, we design a new melody composition algorithm and develop a melody composer called T-Music, which employs this composition algorithm.

SHARE: Secure Information Sharing Framework for Emergency Management
Barbara Carminati, Elena Ferrari, Michele Guglielmi (University of Insubria)
9/11, Katrina, Fukushima, and other recent emergencies demonstrate the need for effective information sharing across government agencies as well as non-governmental and private organizations to assess emergency situations and generate proper response plans. In this demo, we present a system to enforce timely and controlled information sharing in emergency situations. The framework is able to detect emergencies, enforce temporary access control policies and obligations to be activated during emergencies, simulate emergency situations for demonstration purposes, and show statistical results related to emergency activation/deactivation and the consequent triggering of access control policies.

KORS: Keyword-aware Optimal Route Search System
Xin Cao, Lisi Chen, Gao Cong (Nanyang Technological University)
Jihong Guan (Tongji University)
Nhan-Tue Phan, Xiaokui Xiao (Nanyang Technological University)
We present the Keyword-aware Optimal Route Search System (KORS), which efficiently answers KOR queries. A KOR query finds a route that covers a set of user-specified keywords, satisfies a specified budget constraint, and optimizes an objective score of the route. Consider a tourist who wants to spend a day exploring a city.
The user may issue the following KOR query: "a most popular route such that it passes by a shopping mall, a restaurant, and a pub, and the travel time to and from her hotel is within 4 hours." KORS provides browser-based interfaces for desktop and laptop computers, and a client application for mobile devices as well. The interfaces and the client enable users to formulate queries and view the query results on a map. Queries are then sent to the server for processing via an HTTP post operation. Since answering a KOR query is NP-hard, we devise two approximation algorithms with provable performance bounds and one greedy algorithm to process KOR queries in our KORS prototype. We use two real-world datasets to demonstrate the functionality and performance of the system.

CrowdPlanr: Planning Made Easy with Crowd
Ilia Lotosh, Tova Milo, Slava Novgorodov (Tel-Aviv University)
Recent research has shown that crowdsourcing can be used effectively to solve problems that are difficult for computers, e.g., optical character recognition and identification of the structural configuration of natural proteins. In this demo we propose to use the power of the crowd to address yet another difficult problem that frequently occurs in daily life: planning a sequence of actions when the goal is hard to formalize. Examples include planning the sequence of places/attractions to visit in the course of a vacation, where the goal is to enjoy the resulting vacation the most, or planning the sequence of courses to take in an academic schedule, where the goal is to obtain solid knowledge of a given subject domain. Such goals may be easily understandable by humans, but hard or even impossible to formalize for a computer. We present a novel algorithm for efficiently harnessing the crowd to assist in solving such planning problems. The algorithm builds the desired plans incrementally, optimally choosing at each step the 'best' questions so that the overall number of questions that need to be asked is minimized. We demonstrate the effectiveness of our solution in CrowdPlanr, a system for vacation travel planning. Given a destination, dates, preferred activities, and other constraints, CrowdPlanr employs the crowd to build a vacation plan (a sequence of places to visit) that is expected to maximize the "enjoyment" of the vacation.

ASVTDETECTOR: A Practical Near-Duplicate Video Retrieval System
Xiangmin Zhou (CSIRO ICT Center)
Lei Chen (Hong Kong University of Science and Technology)
In this paper, we present a system, named ASVTDETECTOR, to retrieve near-duplicate videos with large variations based on a 3D structure tensor model, the ASVT series, built over the local descriptors of video segments. Different from traditional global-feature-based video detection systems that incur severe information loss, the ASVT model is built over the local descriptor set of each video segment, keeping the robustness of local descriptors. Meanwhile, unlike traditional local-feature-based methods that suffer from the high cost of pairwise descriptor comparison, the ASVT model describes a video segment as a 3D structure tensor, which is simply a 3×3 matrix, obtaining high retrieval efficiency. In this demonstration, we show that, given a clip, our ASVTDETECTOR system can effectively find near-duplicates with large variations from a large collection in real time.
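For intuition about how a whole video segment can be summarized by a single 3×3 matrix, here is a generic Python/NumPy sketch of a 3D structure tensor computed from raw intensity gradients. Note that ASVT is built over sets of local descriptors rather than raw pixels, so this shows only the general construction, with made-up toy data:

import numpy as np

def structure_tensor_3d(volume):
    # Aggregate 3D structure tensor of a video segment treated as a
    # (t, y, x) volume: the sum of outer products of local intensity
    # gradients, yielding a single 3x3 descriptor.
    gt, gy, gx = np.gradient(volume.astype(float))
    g = np.stack([gt.ravel(), gy.ravel(), gx.ravel()])   # shape 3 x N
    return g @ g.T                                       # shape 3 x 3

segment = np.random.rand(8, 32, 32)    # toy 8-frame, 32x32 segment
T = structure_tensor_3d(segment)       # compare segments via, e.g., a matrix norm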
YumiInt - A Deep Web Integration System for Local Search Engines for Georeferenced Objects
Eduard Dragut (Purdue University)
B. P. Beirne, A. Neyestani, B. Atassi, Clement Yu, Bhaskar DasGupta (University of Illinois at Chicago)
Weiyi Meng (Binghamton University)
We present YumiInt, a deep Web integration system for local search engines for geo-referenced objects. YumiInt consists of two systems: YumiDev and YumiMeta. YumiDev is an off-line integration system that builds the key components (e.g., query translation and entity resolution) of YumiMeta. YumiMeta is the Web application to which users post queries. It translates queries to multiple sources and gets back aggregated lists of results. We present the two systems in this paper.

A Demonstration of the G* Graph Database System
Sean R. Spillane, Jeremy Birnbaum, Daniel Bokser, Daniel Kemp, Alan Labouseur, Paul W. Olsen Jr., Jayadevan Vijayan, Jeong-Hyon Hwang (University at Albany - State University of New York), Jun-Weon Yoon (KISTI Supercomputing Center)
The world is full of evolving networks, many of which can be represented by a series of large graphs. Neither current graph processing systems nor database systems can efficiently store and query these graphs, due to their lack of support for managing multiple graphs and their lack of essential graph querying capabilities. We propose to demonstrate our system, G*, which meets the new challenges of managing multiple graphs and supporting fundamental graph querying capabilities. G* can store graphs on a large number of servers while compressing these graphs based on their commonalities. G* also allows users to easily express queries on graphs, and it efficiently executes these queries by sharing computations across graphs. During our demonstrations, conference attendees will run various analytic queries on large, practical data sets. These demonstrations will highlight the convenience and performance benefits of G* over existing database and graph processing systems, the effectiveness of sharing in graph data storage and processing, as well as G*'s scalability.

RECODS: Replica Consistency-On-Demand Store
Yuqing Zhu (Tsinghua University)
Philip S. Yu (University of Illinois at Chicago)
Jianmin Wang (Tsinghua University)
Replication is critical to the scalability, availability, and reliability of large-scale systems. The trade-off between replica consistency and response latency is widely understood for large-scale stores with replication. The weak consistency guaranteed by existing large-scale stores complicates application development, while strong consistency hurts application performance. It is desirable that the best consistency be guaranteed for a tolerable response latency, but none of the existing large-scale stores supports maximized replica consistency within a given latency constraint. In this demonstration, we showcase RECODS (REplica Consistency-On-Demand Store), a NoSQL store implementation that can finely control this trade-off on a per-operation basis and thus facilitate application development with on-demand replica consistency. With RECODS, developers can specify the tolerable latency for each read/write operation. Within the specified latency constraint, a response is returned and replica consistency is maximized.
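The contract can be sketched generically in Python: fan a read out to all replicas, wait no longer than the per-operation latency budget, and answer from whatever arrived (here, the freshest timestamped value; the more responses, the stronger the consistency). The replica objects and their methods are hypothetical, and RECODS's actual mechanism is more involved:

import concurrent.futures as cf

def read_with_deadline(replicas, key, deadline_s):
    # Query all replicas in parallel and wait at most deadline_s.
    pool = cf.ThreadPoolExecutor(max_workers=len(replicas))
    futures = [pool.submit(r.get, key) for r in replicas]
    done, _ = cf.wait(futures, timeout=deadline_s)
    pool.shutdown(wait=False, cancel_futures=True)   # abandon stragglers
    responses = [f.result() for f in done]
    if not responses:
        raise TimeoutError("no replica answered within the latency budget")
    # Answer from whatever arrived: return the freshest value seen.
    return max(responses, key=lambda resp: resp.timestamp)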
The RECODS implementation is based on Cassandra, an open-source NoSQL store, but with a different operation execution process, replication process, and in-memory storage hierarchy.

SODIT: An Innovative System for Outlier Detection using Multiple Localized Thresholding and Interactive Feedback
Ji Zhang, Hua Wang, Xiaohui Tao, Lili Sun (University of Southern Queensland)
Outlier detection is an important, long-standing research problem in data mining and has found applications in a wide range of domains, including business, engineering, biology, and security. However, traditional outlier detection methods inevitably need different parameters for detection, such as those used to specify the distance or density cutoff for distinguishing outliers from normal data points. Relying on trial and error, the traditional outlier detection methods are rather tedious to tune. In this demo proposal, we introduce an innovative outlier detection system, called SODIT, that uses localized thresholding to assist in specifying threshold values that closely reflect the local data distribution. In addition, easy-to-use user feedback is employed to further facilitate the determination of optimal parameter values. SODIT makes outlier detection much easier to operate, and produces more accurate, intuitive, and informative results than before.

COLA: A Cloud-based System for Online Aggregation
Yantao Gan, Xiaofeng Meng, Yingjie Shi (Renmin University of China)
Online aggregation is a promising solution for achieving fast early responses to interactive ad-hoc queries that compute aggregates on massive data. To process large datasets on large-scale computing clusters, MapReduce has been introduced as a popular paradigm into many data analysis applications. However, typical MapReduce implementations are not well suited to analytic tasks, since they are geared towards batch processing. With the increasing popularity of ad-hoc analytic query processing over enormous datasets, processing aggregate queries with MapReduce in an online fashion is therefore an emerging, important application need. We present a MapReduce-based online aggregation system called COLA, which provides progressive approximate aggregate answers for both single tables and multiple joined tables. COLA provides an online aggregation execution engine with novel sampling techniques to support incremental and continuous computation of aggregates, and to minimize the waiting time before an acceptably precise estimate is available. In addition, user-friendly SQL queries are supported in COLA. Furthermore, COLA can implicitly convert non-OLA jobs into online versions, so that users don't have to write any special-purpose code to obtain estimates.

RoadAlarm: A Spatial Alarm System on Road Networks
Kisung Lee, Emre Yigitoglu, Ling Liu, Binh Han, Balaji Palanisamy, Calton Pu (Georgia Institute of Technology)
Spatial alarms are one of the fundamental functionalities of many location-based services (LBSs). We argue that spatial alarms should be road-network aware, as mobile objects travel on spatially constrained road networks or walk paths. In this software system demonstration, we present the first prototype of RoadAlarm, a spatial alarm processing system for moving objects on road networks. The demonstration focuses on three unique features of the RoadAlarm system design.
First, we will show that spatial alarms are best modeled using road-network distance, such as segment-length-based and travel-time-based distance; a road-network spatial alarm is thus a star-like subgraph centered at the alarm target. Second, we will show the suite of RoadAlarm optimization techniques that scale spatial alarm processing by taking into account spatial constraints on road networks and the mobility patterns of mobile subscribers. Third, we will show that by equipping the RoadAlarm system with an activity-monitoring control panel, the system administrator and end users can visualize road-network spatial alarms and the mobility traces of moving objects, and dynamically select or customize the RoadAlarm techniques for spatial alarm processing through a graphical user interface. The RoadAlarm system provides both the general system architecture and the essential building blocks for location-based advertisements and location-based reminders.

Real-time Abnormality Detection System for Intensive Care Management
Guangyan Huang, Jing He (Victoria University)
Jie Cao (Nanjing University of Finance and Economics)
Zhi Qiao (Chinese Academy of Sciences / Victoria University)
Michael Steyn (Royal Brisbane and Women's Hospital / Victoria University)
Kersi Taraporewalla (Royal Brisbane and Women's Hospital)
Detecting abnormalities in multiple correlated time series is valuable in applications where a credible real-time event prediction system can minimize economic losses (e.g., a stock market crash) and save lives (e.g., medical surveillance in the operating theatre). For example, in an intensive care scenario, anesthetists perform a vital role in monitoring the patient and adjusting the flow and type of anesthetics during an operation. An early awareness of possible complications is vital for an anesthetist to react correctly to a given situation. In this demonstration, we provide a comprehensive medical surveillance system that effectively detects abnormalities in multiple physiological data streams to assist online intensive care management. In particular, a novel online support vector regression (OSVR) algorithm is developed to approach the problem of discovering abnormalities in multiple correlated time series with accuracy and real-time efficiency. We also utilize historical data streams to optimize the precision of the OSVR algorithm. Moreover, the system comprises a friendly user interface that integrates multiple physiological data streams and visualizes abnormality alarms.

Research 25: Lineage and Provenance
1:30 - 3PM
Chair: Zach Ives (University of Pennsylvania)
Location: St Germaine

SubZero: a Fine-Grained Lineage System for Scientific Databases
Eugene Wu, Samuel Madden, Michael Stonebraker (Massachusetts Institute of Technology)
Data lineage is a key component of provenance that helps scientists track and query relationships between input and output data. While current systems readily support lineage relationships at the file or data array level, finer-grained support at the array-cell level is impractical due to the lack of support for user-defined operators and the high runtime and storage overhead of storing such lineage. We interviewed scientists in several domains to identify a set of common semantics that can be leveraged to efficiently store fine-grained lineage.
We use these insights to define lineage representations that efficiently capture common locality properties in the lineage data, and a set of APIs so that operator developers can easily export lineage information from user-defined operators. Finally, we introduce two benchmarks derived from astronomy and genomics, and show that our techniques can reduce lineage query costs by up to 10× while incurring substantially less impact on workflow runtime and storage.

Logical Provenance in Data-Oriented Workflows
Robert Ikeda, Akash Das Sarma, Jennifer Widom (Stanford University)
We consider the problem of defining, generating, and tracing provenance in data-oriented workflows, in which input data sets are processed by a graph of transformations to produce output results. We first give a new general definition of provenance for general transformations, introducing the notions of correctness, precision, and minimality. We then determine when properties such as correctness and minimality carry over from the individual transformations' provenance to the workflow provenance. We describe a simple logical-provenance specification language consisting of attribute mappings and filters, and provide an algorithm for provenance tracing in workflows where the logical provenance of each transformation is specified in our language. We consider logical provenance in the relational setting, observing that for a class of Select-Project-Join (SPJ) transformations, logical provenance specifications encode minimal provenance. We have built a prototype system supporting the features and algorithms presented in the paper, and we report a few preliminary experimental results.

Revision Provenance in Text Documents of Asynchronous Collaboration
Jing Zhang (Twitter)
H.V. Jagadish (University of Michigan)
Many text documents today are collaboratively edited, often through multiple small changes. The problem we consider in this paper is how to find the provenance of a specific part of interest in the document. A full revision history, represented as a version tree, can tell us about all updates made to the document, but most of these updates may apply to other parts of the document and hence be irrelevant to the provenance question at hand. In this paper, we propose the notion of a revision unit as a flexible unit for capturing the necessary provenance. We demonstrate through experiments the capability of revision units to keep only relevant updates in the provenance representation, and their flexibility in adjusting to updates reflected in the version tree.

Research 26: Similarity Search
1:30 - 3PM
Chair: Tingjian Ge (University of Massachusetts at Lowell)
Location: Bastille 1

Inverted Linear Quadtree: Efficient Top K Spatial Keyword Search
Chengyuan Zhang, Ying Zhang, Wenjie Zhang (University of New South Wales)
Xuemin Lin (University of New South Wales / East China Normal University)
With advances in geo-positioning technologies and geo-location services, there is a rapidly growing number of spatio-textual objects collected in applications such as web search, location-based services, and social network services, in which an object is described by its spatial location and a set of keywords (terms). Consequently, the study of spatial keyword search, which explores both the location and the textual description of the objects, has attracted great attention from commercial organizations and research communities.
Similarity Query Processing for Probabilistic Sets
Ming Gao, Cheqing Jin (East China Normal University); Wei Wang (University of New South Wales); Xuemin Lin (East China Normal University / University of New South Wales); Aoying Zhou (East China Normal University)
Evaluating the similarity between sets is a fundamental task in computer science. However, in many applications the elements of a set may be uncertain for various reasons. Existing work on modeling probabilistic sets and computing their similarities suffers from exponentially large model sizes or prohibitively costly similarity computation, and is hence applicable only to tiny probabilistic sets. In this paper, we propose a simple yet expressive model that supports many applications in which a probabilistic set may have thousands, or hundreds of thousands, of elements. We define two types of similarity between two probabilistic sets using possible world semantics; they complement each other in capturing the similarity distributions over the cross product of possible worlds. We design efficient dynamic programming-based algorithms to calculate both types of similarity, and propose novel individual and batch pruning techniques based on upper-bounding the similarity values. To accommodate extremely large probabilistic sets, we also design sampling-based approximate query processing methods with strong probabilistic guarantees. We have conducted extensive experiments using both synthetic and real datasets, demonstrating the effectiveness and efficiency of our proposed methods.

Top-k String Similarity Search with Edit-Distance Constraints
Dong Deng, Guoliang Li, Jianhua Feng (Tsinghua University); Wen-Syan Li (SAP Labs Shanghai)
String similarity search is a fundamental operation in many areas, such as data cleaning, information retrieval, and bioinformatics. In this paper we study the problem of top-k string similarity search with edit-distance constraints: given a collection of strings and a query string, return the top-k strings with the smallest edit distances to the query string. Existing methods usually try different edit-distance thresholds and select an appropriate threshold to find the top-k answers; however, selecting an appropriate threshold is rather expensive. To address this problem, we propose a progressive framework that improves the traditional dynamic-programming algorithm for computing edit distance. We prune unnecessary entries in the dynamic programming matrix and compute only the pivotal entries. We extend our techniques to support top-k similarity search, and develop a range-based method that groups the pivotal entries to avoid duplicate computation. Experimental results show that our method achieves high performance and significantly outperforms state-of-the-art approaches on real-world datasets.
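For context, the sketch below shows the baseline strategy the abstract argues against: threshold-banded edit distance, retried with doubling thresholds until k answers are found. It is not the paper's pivotal-entry algorithm; the banding is the classic Ukkonen technique, and the doubling schedule is an assumption.

```python
def banded_edit_distance(s, t, tau):
    """Edit distance if it is <= tau, else None. Only a band of width
    2*tau+1 around the DP matrix diagonal is computed."""
    n, m = len(s), len(t)
    if abs(n - m) > tau:
        return None
    INF = tau + 1
    prev = [j if j <= tau else INF for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [INF] * (m + 1)
        if i <= tau:
            cur[0] = i
        for j in range(max(1, i - tau), min(m, i + tau) + 1):
            cur[j] = min(prev[j] + 1,                       # deletion
                         cur[j - 1] + 1,                    # insertion
                         prev[j - 1] + (s[i-1] != t[j-1]))  # substitution / match
        if min(cur) > tau:
            return None            # every band entry exceeded tau: give up early
        prev = cur
    return prev[m] if prev[m] <= tau else None

def topk_similar(strings, q, k):
    """Baseline top-k: retry with doubling thresholds (the costly strategy
    that the paper's progressive algorithm is designed to avoid)."""
    tau = 1
    while True:
        hits = [(d, s) for s in strings
                if (d := banded_edit_distance(s, q, tau)) is not None]
        if len(hits) >= k or tau > max(len(q), max(map(len, strings), default=0)):
            return sorted(hits)[:k]
        tau *= 2
```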
Research 27: Shortest and Direct Query (1:30 - 3 PM, Bastille 2)
Chair: Gao Cong (Nanyang Technological University)

On Shortest Unique Substring Queries
Jian Pei (Simon Fraser University); Wush Chi-Hsuan Wu, Mi-Yen Yeh (Academia Sinica, Taiwan)
In this paper, we tackle a novel type of interesting query: shortest unique substring queries. Given a (long) string S and a query point q in the string, can we find a shortest substring containing q that is unique in S? We illustrate that shortest unique substring queries have many potential applications, such as information retrieval, bioinformatics, and event context analysis, and we develop efficient algorithms for online query answering. First, we present an algorithm that answers a shortest unique substring query in O(n) time using a suffix tree index, where n is the length of the string S. Second, we show that, using O(n · h) time and O(n) space, we can compute a shortest unique substring for every position in a given string, where h is theoretically O(n) but is often much smaller than n on real data sets and can be treated as a constant. Once the shortest unique substrings are precomputed, shortest unique substring queries can be answered online in constant time. In addition to these algorithmic results, we empirically demonstrate the effectiveness and efficiency of shortest unique substring queries on real data sets.

Engineering Generalized Shortest Path Queries
Michael N. Rice, Vassilis J. Tsotras (University of California, Riverside)
Generalized Shortest Path (GSP) queries are a variant of constrained shortest path queries in which a solution path of minimum total cost must visit at least one location from each of a set of specified location categories (e.g., gas stations, grocery stores) in a specified order. This problem type has many practical applications in logistics and personalized location-based services, and is closely related to the NP-hard Generalized Traveling Salesman Path Problem (GTSPP). In this work, we present a new dynamic programming formulation that highlights the structure of this problem. Using this formulation as our foundation, we progressively engineer a fast and scalable GSP query algorithm for use on large, real-world road networks. Our approach incorporates concepts from Contraction Hierarchies, a well-known graph indexing technique for static shortest path queries. To demonstrate the practicality of our algorithm, we experimented on the North American road network (with over 50 million edges), where we achieved up to several orders of magnitude speed improvement over the previous best algorithm, depending on the relative sizes of the location categories.
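The "visit one location from each category, in order" constraint has a compact textbook formulation as a search over (node, layer) states, sketched below. This is the naive baseline, with none of the paper's dynamic-programming refinements or Contraction Hierarchies machinery; the graph and category encodings are assumptions.

```python
# Naive GSP baseline: Dijkstra over states (node, i), where i counts how many
# of the ordered categories have already been satisfied along the path.
import heapq

def gsp_cost(adj, source, target, categories, node_cat):
    """adj: u -> [(v, cost)]; node_cat: u -> set of category names.
    Returns the minimum cost of a source->target path visiting at least one
    node of categories[0], then categories[1], ..., in order; None if none."""
    L = len(categories)
    dist = {(source, 0): 0.0}
    pq = [(0.0, source, 0)]
    while pq:
        d, u, i = heapq.heappop(pq)
        if d > dist.get((u, i), float("inf")):
            continue
        # satisfying the next category at the current node is a free transition,
        # and for this ordered-visit constraint it is never harmful to take it
        while i < L and categories[i] in node_cat.get(u, ()):
            i += 1
            dist[(u, i)] = min(d, dist.get((u, i), float("inf")))
        if u == target and i == L:
            return d
        for v, w in adj.get(u, ()):
            nd = d + w
            if nd < dist.get((v, i), float("inf")):
                dist[(v, i)] = nd
                heapq.heappush(pq, (nd, v, i))
    return None
```

The state space is |V| x (L+1), so this is only practical for small instances; the paper's contribution is making such queries fast on continental road networks.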
Efficient Direct Search on Compressed Genomic Data
Xiaochun Yang, Bin Wang (Northeastern University); Chen Li (University of California, Irvine); Jiaying Wang (Northeastern University); Xiaohui Xie (University of California, Irvine)
The explosive growth in the amount of data produced by next-generation sequencing poses significant computational challenges in storing, transmitting, and querying these data efficiently and accurately. A unique characteristic of genomic sequence data is that many sequences can be highly similar to each other, which has motivated the idea of compressing sequence data by storing only their differences from a reference sequence, thereby drastically cutting the storage cost. An unresolved question in this area, however, is whether it is possible to perform search directly on the compressed data, and if so, how. Here we show that directly querying compressed genomic sequence data is possible and can be done efficiently. We describe a set of novel index structures and algorithms for this purpose, and present several optimization techniques to reduce the space requirement and query response time. We demonstrate the advantage of our method and compare it against existing ones through a thorough experimental study on real genomic data.

Research 28: Skyline and Snapshot Query (1:30 - 3 PM, Concorde)
Chair: Reynold Cheng (University of Hong Kong)

On Answering Why-not Questions in Reverse Skyline Queries
Md. Saiful Islam, Rui Zhou, Chengfei Liu (Swinburne University of Technology)
This paper aims at answering so-called why-not questions for reverse skyline queries. A reverse skyline query retrieves all data points whose dynamic skylines contain the query point. We outline the benefit and the semantics of answering why-not questions for reverse skyline queries. In connection with this, we show how to modify the why-not point and the query point so that the why-not point is included in the reverse skyline of the query point. We then show how a query point can be positioned safely anywhere within a region (the safe region) without losing any of the existing reverse skyline points, and how to answer why-not questions by considering the safe region of the query point. Our approach efficiently combines both query point and data point modification techniques to produce meaningful answers. Experimental results demonstrate that our approach produces high-quality explanations for why-not questions in reverse skyline queries.
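The definitions stacked in this abstract (dynamic skyline, reverse skyline) are easy to misread, so the naive checker below spells them out. It is a direct transcription of the standard definitions, not the paper's algorithm; points are assumed to be numeric tuples.

```python
def dominates(a, b):
    """a dominates b: a is <= b in every dimension and < in at least one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def in_dynamic_skyline(q, p, others):
    """Is q in the dynamic skyline with respect to p? Map every point t to its
    coordinate-wise distance |t - p| from p; q qualifies if no mapped data
    point dominates the mapped q."""
    qm = tuple(abs(x - y) for x, y in zip(q, p))
    return not any(
        dominates(tuple(abs(x - y) for x, y in zip(t, p)), qm) for t in others)

def reverse_skyline(points, q):
    """All data points p whose dynamic skyline contains the query point q
    (p itself is excluded when checking, since |p - p| dominates everything)."""
    return [p for p in points
            if in_dynamic_skyline(q, p, [t for t in points if t != p])]
```

A why-not answer, in these terms, asks how to perturb q (or a missing point) so that the membership test above starts returning True.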
Layered Processing of Skyline-Window-Join (SWJ) Queries using Iteration-Fabric
Mithila Nagendra, K. Selçuk Candan (Arizona State University)
The problem of finding interesting tuples in a data set, commonly known as the skyline problem, has been extensively studied in scenarios where the data is static. More recently, skyline research has moved toward data streaming environments, where tuples arrive and expire continuously. Several algorithms have been developed to track skyline changes over sliding windows; however, existing methods focus on skyline analysis in which all required skyline attributes belong to a single incoming data stream. This constraint renders current algorithms unsuitable for applications that require a real-time "join" operation across multiple incoming data streams, arriving from different sources, before the skyline query can be answered. Motivated by this, we address the problem of computing skyline-window-join (SWJ) queries over pairs of data streams, considering sliding windows that take into account only the most recent tuples. In particular, we propose a Layered Skyline-Window-Join (LSJ) operator that (a) partitions the overall process into processing layers and (b) maintains skyline-join results incrementally by continuously monitoring the changes in all layers. We combine the advantages of existing skyline methods (including those that efficiently maintain skyline results over a single stream and those that compute the skyline of pairs of static data sets) to develop a novel iteration-fabric skyline-window-join processing structure. Using the iteration-fabric, LSJ eliminates redundant work across consecutive windows by leveraging shared data across all iteration layers of the windowed skyline-join processing. To the best of our knowledge, this is the first paper to address join-based skyline queries over sliding windows. Extensive experimental evaluations over real and simulated data show that LSJ provides large gains over naive extensions of existing schemes, which are not designed to eliminate redundant work across multiple processing layers.

Efficient Snapshot Retrieval over Historical Graph Data
Udayan Khurana, Amol Deshpande (University of Maryland)
We present a distributed graph database system for managing historical data for large evolving information networks, with the goal of enabling temporal and evolutionary queries and analysis. The cornerstone of our system is a novel, user-extensible, highly tunable, and distributed hierarchical index structure called DeltaGraph that enables compact recording of historical network information and supports efficient retrieval of historical graph snapshots for single-site or parallel processing. Our system exposes a general programmatic API to process and analyze the retrieved snapshots. Along with the original graph data, DeltaGraph can also maintain and index "auxiliary" information; this functionality can be used to extend the structure to efficiently execute queries such as subgraph pattern matching over historical data. We develop analytical models for both the storage space needed and the snapshot retrieval times to aid in choosing the right construction parameters for a specific scenario. We also present an in-memory graph data structure called GraphPool that can maintain hundreds of historical graph instances in main memory in a non-redundant manner. We present a comprehensive experimental evaluation that illustrates the effectiveness of our proposed techniques at managing historical graph information.
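The retrieval primitive behind this design (store deltas, then materialize a historical snapshot by replaying them from a nearby checkpoint) fits in a few lines. A flat, single-level sketch under assumed event and checkpoint formats; DeltaGraph itself is a hierarchical, distributed index with far more machinery.

```python
# Flat sketch of delta-based snapshot retrieval: an append-only log of edge
# events plus periodic checkpoints. Event times are assumed positive and
# appended in increasing order; replay here is a linear scan, which real
# systems would index.
import bisect

class HistoricalGraph:
    def __init__(self, checkpoint_every=1000):
        self.events = []                         # (t, op, u, v)
        self.ckpt_times, self.ckpts = [0], [frozenset()]
        self.every = checkpoint_every

    def apply(self, op, u, v, t):                # op is "add" or "del"
        self.events.append((t, op, u, v))
        if len(self.events) % self.every == 0:   # periodically materialize
            self.ckpt_times.append(t)
            self.ckpts.append(frozenset(self.snapshot(t)))

    def snapshot(self, t):
        """Edge set of the graph as of time t: copy the latest checkpoint
        taken at or before t, then replay the remaining events."""
        i = bisect.bisect_right(self.ckpt_times, t) - 1
        ct, edges = self.ckpt_times[i], set(self.ckpts[i])
        for et, op, u, v in self.events:
            if ct < et <= t:
                (edges.add if op == "add" else edges.discard)((u, v))
        return edges
```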
Seminar 8: Shallow Information Extraction for the Knowledge Web (1:30 - 3 PM, Odeon)
Denilson Barbosa (University of Alberta); Haixun Wang (Microsoft Research Asia); Cong Yu (Google Inc.)
A new breed of information extraction tools has become popular and has proven very effective in building massive-scale knowledge bases that fuel applications such as question answering and semantic search. These approaches rely on Web-scale probabilistic models populated through shallow language processing of text, preexisting knowledge, and structured data already on the Web. This tutorial provides an introduction to these techniques, starting from the foundations of information extraction and covering some of its key applications.

Demo Groups 3 & 4 (1:30 - 3 PM, Ballroom 3)
See Demo Groups 3 & 4 (p. 90) for demonstration details.

Research 29: Large Graph Indexing (3:30 - 5 PM, St Germaine)
Chair: James Cheng (The Chinese University of Hong Kong)

FERRARI: Flexible and Efficient Reachability Range Assignment for Graph Indexing
Stephan Seufert, Avishek Anand (Max Planck Institute for Informatics); Srikanta Bedathur (IIIT Delhi); Gerhard Weikum (Max Planck Institute for Informatics)
In this paper, we propose a scalable and highly efficient index structure for the reachability problem over graphs. We build on the well-known node interval labeling scheme, in which the set of vertices reachable from a particular node is compactly encoded as a collection of node identifier ranges. We impose an explicit bound on the size of the index and flexibly assign approximate reachability ranges to nodes of the graph such that the number of index probes needed to answer a query is minimized. The resulting tunable index structure generates a better range labeling as the space budget is increased, providing direct control over the trade-off between index size and query processing performance. Using a fast recursive querying method in conjunction with our index structure, we show that in practice reachability queries can be answered in the order of microseconds on an off-the-shelf computer, even for massive-scale real-world graphs. Our claims are supported by an extensive set of experimental results using a multitude of benchmark and real-world web-scale graph datasets.
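The interval labeling scheme FERRARI builds on is easiest to see on a tree, where a single interval per node answers reachability exactly in O(1); FERRARI's contribution is making the range sets approximate and budgeted on general graphs. A minimal tree-only sketch, with an assumed child-list encoding:

```python
# Post-order interval labeling on a tree: label[u] = (lo, hi), where hi is u's
# post-order number and lo is the smallest post-order number in u's subtree.
# v is reachable from u iff lo(u) <= hi(v) <= hi(u).
def interval_labels(children, root):
    labels, counter = {}, [0]
    def dfs(u):
        lo = counter[0]
        for c in children.get(u, ()):
            dfs(c)
        labels[u] = (lo, counter[0])
        counter[0] += 1          # assign u's own post-order number last
    dfs(root)
    return labels

def reachable(labels, u, v):
    lo, hi = labels[u]
    return lo <= labels[v][1] <= hi

# Example tree: 0 -> {1, 2}, 1 -> {3}
labels = interval_labels({0: [1, 2], 1: [3]}, 0)
assert reachable(labels, 0, 3) and not reachable(labels, 2, 3)
```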
gIceberg: Towards Iceberg Analysis in Large Graphs
Nan Li, Ziyu Guan, Lijie Ren (University of California at Santa Barbara); Jian Wu (Zhejiang University); Jiawei Han (University of Illinois at Urbana-Champaign); Xifeng Yan (University of California at Santa Barbara)
Traditional multi-dimensional data analysis techniques such as iceberg cubes cannot be applied directly to graphs for finding interesting or anomalous vertices, due to the lack of dimensionality in graphs. In this paper, we introduce the concept of graph icebergs: vertices for which the concentration (aggregation) of an attribute in their vicinity is abnormally high. Intuitively, these vertices should be "close" to the attribute of interest in the graph space. Based on this intuition, we propose a novel framework, called gIceberg, which performs aggregation using random walks rather than the traditional SUM and AVG aggregate functions. The framework scores vertices by their levels of interestingness and finds important vertices that meet a user-specified threshold. To improve scalability, two aggregation strategies, forward and backward aggregation, are proposed, with corresponding optimization techniques and bounds. Experiments on both real-world and synthetic large graphs demonstrate that gIceberg is effective and scalable.

Top-k Graph Pattern Matching over Large Graphs
Jiefeng Cheng (Chinese Academy of Sciences / Shenzhen Key Laboratory of High Performance Data Mining); Xianggang Zeng (Chinese Academy of Sciences); Jeffrey Xu Yu (The Chinese University of Hong Kong)
Many graph-based applications, including bioinformatics, social science, link analysis, citation analysis, and collaborative work, need to deal with a large data graph. Given a large data graph, we study finding the top-k answers for a graph pattern query; in particular, we focus on top-k cyclic graph queries, where a graph query is cyclic and can be complex. The capability of supporting top-k graph pattern matching (kGPM) over a data graph gives users much more flexibility in searching graphs, and the problem itself is challenging. In this paper, we propose a new framework for processing kGPM with on-the-fly ranked lists based on spanning trees of the cyclic graph query. We observe a multidimensional representation for using multiple ranked lists to answer a given kGPM query. Under this representation, we propose a cost model to estimate the least number of tree answers to be consumed in each ranked list in order to answer a given kGPM query Q. This leads to a query optimization approach for kGPM processing and a top-k algorithm that processes kGPM with the optimal query plan. We conducted extensive performance studies using a real dataset, confirming the efficiency of our proposed approach.

Research 30: Web Data (3:30 - 5 PM, Bastille 1)

Breaking the Top-k Barrier of Hidden Web Databases
Saravanan Thirumuruganathan (University of Texas at Arlington); Nan Zhang (George Washington University); Gautam Das (University of Texas at Arlington / Qatar Computing Research Institute)
A large number of web databases are accessible only through proprietary form-like interfaces that require users to query the system by entering desired values for a few attributes. A key restriction enforced by such an interface is the top-k output constraint: when there are a large number of matching tuples, only a few (top-k) of them are preferentially selected and returned by the website, often according to a proprietary ranking function. Since most web database owners set k to a small value, the top-k output constraint prevents many interesting third-party (e.g., mashup) services from being developed over real-world web databases. In this paper we consider the novel problem of "digging deeper" into such web databases. Our main contribution is the meta-algorithm GetNext, which can retrieve the next-ranked tuple from a hidden web database using only its restrictive interface, without any prior knowledge of its ranking function. The algorithm can then be called iteratively to retrieve as many top-ranked tuples as necessary. We develop principled and efficient algorithms that generate and execute multiple reformulated queries and infer the next-ranked tuple from their returned results. We provide a theoretical analysis of our algorithms, as well as extensive experimental results over synthetic and real-world databases that illustrate the effectiveness of our techniques.

Automatic Extraction of Top-k Lists from the Web
Zhixian Zhang, Kenny Q. Zhu (Shanghai Jiao Tong University); Haixun Wang, Hongsong Li (Microsoft Research Asia)
This paper is concerned with information extraction from top-k web pages: pages that describe the top k instances of a topic, usually one of general interest. Examples include "the 10 tallest buildings in the world" and "the 50 hits of 2010 you don't want to miss". Compared to other structured information on the web (including web tables), the information in top-k lists is larger and richer, of higher quality, and generally more interesting; top-k lists are therefore highly valuable.
For example, they can help enrich open-domain knowledge bases (to support applications such as search or fact answering). In this paper, we present an efficient method that extracts top-k lists from web pages with high performance. Specifically, we extract more than 1.6 million top-k lists from a web corpus of 1.7 billion pages with 92% precision and 72% recall.

Finding Interesting Correlations with Conditional Heavy Hitters
Katsiaryna Mirylenka, Themis Palpanas (University of Trento); Graham Cormode, Divesh Srivastava (AT&T Labs - Research)
The notion of heavy hitters (items that make up a large fraction of the population) has been used successfully in a variety of applications across sensor and RFID monitoring, network data analysis, event mining, and more. Yet this notion often fails to capture the semantics we desire when we observe data in the form of correlated pairs. Here, we are interested in items that are conditionally frequent: items that are frequent within the context of their parent item. In this work, we introduce and formalize the notion of conditional heavy hitters to identify such items. We introduce several streaming algorithms that find conditional heavy hitters efficiently, and provide analytical results; different algorithms are successful for different input characteristics. We perform an experimental evaluation to demonstrate the efficacy of our methods and to study which algorithms are best suited to different types of data.
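One simple way to make "conditionally frequent" concrete is to keep bounded-memory counters for parents and for (parent, child) pairs and report pairs whose estimated conditional frequency clears a threshold. The sketch uses the classic SpaceSaving summary as the counter; it is a naive strategy, not any of the paper's algorithms, and the capacities and threshold are assumptions.

```python
class SpaceSaving:
    """Classic bounded-memory frequency summary: at most `capacity` counters;
    a new key evicts the current minimum and inherits its count."""
    def __init__(self, capacity):
        self.cap, self.counts = capacity, {}

    def add(self, key):
        if key in self.counts:
            self.counts[key] += 1
        elif len(self.counts) < self.cap:
            self.counts[key] = 1
        else:
            victim = min(self.counts, key=self.counts.get)
            self.counts[key] = self.counts.pop(victim) + 1

    def get(self, key):
        return self.counts.get(key, 0)

def conditional_heavy_hitters(stream, tau=0.5, cap=100):
    """stream yields (parent, child) pairs; report children whose estimated
    frequency within their parent's context is at least tau."""
    parents, pairs = SpaceSaving(cap), SpaceSaving(cap)
    for parent, child in stream:
        parents.add(parent)
        pairs.add((parent, child))
    return [(p, c) for (p, c) in pairs.counts
            if parents.get(p) and pairs.get((p, c)) / parents.get(p) >= tau]
```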
Research 31: Query Optimization (3:30 - 5 PM, Bastille 2)
Chair: Fabian Hueske (TU Berlin)

Predicting Query Execution Time: Are Optimizer Cost Models Really Unusable?
Wentao Wu (University of Wisconsin-Madison); Yun Chi, Shenghuo Zhu, Junichi Tatemura, Hakan Hacıgümüş (NEC Laboratories America); Jeffrey Naughton (University of Wisconsin-Madison)
Predicting query execution time is useful in many database management tasks, including admission control, query scheduling, progress monitoring, and system sizing. Recently the research community has been exploring statistical machine learning approaches to build predictive models for this task. An implicit assumption behind this work is that the cost models used by query optimizers are insufficient for predicting query execution time. In this paper we challenge this assumption and show that while the simple approach of scaling the optimizer's estimated cost indeed fails, a properly calibrated optimizer cost model is surprisingly effective. However, even a well-tuned optimizer cost model will fail in the presence of cardinality estimation errors. Accordingly, we investigate the novel idea of spending extra resources to refine estimates for the query plan after it has been chosen by the optimizer but before execution. In our experiments we find that a well-calibrated query optimizer model, together with cardinality estimation refinement, provides a low-overhead way to obtain estimates that are always competitive with, and often much better than, the best reported numbers from the machine learning approaches.

Query Optimization for Differentially Private Data Management Systems
Shangfu Peng (University of Maryland); Yin Yang, Zhenjie Zhang (Advanced Digital Sciences Center); Marianne Winslett (Advanced Digital Sciences Center / University of Illinois at Urbana-Champaign); Yong Yu (Shanghai Jiao Tong University)
Differential privacy (DP) enables publishing the results of statistical queries over sensitive data with rigorous privacy guarantees and very conservative assumptions about the adversary's background knowledge. This paper focuses on the interactive DP framework, which processes incoming queries on the fly, each consuming a portion of a user-specified privacy budget. Existing systems process each query independently, which often wastes considerable privacy budget and consequently exhausts the total budget quickly. Motivated by this, we propose Pioneer, a query optimizer for an interactive, DP-compliant DBMS. For each new query, Pioneer creates an execution plan that combines past query results with new results computed from the underlying data. When a query has multiple semantically equivalent plans, Pioneer automatically selects one with minimal privacy budget consumption. Extensive experiments confirm that Pioneer achieves significant savings of the privacy budget and can answer many more queries than existing systems for a fixed total budget, with comparable result accuracy.

Top Down Plan Generation: From Theory to Practice
Pit Fender, Guido Moerkotte (University of Mannheim)
Finding the optimal execution order of join operations is a crucial task for today's cost-based query optimizers. There are two approaches to identifying the best plan: bottom-up and top-down join enumeration. Only the top-down approach, however, allows branch-and-bound pruning, which can improve compile time by several orders of magnitude while still preserving optimality. Efficient enumeration algorithms have been published for both optimization strategies, but the top-down approach has suffered from two severe limitations: the published algorithms can handle only (1) simple (binary) join predicates and (2) inner joins. Since real queries may contain complex join predicates involving more than two relations, as well as outer joins and other non-inner joins, efficient top-down join enumeration could not yet be used in practice. We develop a novel top-down join enumeration algorithm that overcomes these two limitations. Furthermore, we show that our new algorithm is competitive with the state of the art in bottom-up processing even without exploiting its branch-and-bound pruning capabilities.

Research 32: Data Storage (3:30 - 5 PM, Concorde)
Chair: Mohamed Sharaf (University of Queensland)

TBF: A Memory-Efficient Replacement Policy for Flash-based Caches
Cristian Ungureanu, Biplob Debnath, Steve Rago, Akshat Aranya (NEC Laboratories America)
The performance and capacity characteristics of flash storage make it attractive for use as a cache. Recency-based cache replacement policies rely on an in-memory full index, typically a B-tree or a hashtable, that maps each object to its recency information. Even though the recency information itself may take very little space, the full index for a cache holding N keys requires at least log N bits per key. This metadata overhead is undesirably high for very large flash-based caches, such as key-value stores with billions of objects. To solve this problem, we propose a new recency-based, RAM-frugal cache replacement policy that approximates the least-recently-used (LRU) policy. It maintains recency information in two in-memory Bloom sub-filters (TBF) and leverages an on-flash key-value store to cache objects. TBF requires only one byte of RAM per cached object, making it suitable for implementing very large flash-based caches. We evaluate TBF through simulation on traces from several block stores and key-value stores, as well as in a real system implementation using the Yahoo! Cloud Serving Benchmark. Evaluation results show that TBF achieves cache hit rates and operations per second comparable to those of LRU despite its much smaller memory requirements.
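The recency trick in TBF, approximating LRU with two rotating Bloom filters instead of a per-object index, can be sketched compactly. The filter sizes and rotation threshold below are assumptions, and the Bloom filter is deliberately tiny; the real system pairs this structure with an on-flash key-value store.

```python
# Two rotating Bloom sub-filters as an approximate-LRU recency signal:
# an access marks the key in the current filter; a key is "recent" if either
# filter has seen it; when the current filter fills up, the older one is
# dropped. String keys are assumed.
import hashlib

class Bloom:
    def __init__(self, bits=8192, hashes=4):
        self.bits, self.hashes, self.arr = bits, hashes, bytearray(bits // 8)
    def _idx(self, key):
        h = hashlib.sha256(key.encode()).digest()
        for i in range(self.hashes):
            yield int.from_bytes(h[4*i:4*i+4], "big") % self.bits
    def add(self, key):
        for b in self._idx(key):
            self.arr[b // 8] |= 1 << (b % 8)
    def __contains__(self, key):
        return all(self.arr[b // 8] & (1 << (b % 8)) for b in self._idx(key))

class TwoBloomRecency:
    def __init__(self, per_filter=1000):     # assumed rotation threshold
        self.cur, self.old = Bloom(), Bloom()
        self.n, self.limit = 0, per_filter
    def touch(self, key):
        self.cur.add(key)
        self.n += 1
        if self.n >= self.limit:              # rotate the sub-filters
            self.old, self.cur, self.n = self.cur, Bloom(), 0
    def is_recent(self, key):
        return key in self.cur or key in self.old

# On eviction, the cache scans candidate victims and prefers ones for which
# is_recent() is False, approximating LRU at a small fixed RAM cost.
```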
Fast Peak-to-Peak Behavior with SSD Buffer Pool
Jaeyoung Do (University of Wisconsin-Madison); Donghui Zhang (Paradigm4); Jignesh M. Patel (University of Wisconsin-Madison); David DeWitt (Microsoft Jim Gray Systems Lab)
A promising use of flash SSDs in a DBMS is to extend the main memory buffer pool by caching selected pages that have been evicted from the buffer pool. Such use has been shown to produce significant gains in the steady-state performance of the DBMS. One strategy for using the SSD buffer pool is to throw away the data in the SSD when the system is restarted (either when recovering from a crash or restarting after a shutdown), at the cost of a long "ramp-up" period to regain peak performance. One approach to eliminating this limitation is to store the SSD buffer table in a memory-mapped file so that its contents can be restored on restart. However, this design can lower sustained performance, because every update to the SSD buffer table may incur an I/O operation to the memory-mapped file. In this paper we propose two new alternative designs: one reconstructs the SSD buffer table from transactional logs; the other flushes the SSD buffer table asynchronously and, upon restart, lazily verifies the integrity of the data cached in the SSD buffer pool. We have implemented all three designs in SQL Server 2012, each with both write-through and write-back SSD caching policies. Using two OLTP benchmarks (TPC-C and TPC-E), our experimental results show that our designs speed up the peak-to-peak interval by up to 3.8X with negligible performance loss; in contrast, the previous approach achieves a similar speedup but with up to 54% performance loss.

SELECT Triggers for Data Auditing
Daniel Fabbri (University of Michigan); Ravi Ramamurthy, Raghav Kaushik (Microsoft Research)
Auditing is a key part of the security infrastructure in a database system. While commercial database systems provide mechanisms such as triggers that can track and log any changes made to "sensitive" data by UPDATE queries, these are not useful for tracking accesses to sensitive data made through complex SQL queries, which is important for many applications in light of recent laws such as HIPAA. In this paper, we propose the notion of SELECT triggers, which extends triggers to SELECT queries in order to facilitate data auditing. We discuss the challenges of integrating SELECT triggers into a database system, including specification and semantics as well as efficient implementation techniques. We have prototyped our framework in a commercial database system and present an experimental evaluation using the TPC-H benchmark.
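Outside the engine, the intent of a SELECT trigger can be mimicked at the application layer by wrapping query execution and logging reads that touch sensitive tables. A toy sqlite3 sketch under assumed table names; the paper's contribution is doing this inside the DBMS with well-defined semantics and low overhead, which a wrapper like this cannot replicate.

```python
# Application-level stand-in for a SELECT trigger: log reads of sensitive
# tables before executing them. Table detection by substring match is naive
# (a real implementation would inspect the parsed query inside the engine).
import sqlite3, time

SENSITIVE = {"patients", "diagnoses"}   # assumed sensitive tables

def execute_audited(conn, sql, params=(), user="unknown"):
    lowered = sql.lower()
    touched = sorted(t for t in SENSITIVE if t in lowered)
    if lowered.lstrip().startswith("select") and touched:
        conn.execute(
            "INSERT INTO audit_log(ts, usr, tables, query) VALUES (?,?,?,?)",
            (time.time(), user, ",".join(touched), sql))
    return conn.execute(sql, params)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients(id INTEGER, name TEXT)")
conn.execute("CREATE TABLE audit_log(ts REAL, usr TEXT, tables TEXT, query TEXT)")
execute_audited(conn, "SELECT name FROM patients WHERE id = ?", (1,), user="alice")
print(conn.execute("SELECT usr, tables FROM audit_log").fetchall())
```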
Seminar 9: Secure and Privacy-Preserving Database Services in the Cloud (3:30 - 5 PM, Odeon)
Divyakant Agrawal, Amr El Abbadi, Shiyuan Wang (University of California, Santa Barbara)
Cloud computing has become a very successful paradigm for data computing and storage. However, concerns about data security and privacy in the cloud are also increasing. Ensuring security and privacy for data management and query processing in the cloud is critical for better and broader use of the cloud. This tutorial covers recent research on cloud security and privacy, focusing on work that protects data confidentiality and query access privacy for sensitive data stored and queried in the cloud. We provide a comprehensive study of state-of-the-art schemes and techniques for protecting data confidentiality and access privacy, and explain their trade-offs in security, privacy, functionality, and performance.

Poster Session (3:30 - 6 PM, Ballroom 1 & 2)

TRANSPORT
How to travel to/from the airport and venue

Train: The Airtrain runs to/from the Brisbane Domestic and International Airports, with a travel time of just 22 minutes to Central Station. A one-way single adult ticket costs AUD$15.00. For the Airtrain timetable please visit translink.com.au
Mon-Fri: Airport to Central Station: first train 5:40 AM, last train 10 PM. Central Station to Airport: first train 5 AM, last train 9 PM.
Weekends: Airport to Central Station: first train 6 AM, last train 10 PM. Central Station to Airport: first train 5 AM, last train 9 PM.

Taxi: Fares vary with distance, traffic conditions, and time; however, you can expect a fare between Brisbane's CBD and Brisbane Airport to total approximately AUD$40.00.

Public Transport in Brisbane (Buses, Trains and Ferries): For timetables, a journey planner and other details, go to translink.com.au. The go card is TransLink's electronic ticket; it allows you to travel seamlessly on all bus, train and ferry services. Buy your go card at train stations, most newsagencies, or online via https://gocard.translink.com.au/webtix/

SOCIAL PROGRAM

IEEE TCDE Members Reception: Monday 8 April, 7 - 9 PM, at the Summit Restaurant, 1012 Sir Samuel Griffith Drive, Brisbane Lookout, Mt Coot-tha. Bus pickup from the ICDE conference hotel (Sofitel) at 6:30 PM; departure from the restaurant back to the Sofitel at 9:15 PM. New and current TCDE members are invited to join us for the 2013 TCDE Members Reception.

Welcome Reception: Tuesday 9 April, 5:30 - 7 PM, in the Ann Street Lobby, Sofitel Hotel.

Banquet: Wednesday 10 April, 6:30 - 10 PM, at the Brisbane City Hall Main Auditorium, King George Square, Brisbane CBD. Brisbane City Hall is located 600 m from the Sofitel Hotel, an 8-minute walk east. Directions are available from the Registration Desk.

Posters & Farewell Drinks: Thursday 11 April. Posters: 3:30 - 6 PM, with drinks from 5 - 6 PM, in Ballroom 1 & 2, Sofitel Hotel.

REGISTRATION & INFORMATION DESK
The registration & information desk for the conference is located in the Ann Street Lobby of the Sofitel Hotel.
The information desk will be open at the following times:
Sunday 7 April: 4 PM - 8 PM
Monday 8 April: 7 AM - 9 PM
Tuesday 9 April: 7 AM - 7 PM
Wednesday 10 April: 8 AM - 6 PM
Thursday 11 April: 8 AM - 6 PM
Event Coordinator: Kathleen Williamson. Phone: 0401 477 509. Email: [email protected]

Volunteers: Volunteers will be available to help with any questions during the conference. They may be identified by their black ICDE-13 shirts.

ICDE-13 Student Travel Award Winners: To claim the AUD$600 award, please email [email protected] or visit the Registration Desk during the conference.

Internet Access: Free Wi-Fi Internet access is provided on the conference floor for delegates. For access details please visit the Registration Desk.

Handy Brisbane Apps: Including AirTrain, bikes, taxis, public transport, maps, news, weather and food: www.brisbanemarketing.com.au/Resources/Convention-Support-Toolkit/pages/Delegate-Experience/Handy-Brisbane-Tourist-Apps

ICDE 2014: ICDE 2014 will be held from 7 to 11 April 2014 at the Intercontinental Marriott Downtown, 540 North Michigan Avenue, Chicago, IL, USA. Please contact Goce Trajcevski ([email protected]), Northwestern University.

VOLUNTEERS
ICDE-13 extends warm appreciation to the conference volunteers who assisted before, during, and after the conference to help make sure that everyone enjoys a great conference experience. These volunteers welcome participants, give directions, help in the sessions and on the registration desk, and generally make sure the conference runs smoothly. At the conference, volunteers may be identified by their black ICDE-13 shirts.
Chao Gu, The University of Queensland
Guanfeng Liu, Macquarie University
Hamed Hassanzadeh, The University of Queensland
Han Ada Su, The University of Queensland
Haozhou Wang, The University of Queensland
Hongyun Cai, The University of Queensland
Jiajie Yue, The University of Queensland
Jiping Tracy Wang, The University of Queensland
Kun Zhao, The University of Queensland
Liangchen Liu, The University of Queensland
Litao Yu, The University of Queensland
Marina Drosou, University of Ioannina
Mukhammad Andri Setiawan, The University of Queensland
Sayan Unankard, The University of Queensland
Vinita Nahar, The University of Queensland
Xuefei Li, The University of Queensland
Yunfei Shi, UQ Business School

ICDE 2013 COMMITTEES

Organizing Committee
General Chairs: Rao Kotagiri (The University of Melbourne, Australia); Beng Chin Ooi (National University of Singapore, Singapore)
Program Chairs: Christian S. Jensen (Aarhus University, Denmark); Chris Jermaine (Rice University, USA); Xiaofang Zhou (The University of Queensland, Australia)
Workshop Chairs: Chee Yong Chan (National University of Singapore, Singapore); Kjetil Nørvåg (Norwegian University of Science and Technology, Norway)
Proceedings Chairs: Jiaheng Lu (Renmin University of China, China); Egemen Tanin (The University of Melbourne, Australia)
Industry Chairs: Sang Cha (Seoul National University, Korea); Haixun Wang (Microsoft Research Asia, China)
Ph.D. Symposium Chairs: Gottfried Vossen (University of Münster, Germany); Min Wang (HP Labs China, China)
Seminar Chair: Alexandros Labrinidis (University of Pittsburgh, USA)
Panel Chairs: Dimitrios Georgakopoulos (CSIRO, Australia); Jun Yang (Duke University, USA)
Demo Chairs: Yoshiharu Ishikawa (Nagoya University, Japan); Rui Zhang (The University of Melbourne, Australia); Yanchun Zhang (Victoria University, Australia)
Poster Chair: Wook-Shin Han (Kyungpook National University, Korea)
Local Organization Chairs: Shazia Sadiq (The University of Queensland, Australia); Heng Tao Shen (The University of Queensland, Australia)
Finance Chair: Marta Indulska (The University of Queensland, Australia)
Web and Publicity Chair: Mohamed Sharaf (The University of Queensland, Australia)

Program Committee
Program Chairs: Christian S. Jensen (Aarhus University, Denmark); Chris Jermaine (Rice University, USA); Xiaofang Zhou (The University of Queensland, Australia)

Track Chairs:
Wolfgang Lehner (Dresden University of Technology, Germany): Data warehousing, analytics, MapReduce, and big data
Xin (Luna) Dong (AT&T Labs - Research, USA): Data integration, metadata management, interoperability
Jian Pei (Simon Fraser University, Canada): Data mining and knowledge discovery: algorithms
Srinivasan Parthasarathy (Ohio State University, USA): Data mining and knowledge discovery: applications
Panos Chrysanthis (University of Pittsburgh, USA): Cloud infrastructure, mobile, distributed, and peer-to-peer data management
Paul Larson (Microsoft Research, USA): Indexing and storage
Pierangela Samarati (University of Milan, Italy): Privacy and security
Amol Deshpande (University of Maryland, USA): Query processing and query optimization
Magda Balazinska (University of Washington, USA): Scientific data and data visualization
Jeffrey Xu Yu (The Chinese University of Hong Kong, China): Semistructured data, RDF, XML
Xifeng Yan (University of California at Santa Barbara, USA): Social networks, web, and personal information management
Simonas Saltenis (Aalborg University, Denmark): Spatial, temporal, and multimedia data
Nesime Tatbul (ETH Zurich, Switzerland): Streams, sensor networks, and complex event processing
Alan Fekete (The University of Sydney, Australia): Systems, performance, and transaction management
Vagelis Hristidis (University of California at Riverside, USA): Text, graphs, and search
Xuemin Lin (The University of New South Wales, Australia): Uncertain and probabilistic data

Research Program Committee Members:
Karl Aberer, EPFL, Switzerland; Ashraf Aboulnaga, University of Waterloo, Canada; Yanif Ahmad, Johns Hopkins University, USA; Gustavo Alonso, ETH Zurich, Switzerland; Walid Aref, Purdue University, USA; Ismail Ari, Ozyegin University, Turkey; Ira Assent, Aarhus University, Denmark; Sitaram Asur, HP Research, USA; Shivnath Babu, Duke University, USA; Torben Bach Pedersen, Aalborg University, Denmark; James Bailey, The University of Melbourne, Australia; Phil Bernstein, Microsoft Research, USA; Sourav Saha Bhowmick, Nanyang Technological University, Singapore; Peter Boncz, CWI, The Netherlands;
K. Selçuk Candan, Arizona State University, USA; Kaushik Chakrabarti, Microsoft Research, USA; Badrish Chandramouli, Microsoft Research, USA; Kevin Chang, University of Illinois at Urbana-Champaign, USA; Lijun Chang, Chinese University of Hong Kong, China; Sanjay Chawla, The University of Sydney, Australia; Lei Chen, Hong Kong University of Science and Technology, China; Shimin Chen, HP Labs China, China; Yi Chen, Arizona State University, USA; Hong Cheng, City University of Hong Kong, China; James Cheng, Nanyang Technological University, Singapore; Reynold Cheng, University of Hong Kong, China; Tao Cheng, Microsoft Research, USA; Paolo Ciaccia, University of Bologna, Italy; Graham Cormode, AT&T Labs Research, USA; Bin Cui, Peking University, China; Judith Cushing, The Evergreen State College, USA; Gautam Das, UT Arlington & QCRI, USA; Sudipto Das, Microsoft Research, USA; Khuzaima Daudjee, University of Waterloo, Canada; Sabrina De Capitani di Vimercati, Università degli Studi di Milano, Italy; Alex Delis, University of Athens, Greece; Josep Domingo-Ferrer, Universitat Rovira i Virgili, Spain; Sameh Elnikety, Microsoft Research, USA; Ling Feng, Tsinghua University, China; Peter Fischer, University of Freiburg, Germany; George Fletcher, Eindhoven University of Technology, The Netherlands; Johann Christoph Freytag, Humboldt-Universität zu Berlin, Germany; Keith Frikken, Miami University, USA; Tingjian Ge, University of Massachusetts at Lowell, USA; Bugra Gedik, Bilkent University, Turkey; Gabriel Ghinita, University of Massachusetts at Boston, USA; Amol Ghoting, IBM Research, USA; Aristides Gionis, Yahoo! Research, USA; Lukasz Golab, University of Waterloo, Canada; Ralf Hartmut Güting, FernUniversität Hagen, Germany; Hakan Hacigumus, NEC Labs, USA; Jiawei Han, University of Illinois at Urbana-Champaign, USA; Jan Hidders, Delft University of Technology, The Netherlands; Bill Howe, University of Washington, USA; Helen Huang, The University of Queensland, Australia; Ihab Ilyas, Qatar Computing Research Institute, Qatar; Raghav Kaushik, Microsoft Research, USA; Bettina Kemme, McGill University, Canada; Martin Kersten, CWI Amsterdam, The Netherlands; Nick Koudas, University of Toronto, Canada; Georgia Koutrika, HP Labs, USA; Tim Kraska, University of California Berkeley, USA; Peer Kröger, LMU Munich, Germany; Harumi Kuno, HP Labs, USA; Ashwin Lall, Denison University, USA; Adam Lee, University of Pittsburgh, USA; Leman Akoglu, Carnegie Mellon University, USA; Chen Li, University of California at Irvine, USA; Chengkai Li, University of Texas at Arlington, USA; Feifei Li, University of Utah, USA; Guoliang Li, Tsinghua University, China; Tao Li, Florida International University, USA; Chengfei Liu, Swinburne University of Technology, Australia; David Lomet, Microsoft Research, USA; Hua Lu, Aalborg University, Denmark; Qiong Luo, Hong Kong University of Science and Technology, China; Shuai Ma, Beihang University, China; Nikos Mamoulis, University of Hong Kong, China; Yannis Manolopoulos, Aristotle University of Thessaloniki, Greece; Claudia Medeiros, University of Campinas, Brazil; Sharad Mehrotra, University of California at Irvine, USA; Alexandra Meliou, University of Washington, USA; Mohamed Mokbel, University of Minnesota, USA; Bongki Moon, University of Arizona, USA; Barzan Mozafari, MIT, USA; Arnab Nandi, Ohio State University, USA; Mario Nascimento, University of Alberta, Canada; Thomas Neumann, Technische Universität München, Germany; Raymond Ng, University of British Columbia, Canada;
Alexandros Ntoulas, UCLA, USA; Mitsunori Ogihara, University of Miami, USA; Dan Olteanu, Oxford University, UK; Carlos Ordonez, University of Houston, USA; Ippokratis Pandis, IBM Research, USA; Spiros Papadimitriou, Google, USA; Olga Papaemmanouil, Brandeis University, USA; Stefano Paraboschi, Università degli Studi di Bergamo, Italy; Marta Patiño-Martínez, Universidad Politécnica de Madrid, Spain; Peter Pietzuch, Imperial College London, UK; Evaggelia Pitoura, University of Ioannina, Greece; Rachel Pottinger, University of British Columbia, Canada; Lu Qin, Chinese University of Hong Kong, China; Venkatesh Raghavan, Greenplum/EMC, USA; Jorge-Arnulfo Quiané-Ruiz, Saarland University, Germany; Christopher Ré, University of Wisconsin, USA; Matthias Renz, Ludwig-Maximilians University Munich, Germany; Florin Rusu, University of California, Merced, USA; Kai-Uwe Sattler, Ilmenau University of Technology, Germany; Venu Satuluri, Twitter, USA; Thomas Seidl, RWTH Aachen University, Germany; Sudipta Sengupta, Microsoft Research, USA; Mohamed Sharaf, The University of Queensland, Australia; Jialie Shen, Singapore Management University, Singapore; Yasin Silva, Arizona State University, USA; Manas Somaiya, eBay Inc, USA; Julia Stoyanovich, University of Pennsylvania, USA; Kian-Lee Tan, National University of Singapore, Singapore; Nan Tang, Qatar Computing Research Institute, Qatar; Yufei Tao, Chinese University of Hong Kong, China; Arash Termehchy, University of Illinois at Urbana-Champaign, USA; Evimaria Terzi, Boston University, USA; Jens Teubner, ETH Zurich, Switzerland; Hanghang Tong, IBM Research, USA; Vincent Tseng, National Cheng Kung University, Taiwan; Kostas Tzoumas, Technical University of Berlin, Germany; Marcos Vaz Salles, University of Copenhagen, Denmark; Akrivi Vlachou, Athens University of Economics and Business, Greece; Jianyong Wang, Tsinghua University, China; Wei Wang, University of North Carolina, USA; Yuqing Melanie Wu, Indiana University, USA; Hui Xiong, Rutgers University, USA; Fei Xu, Microsoft Search, USA; Bin Yang, Aarhus University, Denmark; Yin Yang, Advanced Digital Sciences Center, USA; Mi-Yen Yeh, Academia Sinica, Taiwan; Man Lung Yiu, Hong Kong Polytechnic University, China; Hwanjo Yu, Pohang University of Science and Technology, Korea; Demetris Zeinalipour-Yazti, University of Cyprus, Cyprus; Rui Zhang, The University of Melbourne, Australia; Wenjie Zhang, The University of New South Wales, Australia; Ying Zhang, The University of New South Wales, Australia; Peixiang Zhao, University of Illinois at Urbana-Champaign, USA; Zhi-Hua Zhou, Nanjing University, China; Feida Zhu, Singapore Management University, Singapore

Industry Program Chairs: Sang Cha, Seoul National University, Korea; Haixun Wang, Microsoft Research Asia, China

Industry Program Committee Members:
Athman Bouguettaya, RMIT, Australia; Brian Cooper, Google, USA; Carsten Binnig, DHBW Mannheim, Germany; Changkyu Kim, Intel, USA; Christof Bornhoevd, SAP Labs Palo Alto, USA; Fabian Suchanek, Max Planck Institute for Informatics, Germany; Russell Sears, Microsoft Research, USA; Sameh Elnikety, Microsoft Research, USA; Vincent Tseng, National Cheng Kung University, Taiwan; Xing Xie, Microsoft Research Asia, China; Yan Huang, University of North Texas, USA; Yanghua Xiao, Fudan University, China

Demo Program Chairs: Yoshiharu Ishikawa, Nagoya University, Japan; Rui Zhang, The University of Melbourne, Australia; Yanchun Zhang, Victoria University, Australia

Demo Program Committee Members:
Sourav S. Bhowmick, Nanyang Technological University, Singapore; Christian Böhm, University of Munich, Germany; Malu Castellanos, HP Labs, USA; Wojciech Cellary, Poznan University of Economics, Poland; Jidong Chen, EMC Research China Lab, China; Reynold Cheng, University of Hong Kong, China; Gao Cong, Nanyang Technological University, Singapore; Elena Ferrari, University of Insubria, Italy; Jing He, Victoria University, Australia; Zhen He, La Trobe University, Australia; Mizuho Iwaihara, Waseda University, Japan; Sun Kim, Seoul National University, Korea; Christian Konig, Microsoft Research, USA; Dan Lin, Missouri University of Science and Technology, USA; Eric Lo, Hong Kong Polytechnic University, China; Weiyi Meng, State University of New York at Binghamton, USA; Xiaofeng Meng, Renmin University of China, China; Jun Miyazaki, Nara Institute of Science and Technology, Japan; Kyriakos Mouratidis, Singapore Management University, Singapore; Emmanuel Müller, Karlsruhe Institute of Technology, Germany; Timos Sellis, National Technical University of Athens, Greece; David Taniar, Monash University, Australia; Hua Wang, The University of Southern Queensland, Australia; Wei Wang, The University of New South Wales, Australia; Lexing Xie, Australian National University, Australia; Xiaohui Yu, York University, Canada; Zhenjie Zhang, Advanced Digital Sciences Center, USA

EVENT HIGHLIGHTS: BRISBANE (visitbrisbane.com.au)

APRIL
Until 14 Apr: The 7th Asia Pacific Triennial of Contemporary Art (APT7), QAGOMA
Until Jul: Queensland Reds Season 2013, Suncorp Stadium
Until Sep: Brisbane Broncos Season 2013, Suncorp Stadium
From 6 Apr: Brisbane Lions Season 2013, The Gabba

MAY
8 - 19 May: Brisbane Racing Carnival, Doomben & Eagle Farm Racecourses
11 May - 8 Jun: Anywhere Theatre Festival, Homes, parks, shops…anywhere
30 May - 9 Jun: Bolshoi Ballet, Queensland Performing Arts Centre

JUNE
8 & 22 Jun: British & Irish Lions, Suncorp Stadium
16 Jun: City2South, Brisbane City & South Bank
26 Jun: State of Origin, Suncorp Stadium

JULY
From 6 Jul: War Horse, Queensland Performing Arts Centre

AUGUST
4 Aug: Brisbane Marathon Festival, Brisbane City

MORE TO EXPLORE
Sophisticated and sporty, haute and hot, Brisbane packs a lot into a short stay. From riverside dining and farmers markets to free laughs and a world-class line-up of events, Brisbane has you covered.

GREETERS
Find out what makes Brisbane tick on a daily Brisbane Greeters walking tour led by passionate and in-the-know locals, leaving daily at 10 AM from the Visitor Information Centre, Queen Street Mall.

LIVE MUSIC
The live music scene in Brisbane is well and truly aLIVE! Check out The Tivoli, Black Bear Lodge and Ric's Bar in Fortitude Valley, or The Hi-Fi in West End, for live performances.

FARMERS MARKETS
Jan Power's Farmers Market is a colourful, bustling, open-air market selling fresh farm produce, flowers, breads, meat, fish, poultry, plants and organics. Every Wednesday in Reddacliff Place, The City.

FUN FOR FREE
For free tunes and laughs, the Brisbane Powerhouse is your destination. Every Saturday and Sunday, be entertained by comedians and musicians at Saturday Sessions and Livewired.

BITES BY THE WATER
Fresh breezes and sun-kissed water make very pleasant companions at some of the city's best riverside restaurants.
With interiors by Brisbane's internationally regarded Anna Spiro, Mr & Mrs G Riverbar on Eagle Street Pier offers stunning views inside and out; it's the latest place to go for cocktails, tapas and panoramic views of the Brisbane River and Story Bridge. Across the river at South Bank, the newly opened River Quay precinct includes hot new contenders Stokehouse, Popolo, The Jetty and Cove Bar.

Image credits: Bolshoi Ballet: Le Corsaire © Damir Yusupov. APT7: MadeIn Company / Spread 201009103 (detail) 2010 / Image courtesy: The artists.

Information correct at time of printing. facebook.com/visitbrisbane

THANK YOU TO OUR PATRONS AND SUPPORTERS!
Sponsor tiers: Diamond, Platinum, Gold, Silver, Bronze.