29th IEEE International Conference on
DATA ENGINEERING
BRISBANE, AUSTRALIA | 8 – 11 APRIL 2013
www.icde2013.org
IEEE Technical Committee on Data Engineering
tab.computer.org/tcde
What is the Technical Committee on Data Engineering? The Technical Committee on Data Engineering (TCDE) of the IEEE Computer Society is concerned with the role of data in the design, development, management and utilization of information systems. The TCDE sponsors the International Conference on Data Engineering (ICDE). It publishes a quarterly newsletter, the Data Engineering Bulletin. There are approximately 1500 members of the TCDE.

How to join the Technical Committee? If you are a member of the IEEE Computer Society, you may join the TCDE and receive copies of the Data Engineering Bulletin without cost. To become a member, follow the membership form link from this page: tab.computer.org/tcde
IEEE Computer Society
Member benefits and services:
www.computer.org/portal/web/membership/join
Stay ahead of the technology curve with easy access to the most up-to-date
and advanced information in the computing world. Advance your career
with access to top e-learning courses, online books and leading publications
in your area of expertise. Network with the world’s foremost technology
professionals. Lead the community with volunteering and mentoring
opportunities that enable you to both gain exposure and contribute to the
field as an author and reviewer.
Table of Contents

Message from the ICDE 2013 Program Committee & General Chairs ... 2
Conference Venue ... 5
Conference at a Glance ... 7

Monday 8 April
  Monday Detailed Program ... 20

Tuesday 9 April
  Tuesday at a Glance ... 8
  Tuesday Detailed Program ... 31

Wednesday 10 April
  Wednesday at a Glance ... 12
  Wednesday Detailed Program ... 65

Thursday 11 April
  Thursday at a Glance ... 16
  Thursday Detailed Program ... 82

Social Program ... 110
Transport Information ... 109
Registration & Information Desk ... 111
Volunteers ... 112
ICDE 2013 Committees ... 113
Message from the ICDE 2013 Program Committee and General Chairs
Established in 1984, ICDE has become a premier forum for the dissemination of data
management research results among researchers, users, practitioners, and developers.
The 29th IEEE International Conference on Data Engineering takes place in Brisbane,
QLD, Australia, from April 8 to 11, 2013. We are proud to present its proceedings.
Each of the main conference days features a keynote by a distinguished scientist: Vishal
Sikka from SAP (April 9), Alon Halevy from Google (April 10) and Gustavo Alonso
from ETH Zurich (April 11).
We thank all authors who submitted their work to ICDE for making the conference
happen. We received 443 paper submissions for the research track, 20 submissions
for the industrial track, and 69 demo proposals. The program committee was
organized into 16 topic-based areas, each headed by an area chair who oversaw
the evaluation of submissions assigned to that area.
The area chairs were: Magda Balazinska, Panos Chrysanthis, Amol Deshpande,
Xin (Luna) Dong, Alan Fekete, Vagelis Hristidis, Paul Larson, Wolfgang Lehner,
Xuemin Lin, Srinivasan Parthasarathy, Jian Pei, Simonas Saltenis, Pierangela Samarati,
Nesime Tatbul, Xifeng Yan and Jeffrey Xu Yu. Each submission was assigned to three
reviewers from the research program committee that consisted of 144 members, the
industrial program committee that consisted of 12 members, and the demo program
committee that consisted of 28 members. The evaluation process had several phases:
assignment of papers to reviewers, reviewing, discussions among reviewers, decision
making by area chairs, consolidation of decisions, and handling of papers assigned for
shepherding. As a result of these efforts, the research program features 95 papers,
the industrial program 8 papers, and the demonstration program 27 demos. The
conference program also includes 9 seminar tutorials and one panel. As a feature
of ICDE conferences in recent years, all papers are presented at a poster session.
Accompanying the main conference are 8 workshops.
The success of ICDE 2013 is in large measure the result of the work of many
committed scientists who, despite the many demands on their time, contributed
their expertise generously to making the conference a success.
We thank the area chairs, mentioned above, and the many program committee
members for all their essential efforts. Next, we thank Sang Cha and Haixun Wang
who served as industrial program chairs; Yoshiharu Ishikawa, Yanchun Zhang and Rui
Zhang who served as demo chairs; Alexandros Labrinidis who served as seminar
chair; Dimitrios Georgakopoulos and Jun Yang who served as panel chairs; Chee Yong
Chan and Kjetil Nørvåg who served as workshops chairs; and the organizers of the
accompanying workshops, including Gottfried Vossen and Min Wang, who chaired the
Ph.D. Symposium.
We also express our deep appreciation of the outstanding work put in over many
months by the organization team: Shazia Sadiq and Heng Tao Shen served as local
organization chairs, Marta Indulska served as finance chair, Mohamed Sharaf served
as web and publicity chair, and Jiaheng Lu and Egemen Tanin served as proceedings
chairs. We thank Kathleen Williamson from The University of Queensland for assisting
with coordination and a wide range of local organization tasks. Carmen Saliba and
Alkenia Winston from the IEEE Computer Society’s Conference Support Services
helped secure the various necessary contracts in a timely manner. We are also thankful
to the many student volunteers from The University of Queensland. We also wish to
acknowledge the CMT team at Microsoft for their assistance.
It has been a pleasure to work with such a committed and insightful group of people
who really care about ICDE and the data management research community. Without
the contributions of all of these people, the conference would not have been a
success.
Two committees consisting of distinguished and trusted members of the data
management community are in charge of identifying the year's best ICDE paper,
namely Divyakant Agrawal, Dan Suciu, Yufei Tao and Gerhard Weikum (chair), and
the most influential paper published at ICDE ten or more years earlier, namely
AnHai Doan, Christos Faloutsos, Donald Kossmann, Samuel Madden and Krithi
Ramamritham, with Paul Larson serving as coordinator.
We also gratefully acknowledge the financial support of our sponsors: SAP as
Diamond Sponsor, Microsoft and Tourism and Events Queensland as Platinum
Sponsors, HP and The University of Queensland as Gold Sponsors, CSIRO, RMIT
University, Oracle Labs, SA Center for Big Data Research at Renmin University, The
University of Melbourne, and NICTA as Silver Sponsors, and Google, NEC, Facebook
and The University of New South Wales as Bronze Sponsors.
Finally, we thank all presenters and conference participants. We hope you all enjoy the
conference!
ICDE 2013 PC Chairs
Christian S. Jensen, Aarhus University, Denmark
Chris Jermaine, Rice University, USA
Xiaofang Zhou, The University of Queensland, Australia
ICDE 2013 General Chairs
Rao Kotagiri, The University of Melbourne, Australia
Beng Chin Ooi, National University of Singapore, Singapore
Message from the Minister for Tourism,
Major Events, Small Business and the
Commonwealth Games, the Honourable Jann
Stuckey MP
Welcome to Brisbane as Queensland’s capital city plays
host to the IEEE International Conference on Data
Engineering (ICDE) for the first time in the event’s
history, celebrating its 29th year in 2013.
The Newman Government is proud to host delegates
from interstate and overseas, as the ICDE provides a
platform to address issues in designing, building, managing
and evaluating advanced data-intensive systems and
applications.
Tourism and events are intrinsically linked. This is why
our Government has merged Events Queensland and
Tourism Queensland into a single entity, Tourism and
Events Queensland, to deliver the best outcomes for the
State.
Business events contribute almost $700 million annually to our economy and events
like ICDE will raise Queensland’s profile and increase visitation to the State.
The 29th IEEE International Conference on Data Engineering joins a growing calendar
of events for Queensland which, for Brisbane, includes the QPAC International Series
featuring the Bolshoi Ballet, British & Irish Lions Tour, Brisbane Festival and many more.
I wish you a successful conference and hope you enjoy your stay in Queensland’s
capital city.
The Honourable Jann Stuckey
Queensland Minister for Tourism, Major Events, Small Business and the
Commonwealth Games
Conference Venue
ICDE 2013 will be held at the Sofitel Brisbane Central Hotel, with its main entrance
at 249 Turbot Street, Brisbane CBD, conveniently located above Central Railway Station.
The Sofitel also has a secondary entrance off Ann Street.
Below is a map of the Sofitel Hotel’s conference centre. All conference sessions and
catering will be held in this area. Please note that the only room not located on this
floor is the Odeon room, which is located down the escalator on the ground floor.
[Floor plan of the Sofitel conference centre: Ballroom Le Grande (Ballrooms 1-3),
Bastille 1, Bastille 2, Concorde, St Germain, Trocadero and the Ann Street Lobby,
with lifts on this level; the Odeon room is on the ground floor, via the escalators.]
ICDE 2013 Conference at a Glance

Monday 8 April

9 - 10:30AM
  W1 DESWEB (Full Day) (Bastille 2)
  W2 SMDB (Full Day) (St Germaine)
  W3 PrivDB (Full Day) (Concorde)
  W4 MoDA (Half Day) (Ballroom 3)
  W5 DGSS (Full Day) (Bastille 1)
  W6 GDM (Full Day) (Ballroom 1)
  W7 DMC (Full Day) (Ballroom 2)
  PhD Symposium (Odeon)
10:30 - 11AM Break
11AM - 12:30PM All workshops continue
12:30 - 1:30PM Lunch
1:30 - 3PM Full day workshops continue
3 - 3:30PM Break
3:30 - 5PM Full day workshops continue
7 - 9PM IEEE TCDE Member Reception (Summit Restaurant, Mt Coot-Tha)

Tuesday 9 April

9 - 9:30AM Opening (Ballroom 1)
9:30 - 10:30AM Keynote: Vishal Sikka (SAP AG) (Ballroom 1)
10:30 - 11AM Break
11 - 12:30PM
  R1 Main Memory Databases (Ballroom 1)
  R2 MapReduce Algorithms (St Germaine)
  R3 Data History (Bastille 1)
  R4 Top-k Query in Uncertain Data (Bastille 2)
  Industry 1 (Concorde)
  Seminar 1 (Ballroom 2)
  Seminar 2 (Odeon)
12:30 - 2PM Lunch
2 - 3:30PM
  Demo Groups 1 & 2 (Ballroom 2)
  R5 Uncertainty in Spatial Data (Ballroom 1)
  R6 Data Extraction (St Germaine)
  R7 Trajectory Databases (Bastille 1)
  R8 Social Networks (Bastille 2)
  Industry 2 (Concorde)
  Seminar 3 (Odeon)
3:30 - 4PM Break
4 - 5:30PM
  Demo Groups 1 & 2 (Ballroom 2)
  R9 Indexing Structures (Ballroom 1)
  R10 Main Memory Query Processing (St Germaine)
  R11 Data Mining I (Bastille 1)
  R12 Moving Objects (Bastille 2)
  Industry 3 (Concorde)
  Seminar 4 (Odeon)
5:30 - 7PM Welcome Reception (Lobby)
Wednesday 10 April

9 - 10AM Keynote: Alon Halevy (Google Inc) (Ballroom Le Grand)
10 - 10:30AM Break
10:30 - 12PM
  R13 Data Cleaning (St Germaine)
  R14 Social Media I (Bastille 1)
  R15 Data Trust (Bastille 2)
  R16 Data on the Cloud (Concorde)
  Seminar 5 (Odeon)
12 - 1:30PM SAP Business Lunch (Ballroom Le Grand)
1:30 - 2PM ICDE Award Presentations (Ballroom Le Grand)
2 - 3PM Keynote: 10 Year Most Influential Papers (Ballroom Le Grand)
3 - 3:30PM Break
3:30 - 5PM
  R17 Similarity Ranking (St Germaine)
  R18 Spatial Databases (Bastille 1)
  R19 Social Media II (Bastille 2)
  R20 Trees & XML (Concorde)
  Seminar 6 (Odeon)
5 - 6PM Posters Session Commences (Ballroom 1 & 2)
6:30 - 10PM Banquet (Brisbane City Hall)

Thursday 11 April

9 - 10AM Keynote: Gustavo Alonso (ETH Zurich) (Ballroom 1 & 2)
10 - 10:30AM Break
10:30 - 12PM
  Panel: Big Data for the Public (Ballroom 1 & 2)
  Demo Groups 3 & 4 (Ballroom 3)
  R21 Security and Privacy (St Germaine)
  R22 Randomized Algorithms for Graphs (Bastille 1)
  R23 Distributed Data Processing (Bastille 2)
  R24 Data Mining II (Concorde)
  Seminar 7 (Odeon)
12 - 1:30PM Lunch
1:30 - 3PM
  Demo Groups 3 & 4 (Ballroom 3)
  R25 Lineage & Provenance (St Germaine)
  R26 Similarity Search (Bastille 1)
  R27 Shortest & Direct Query (Bastille 2)
  R28 Skyline & Snapshot Query (Concorde)
  Seminar 8 (Odeon)
3 - 3:30PM Break
3:30 - 5PM
  R29 Large Graph Indexing (St Germaine)
  R30 Web Data (Bastille 1)
  R31 Query Optimization (Bastille 2)
  R32 Data Storage (Concorde)
  Seminar 9 (Odeon)
  Posters & Drinks, until 6 PM (Ballroom 1 & 2)
Tuesday 9 April at a Glance: Keynote, Seminar, Industry & Demo Sessions

9 - 9:30AM Opening (Ballroom 1)
9:30 - 10:30AM Keynote: Vishal Sikka (SAP AG, p. 31) (Ballroom 1)
10:30 - 11AM Break
11 - 12:30PM (for details of the research sessions on Tuesday 9 April, see the following pages)
  Seminar 1: Machine Learning on Big Data (p. 38) (Ballroom 2)
  Seminar 2: Big Data Integration (p. 38) (Odeon)
  Industry 1 (p. 39) (Concorde)
    Invited Talk: Big Data Analytics at Facebook
    Invited Paper: Data Services for E-tailers Leveraging Search Engine Assets
    Invited Paper: SAP HANA Distributed In-Memory Database System: Transaction, Session, and Metadata Management
12:30 - 2PM Lunch
2 - 3:30PM
  Demo Groups 1 & 2 (p. 49) (Ballroom 2)
    Twitter+: Build Personalized Newspaper For Twitter
    A Generic Database Benchmarking Service
    Aeolus: An Optimizer for Distributed Intra-Node-Parallel Streaming Systems
    Crowd-Answering System via Microblogging
    With a Little Help from My Friends
    Peeking into the Optimization of Data Flow Programs with MapReduce-style UDFs
    Very Fast Estimation for Result and Accuracy of Big Data Analytics: the EARL System
  Seminar 3: Workload Management for Big Data Analytics (p. 47) (Odeon)
  Industry 2 (p. 47) (Concorde)
    Invited Paper: HFMS: Managing the Lifecycle and Complexity of Hybrid Analytic Data Flows
    Invited Paper: KuaFu: Closing the Parallelism Gap in Database Replication
    Materialization Strategies in the Vertica Analytic Database: Lessons Learned
3:30 - 4PM Break
4 - 5:30PM
  Demo Groups 1 & 2 (p. 49) (Ballroom 2)
    Road Network Mix-zones for Anonymous Location Based Services
    Query Time Scaling of Attribute Values in Interval Timestamped Databases
    Extracting Interesting Related Context-dependent Concepts from Social Media Streams using Temporal Distributions
    VERDICT: Privacy-Preserving Authentication of Range Queries in Location-based Services
    Real-time Abnormality Detection System for Intensive Care Management
    ExpFinder: Finding Experts by Graph Pattern Matching
  Seminar 4: Knowledge Harvesting from Text and Web Sources (p. 62) (Odeon)
  Industry 3 (p. 62) (Concorde)
    Pipe Break Prediction: A Data Mining Method
    SASH: Enabling Continuous Incremental Analytic Workflows on Hadoop
    Automating Pattern Discovery for Rule Based Data Standardization Systems
5:30 - 7PM Welcome Reception (Lobby)
Tuesday 9 April at a Glance: Research Sessions

9 - 9:30AM Opening (Ballroom 1)
9:30 - 10:30AM Keynote: Vishal Sikka (SAP AG, p. 31) (Ballroom 1)
10:30 - 11AM Break
11 - 12:30PM
  R1 Main Memory Databases (p. 32) (Ballroom 1)
    CPU and Cache Efficient Management of Memory-Resident Databases
    Identifying Hot and Cold Data in Main-Memory Databases
    The Adaptive Radix Tree: ARTful Indexing for Main-Memory Databases
  R2 MapReduce Algorithms (p. 33) (St Germaine)
    Finding Connected Components in Map-Reduce in Logarithmic Rounds
    Enumerating Subgraph Instances Using Map-Reduce
    Scalable Maximum Clique Computation Using MapReduce
  R3 Data History (p. 34) (Bastille 1)
    Time Travel in a Scientific Array Database
    Time Travel in Column Stores
    Ficklebase: Looking into the Future to Erase the Past
  R4 Top-k Query in Uncertain Data (p. 36) (Bastille 2)
    Top-k Query Processing in Probabilistic Databases with Non-Materialized Views
    Cleaning Uncertain Data for Top-k Queries
    Top-K Oracle: A New Way to Present Top-K Tuples for Uncertain Data
12:30 - 2PM Lunch
2 - 3:30PM
  R5 Uncertainty in Spatial Data (p. 40) (Ballroom 1)
    Voronoi-based Nearest Neighbor Search for Multi-Dimensional Uncertain Databases
    Interval Reverse Nearest Neighbor Queries on Uncertain Data with Markov Correlations
    Efficient Tracking and Querying for Coordinated Uncertain Mobile Objects
  R6 Data Extraction (p. 42) (St Germaine)
    Attribute Extraction and Scoring: A Probabilistic Approach
    TYPifier: Inferring the Type Semantics of Structured Data
    SUSIE: Search Using Services and Information Extraction
  R7 Trajectory Databases (p. 43) (Bastille 1)
    Towards Efficient Search for Activity Trajectories
    On Discovery of Gathering Patterns from Trajectories
    Destination Prediction by Sub-Trajectory Synthesis and Privacy Protection Against Such Prediction
  R8 Social Networks (p. 45) (Bastille 2)
    Scalable and Parallelizable Processing of Influence Maximization for Large-Scale Social Networks
    SociaLite: Datalog Extensions for Efficient Social Network Analysis
    LinkProbe: Probabilistic Inference on Large-Scale Social Networks
3:30 - 4PM Break
4 - 5:30PM
  R9 Indexing Structures (p. 55) (Ballroom 1)
    The Bw-Tree: A B-tree for New Hardware Platforms
    Secure and Efficient Range Queries on Outsourced Databases Using R̂-trees
    An Efficient and Compact Indexing Scheme for Large-scale Data Store
  R10 Main Memory Query Processing (p. 57) (St Germaine)
    Recycling in Pipelined Query Evaluation
    Efficient Many-Core Query Execution in Main Memory Column-Stores
    Main-Memory Hash Joins on Multi-Core CPUs: Tuning to the Underlying Hardware
  R11 Data Mining I (p. 58) (Bastille 1)
    Coupled Clustering Ensemble: Incorporating Coupling Relationships Both between Base Clusterings and Objects
    Focused Matrix Factorization For Audience Selection in Display Advertising
    Graph Stream Classification using Labeled and Unlabeled Graphs
  R12 Moving Objects (p. 60) (Bastille 2)
    Large-Scale Dynamic Taxi Ridesharing Service
    Efficient Notification of Meeting Points for Moving Groups via Independent Safe Regions
    Efficient Distance-Aware Query Evaluation on Indoor Moving Objects
Wednesday 10 April at a Glance: Keynote, Seminar, Industry & Demo Sessions

9 - 10AM Keynote: Alon Halevy (Google Inc., p. 65) (Ballroom Le Grand)
10 - 10:30AM Break
10:30 - 12PM (for details of the research sessions on Wednesday 10 April, see the following pages)
  Seminar 5: Sorting in Space: Multidimensional, Spatial, and Metric Data Structures for Applications in Spatial Databases, Geographic Information Systems (GIS), and Location-based Services (p. 72) (Odeon)
12 - 1:30PM SAP Business Lunch (Ballroom Le Grand)
1:30 - 2PM ICDE Award Presentations (p. 72) (Ballroom Le Grand)
2 - 3PM Keynote: 10 Year Most Influential Papers (p. 72) (Ballroom Le Grand)
3 - 3:30PM Break
3:30 - 5PM Seminar 6: Triples in the clouds (p. 81) (Odeon)
6:30 - 10PM Banquet (Brisbane City Hall, Auditorium, King George Square)
Wednesday 10 April at a Glance: Research Sessions

9 - 10AM Keynote: Alon Halevy (Google Inc., p. 65) (Ballroom Le Grand)
10 - 10:30AM Break
10:30 - 12PM
  R13 Data Cleaning (p. 66) (St Germaine)
    HANDS: A Heuristically Arranged Non-Backup In-line Deduplication System
    Holistic Data Cleaning: Putting Violations Into Context
    Inferring Data Currency and Consistency for Conflict Resolution
  R14 Social Media I (p. 67) (Bastille 1)
    LSII: An Indexing Structure for Exact Real-Time Search on Microblogs
    Utilizing Social Pressure in Recommender Systems
    Presenting Diverse Location Views with Real-time Near-duplicate Photo Elimination
  R15 Data Trust (p. 69) (Bastille 2)
    Publicly Verifiable Grouped Aggregation Queries on Outsourced Data Streams
    Trustworthy Data from Untrusted Databases
    On the Relative Trust between Inconsistent Data and Inaccurate Constraints
  R16 Data on the Cloud (p. 70) (Concorde)
    Catch the Wind: Graph Workload Balancing on Cloud
    EAGRE: Towards Scalable I/O Efficient SPARQL Query Evaluation on the Cloud
    C-Cube: Elastic Continuous Clustering in the Cloud
12 - 1:30PM SAP Business Lunch (Ballroom Le Grand)
1:30 - 2PM ICDE Award Presentations (p. 72) (Ballroom Le Grand)
2 - 3PM Keynote: 10 Year Most Influential Papers (p. 72) (Ballroom Le Grand)
3 - 3:30PM Break
3:30 - 5PM
  R17 Similarity Ranking (p. 73) (St Germaine)
    Efficient Search Algorithm for SimRank
    Towards Efficient SimRank Computation on Large Graphs
    RoundTripRank: Graph-based Proximity with Importance and Specificity
  R18 Spatial Databases (p. 76) (Bastille 1)
    Finding Distance-Preserving Subgraphs in Large Road Networks
    Maximum Visibility Queries in Spatial Databases
    Memory-Efficient Algorithms for Spatial Network Queries
  R19 Social Media II (p. 77) (Bastille 2)
    A Unified Model for Stable and Temporal Topic Detection from Social Media Data
    Crowdsourced Enumeration Queries
    On Incentive-based Tagging
  R20 Trees & XML (p. 79) (Concorde)
    Ontology-based subgraph querying
    Stratification Driven Placement of Complex Data: A Framework for Distributed Data Analytics
    Optimizing Approximations of Query Lineage in Probabilistic XML
Thursday 11 April at a Glance: Keynote, Seminar, Industry & Demo Sessions

9 - 10AM Keynote: Gustavo Alonso (ETH Zurich, p. 82) (Ballroom 1)
10 - 11:30AM (for details of the research sessions on Thursday 11 April, see the following pages)
  Seminar 7: Querying Encrypted Data (p. 89) (Odeon)
  Demo Groups 3 & 4 (p. 90) (Ballroom 3)
    Pigora: An Integration System for Probabilistic Data
    Complex Pattern Matching in Complex Structures: the XSeq Approach
    T-Music: A Melody Composer based on Frequent Pattern Mining
    SHARE: Secure information sHaring frAmework for emeRgency management
    KORS: Keyword-aware Optimal Route Search System
    CrowdPlanr: Planning Made Easy with Crowd
    ASVTDECTOR: A Practical Near Duplicate Video Retrieval System
11:30 - 12PM Panel: Big Data for the Public (p. 89) (Ballroom 1 & 2)
12 - 1:30PM Lunch
1:30 - 3PM
  Seminar 8: Shallow Information Extraction for the Knowledge Web (p. 102) (Odeon)
  Demo Groups 3 & 4 (p. 90) (Ballroom 3)
    YumiInt - A Deep Web Integration System for Local Search Engines for Geo-referenced Objects
    A Demonstration of the G* Graph Database System
    RECODS: Replica Consistency-On-Demand Store
    SODIT: An Innovative System for Outlier Detection using Multiple Localized Thresholding and Interactive Feedback
    COLA: A Cloud-based System for Online Aggregation
    Tajo: A Distributed Data Warehouse System on Large Clusters
    RoadAlarm: a Spatial Alarm System on Road Networks
3 - 3:30PM Break
3:30 - 5PM Seminar 9: Secure and Privacy-Preserving Database Services in the Cloud (p. 108) (Odeon)
Posters & Drinks, until 6 PM (Ballroom 1 & 2)
Thursday 11 April at a Glance: Research Sessions

9 - 10AM Keynote: Gustavo Alonso (ETH Zurich, p. 82) (Ballroom 1)
10 - 11:30AM
  R21 Security & Privacy (p. 83) (St Germaine)
    Secure Nearest Neighbor Revisited
    Accurate and Efficient Private Release of Datacubes and Contingency Tables
    Differentially Private Grids for Geospatial Data
  R22 Randomized Algorithms for Graphs (p. 85) (Bastille 1)
    Faster Random Walks By Rewiring Online Social Networks On-The-Fly
    Sampling Node Pairs Over Graphs
    Link Prediction across Networks by Biased Cross-Network Sampling
  R23 Distributed Data Processing (p. 86) (Bastille 2)
    Interval Indexing and Querying on Key-Value Cloud Stores
    Robust Distributed Stream Processing
  R24 Data Mining II (p. 87) (Concorde)
    Learning to Rank from Distant Supervision: Exploiting Noisy Redundancy for Relational Entity Search
    AFFINITY: Efficiently Querying Statistical Measures on Time-Series Data
    Forecasting the Data Cube: A Model Configuration Advisor for Multi-Dimensional Data Sets
11:30 - 12PM Break
12 - 1:30PM Lunch
1:30 - 3PM
  R25 Lineage & Provenance (p. 96) (St Germaine)
    SubZero: a Fine-Grained Lineage System for Scientific Databases
    Logical Provenance in Data-Oriented Workflows
    Revision Provenance in Text Documents of Asynchronous Collaboration
  R26 Similarity Search (p. 97) (Bastille 1)
    Inverted Linear Quadtree: Efficient Top K Spatial Keyword Search
    Similarity Query Processing for Probabilistic Sets
    Top-k String Similarity Search with Edit-Distance Constraints
  R27 Shortest & Direct Query (p. 98) (Bastille 2)
    On Shortest Unique Substring Queries
    Engineering Generalized Shortest Path Queries
    Efficient Direct Search on Compressed Genomic Data
  R28 Skyline & Snapshot Query (p. 100) (Concorde)
    On Answering Why-not Questions in Reverse Skyline Queries
    Layered Processing of Skyline-Window-Join (SWJ) Queries using Iteration-Fabric
    Efficient Snapshot Retrieval over Historical Graph Data
3 - 3:30PM Break
3:30 - 5PM
  R29 Large Graph Indexing (p. 102) (St Germaine)
    FERRARI: Flexible and Efficient Reachability Range Assignment for Graph Indexing
    gIceberg: Towards Iceberg Analysis in Large Graphs
    Top-k Graph Pattern Matching over Large Graphs
  R30 Web Data (p. 103) (Bastille 1)
    Breaking the Top-k Barrier of Hidden Web Databases
    Automatic Extraction of Top-k Lists from the Web
    Finding Interesting Correlations with Conditional Heavy Hitters
  R31 Query Optimisation (p. 105) (Bastille 2)
    Predicting Query Execution Time: Are Optimizer Cost Models Really Unusable?
    Query Optimization for Differentially Private Data Management Systems
    Top Down Plan Generation: From Theory to Practice
  R32 Data Storage (p. 107) (Concorde)
    TBF: A Memory-Efficient Replacement Policy for Flash-based Caches
    Fast Peak-to-Peak Behavior with SSD Buffer Pool
    SELECT Triggers for Data Auditing
DETAILED PROGRAM FOR MONDAY 8 APRIL

Monday 8 April
Workshops co-located with ICDE 2013
Workshop 1: Data Engineering Meets the Semantic Web – DESWEB
9.00 – 5.00pm Bastille 2
Keynote: Truth Finding on the Deep Web
Xin Luna Dong (Google Inc.)
Regular Papers
WARP: Workload-Aware Replication and Partitioning for RDF
Katja Hose (Aalborg University) Ralf Schenkel (Max Planck Institute for Informatics)
SESM: Semantic Enrichment of Schema Mappings
Yoones A. Sekhavat, Jeffrey Parsons (Memorial University of Newfoundland)
Introducing Shadows: Flexible Document Representation and Annotation on the
Web
Matheus Silva Mota, Claudia Bauzer Medeiros (University of Campinas)
Keynote: Learning to Predict Missing Edges in Real-World Interest Graphs: An
Infinitely Scalable Cloud Approach
Ralf Herbrich (Amazon)
Late-breaking Results, Visions and Challenges
Automated Educated Guessing
Aleksandar Stupar, Sebastian Michel (Saarland University)
Eight Fallacies when Querying the Web of Data
Jürgen Umbrich (National University of Ireland) Claudio Gutierrez (Universidad de
Chile) Aidan Hogan, Marcel Karnstedt, Josiane Xavier Parreira (National University of
Ireland)
Hybrid Graph and Relational Query Processing in Main Memory
Martin Grund, Philippe Cudré-Mauroux (University of Fribourg) Jens Krüger, Hasso
Plattner (Hasso Plattner Institute)
A Vision for SPARQL Multi-Query Optimization on MapReduce
Kemafor Anyanwu (North Carolina State University)
Recommending Environmental Knowledge As Linked Open Data Cloud Using
Semantic Machine Learning
Ahsan Morshed, Ritaban Dutta (CSIRO) Jagannath Aryal (University of Tasmania)
Workshop 2: Self-Managing Database Systems - SMDB
9.00 – 5.00pm St Germaine
Keynote
Timos Sellis (RMIT University)
Applications of Self-Management
Realistic Tenant Traces for Enterprise DBaaS
Jan Schaffner (Hasso Plattner Institute) Tim Januschowski (SAP Innovation Center)
Model Ensemble Tools for Self-Management in Data Centers
Jin Chen (University of Toronto) Gokul Soundararajan (NetApp / University of
Toronto) Saeed Ghanbari, Cristiana Amza (University of Toronto)
Total Operator State Recall -- Cost-effective Reuse of Results in Greenplum
Database
George C. Caragea, Carlos Garcia-Alvarado, Michalis Petropoulos, Florian M. Waas
(Greenplum, A Division of EMC)
Foundations of Self-Management
INUM+: A Leaner, More Accurate and More Efficient Fast What-if Optimizer
Rui Wang, Quoc Trung Tran, Ivo Jimenez, Neoklis Polyzotis (University of California
Santa Cruz)
Automatic Schema Design for Co-Clustered Tables
Stephan Baumann (Ilmenau University of Technology) Peter Boncz (Centrum
Wiskunde & Informatica) Kai-Uwe Sattler (Ilmenau University of Technology)
Performance Optimization for Distributed Intra-Node-Parallel Streaming Systems
Matthias J. Sax (Humboldt-Universität zu Berlin) Malu Castellanos, Qiming Chen,
Meichun Hsu (Hewlett-Packard Laboratories)
Self-managing Load Shedding for Data Stream Management Systems
Thao N. Pham, Panos K. Chrysanthis, Alexandros Labrinidis (University of Pittsburgh)
Panel: Self-management and Big Data
Workshop 3: Privacy-Preserving Data Publication and Analysis - PrivDB
9.00 – 5.00pm
Concorde
Keynote: Challenges to De-anonymization and Privacy protection in Online
Advertising
Peng Liu (Sohu Inc.)
Research Session 1
Empirical Privacy and Empirical Utility of Anonymized Data
Graham Cormode, Cecilia M. Procopiuc (AT&T Labs-Research) Entong Shen
(North Carolina State University) Divesh Srivastava (AT&T Labs-Research) Ting Yu
(North Carolina State University)
Privacy-Protecting Index for Outsourced Databases
Chung-Min Chen, Andrzej Cichocki, Allen McIntosh, Euthimios Panagos (Applied
Communication Sciences)
On Syntactic Anonymity and Differential Privacy
Chris Clifton (Purdue University) Tamir Tassa (The Open University, Israel)
Tutorial: Building Blocks of Privacy: Differentially Private Mechanisms
Graham Cormode (AT&T Labs-Research)
Invited Talk: Accurate Analysis of Large Private Datasets
Vibhor Rastogi (Google Inc.)
Research Session 2
On Information Leakage by Indexes over Data Fragments
Sabrina De Capitani di Vimercati, Sara Foresti (Università degli Studi di Milano)
Sushil Jajodia (George Mason University) Stefano Paraboschi (Università degli Studi
di Bergamo) Pierangela Samarati (Università degli Studi di Milano)
Privacy against Aggregate Knowledge Attacks
Olga Gkountouna, Katerina Lepenioti (National Technical University of Athens)
Manolis Terrovitis (Institute for the Management of Information Systems)
Workshop 4: Mobile Data Analytics - MoDA
9.00 – 12.30pm Ballroom 3
Session 1 - Research Papers
Client-Centric OLAP on Mobile Devices
Zheng Xu, Wo-Shun Luk (Simon Fraser University) Stephen Petchulat (SAP
Research Canada)
Signature Generation for Sensitive Information Leakage in Android Applications
Hiroki Kuzuno, Satoshi Tonami (SECOM)
RFID Based Vehicular Networks for Smart Cities
Joydeep Paul, Baljeet Malhotra, Simon Dale (SAP Next Business and Technology)
Meng Qiang (National University of Singapore)
Session 2 - Invited Papers
ShareLikesCrowd: Mobile Analytics for Participatory Sensing and Crowd-sourcing
Applications
Arkady Zaslavsky, Prem Prakash Jayaraman (ICT Centre, CSIRO) Shonali
Krishnaswamy (I2R Singapore)
Strong Location Privacy: A Case Study on Shortest Path Queries
Kyriakos Mouratidis (Singapore Management University)
On the Link(s) Between “D” and “A” in Mobile Data Analytics
Goce Trajcevski (Northwestern University)
Workshop 5: Data-Driven Decision Guidance and Support Systems
- DGSS
9.00 – 5.00pm
Bastille 1
Keynote: Will Internet of Things Flood DGSS with Data?
Arkady Zaslavsky (ICT Centre, CSIRO)
Using Military Operational Planning System Data to Drive Reserve Stocking
Decisions
Rajesh Thiagarajan, Mirza Arif Mekhtiev, Greg Calbert, Nikifor Jeremic, Don Gossink
(Defence Science and Technology Organisation)
SmartCart: A Consolidated Shopping Cart for Pareto-Optimal Sourcing and Fair
Discount Distribution
Brian Goodhart, Venkata Yerneni, Alex Brodsky, Venkata Rudraraju, Nathan Egge
(George Mason University)
Water Desalination Supply Chain Modelling and Optimization
Malak T. Al-Nory, Stephen C. Graves (Massachusetts Institute of Technology)
Using Graphical Models and Multi-attribute Utility Theory for Probabilistic
Uncertainty Handling in Large Systems, with Application to the Nuclear
Emergency Management
Manuele Leonelli, James Q. Smith (The University of Warwick)
Multivariate Data-Driven Decision Guidance for Clinical Scientists
Frada Burstein, Daswin De Silva (Monash University) Herbert F. Jelinek (Charles
Sturt University) Andrew Stranieri (University of Ballarat)
ODSS: A Decision Support System for Ocean Exploration
Kevin Gomes, Danelle Cline, Duane Edgington, Michael Godin, Thom Maughan, Mike
McCann, Tom O'Reilly, Fred Bahr, Francisco Chavez, Monique Messié (Monterey Bay
Aquarium Research Institute) Jnaneshwar Das (University of Southern California)
Kanna Rajan (Monterey Bay Aquarium Research Institute)
Wrap-Up
Discussion on the future and organization of DGSS
Workshop 6: Graph Data Management: Techniques and Applications
- GDM
9.00 – 5.00pm
Ballroom 1
Keynote: Efficient Processing of Complex Join Queries over Graphs on the Cloud
Lei Chen (Hong Kong University of Science and Technology)
Research Session 1
SuReQL: A Subgraph Match Based Relational Model for Large Graphs (Short
Paper)
Shijie Zhang, Jiong Yang, Boya Sun (Case Western Reserve University)
Ranking Outlier Nodes in Subspaces of Attributed Graphs
Emmanuel Müller (Karlsruhe Institute of Technology / University of Antwerp) Patricia
Iglesias Sánchez, Yvonne Mülle, Klemens Böhm (Karlsruhe Institute of Technology)
Chordless Cycles in Networks
John Pfaltz (University of Virginia)
Keynote
Haixun Wang (Microsoft Research)
Research Session 2
PSOGD: A New Method for Graph Drawing
Jianhua Qu (Shandong Normal University), Yi Song, Stéphane Bressan (National
University of Singapore)
Clustering Remote RDF Data Using SPARQL Update Queries
Letao Qi, Harris Lin, Vasant Honavar (Iowa State University)
Workshop 7: Data Management in the Cloud - DMC
9.00 – 5.00pm
Ballroom 2
Keynote: Amr El Abbadi (University of California, Santa Barbara)
Paper Session 1
HotROD: Managing Grid Storage with On-Demand Replication
Sriram Rao (Microsoft Research) Benjamin Reed (Osmeta Inc.) Adam Silberstein
(Trifacta Inc.)
Materialized Views for Eventually Consistent Record Stores
Changjiu Jin, Rui Liu, Kenneth Salem (University of Waterloo)
Packing Light: Portable Workload Performance Prediction for the Cloud
Jennie Duggan (Brown University) Yun Chi, Hakan Hacıgümüş, Shenghuo Zhu (NEC
Laboratories America) Ugur Çetintemel (Brown University)
Paper Session 2
P-Mine: Parallel Itemset Mining on Large Datasets
Elena Baralis, Tania Cerquitelli, Silvia Chiusano, Alberto Grand (Politecnico di Torino)
Towards Dynamic Pricing-Based Collaborative Optimizations for Green Data
Centers
Yang Li (University of Pennsylvania) David Chiu (Washington State University)
Changbin Liu (AT&T Labs-Research) Linh T.X. Phan, Tanveer Gill, Sanchit Aggarwal,
Zhuoyao Zhang, Boon Thau Loo (University of Pennsylvania) David Maier (Portland
State University) Bart McManus (Bonneville Power Administration - TOT/DITT2)
ISP Business Models in Caching
Jörn Künsemöller (University Paderborn) Nan Zhang (Aalto University) João Soares
(Portugal Telecom Inovação)
Panel Discussion
ICDE-13 PhD Symposium
9.00 – 5.00pm
Odeon
It’s all about Data
Taming the Metadata Mess
V.M. Megler (Portland State University)
The rapid growth of scientific data shows no sign of abating. This growth has
led to a new problem: with so much scientific data at hand, stored in thousands
of datasets, how can scientists find the datasets most relevant to their research
interests? We have addressed this problem by adapting Information Retrieval
techniques, developed for searching text documents, into the world of (primarily
numeric) scientific data. We propose an approach that uses a blend of automated
and “semi-curated” methods to extract metadata from large archives of scientific
data, then evaluates ranked searches over this metadata. We describe a challenge
identified during an implementation of our approach: the large and expanding
list of environmental variables captured by the archive does not match the list of
environmental variables in the minds of the scientists. We briefly characterize the
problem and describe our initial thoughts on resolving it.
High Quality Information Provisioning and Data Pricing
Florian Stahl (University of Münster)
This paper presents ideas on how to advance the research on high quality
information provisioning and information pricing. To this end, the current state of the
art in combining data curation and information provisioning as well as in regard to
data pricing is reviewed. Based on that, open issues, such as tailoring data to a user’s
need and determining a market value of data, are identified. As a preliminary solution,
it is proposed to investigate the identified problems in an integrated manner.
A Framework of Ontology Guided Data Linkage for Evidence based Knowledge
Extraction and Information Sharing
Mohammed Gollapalli (The University of Queensland)
There has been a surge of interest in developing probabilistic techniques for
linking semantically equivalent datasets. The key objective is to transform the structure
of the induced data into a concise synopsis. Current techniques primarily focus
on performing pair-wise attribute matching and pay little attention to discovering
direct and weighted correlations among ontological clusters through multi-faceted
classification. In this paper, we introduce a novel Ontology Guided Data Linkage
(OGDL) framework for self-organising and discovering schema structures through
constructing hierarchical cluster mapping trees. Furthermore, we extend our
OGDL framework by introducing a novel faceted search engine for semantic
interoperability of data and subsequent decision support analysis, and use it to map
fast cluster browsing, user-friendly querying and semantic reasoning learning needs.
It’s now about Database (and other) Systems
On Answering Why and Why-not Questions in Databases
Md. Saiful Islam (Swinburne University of Technology)
There is a growing interest in allowing users to ask questions on received results
in the hope of improving the usability of database systems. This research aims
at answering the so-called why and why-not questions on received results w.r.t.
different query settings in databases. The main goals of this research are: (i) studying
the problem of answering the why and the why-not questions in databases; (ii)
finding efficient strategies for answering these questions in terms of different
query settings and (iii) finally, developing a framework that can take advantage of
the existing data indexing and query evaluation techniques available to answer
such questions in databases. We believe that the research undertaken by us can
contribute towards improving the usability of traditional database systems.
Towards Elastic Key-value Stores on IaaS
Han Li (University of New South Wales)
Key-value stores such as Cassandra and HBase have gained popularity for their
scalability and high availability in the face of heavy workloads and hardware failure.
Many enterprises are deploying applications backed by key-value stores onto
resources leased from Infrastructure as a Service (IaaS) providers. However, current
key-value stores are unable to take full advantage of the resource elasticity provided
by IaaS providers due to several challenges: i) achieving high-performance data access
in virtualised environments; ii) load-rebalancing as the system scales up and down; and
iii) the lack of autoscaling controllers. In this paper I present my research efforts
on addressing these issues to provide an elastic key-value store deployed in IaaS
environments.
User-Oriented Modelling of Scientific Workflows for High Frequency Event Data
Analysis
Aarthi Natarajan (University of New South Wales)
Research scientists in computational physics, astronomy, environmental science,
genomics and financial services have all been challenged by the analysis of Big Data.
They are required to perform multi-step
analysis tasks to turn this data into actionable insight, from which critical decisions
can be made. Two data processing models that have rapidly evolved in the past
decade to support data analysts are Complex Event Processing and Scientific
Workflows. Our research adds a new dimension to scientific workflows by
extending them to incorporate the handling of event streams, and aims to provide a
more efficient and faster approach to analysing vast amounts of data. Our model also
aims at facilitating conceptual modelling of analytical processes - to enable domain
experts to build abstract, exploratory analysis processes in a user-friendly manner,
without concern for the underlying technology, and transparently maps them to
concrete implementations at run-time.
Short Presentations
Self-organizing Structured RDF in MonetDB
Minh-Duc Pham (Centrum Wiskunde & Informatica)
The semantic web uses RDF as its data model, providing ultimate flexibility for
users to represent and evolve data without need of a schema. Yet, this flexibility
poses challenges in implementing efficient RDF stores, leading to query plans with
very many self-joins over a triple table, difficulties in optimizing these, and a lack of
data locality, since without a notion of a multi-attribute data structure, clustered
indexing opportunities are lost.
often have problems formulating queries as they lack any system-supported notion
of the structure in the data. In this research, we exploit the observation that real
RDF data, while not as regularly structured as relational data, still has the great
majority of triples conforming to regular patterns. We conjecture that a system
that would recognize this structure automatically would both allow RDF stores to
become more efficient and also easier to use. Concretely, we propose to derive
self-organizing RDF that stores data in PSO format in such a way that the regular
parts of the data physically correspond to relational columnar storage; and propose
RDFscan/RDFjoin algorithms that compute star-patterns over these without wasting
effort in self-joins. These regular parts, i.e. tables, are identified on ingestion by a
schema discovery algorithm -- as such users will gain an SQL view of the regular
part of the RDF data. This research aims to produce a state-of-the-art SPARQL
frontend for MonetDB as a by-product, and we already present some preliminary
results on this platform.
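To make the structure-recognition idea concrete, here is a minimal sketch (not the MonetDB implementation) of one plausible first step: grouping subjects by the set of properties they carry, so that each sufficiently regular group can back a relational, column-wise table. The function name and the frozenset keying are illustrative assumptions.

```python
from collections import defaultdict

def discover_tables(triples):
    """Group RDF subjects by the exact set of properties they use; each
    frequent property set is a candidate relational table whose columns
    are the properties. A toy sketch of schema discovery on ingestion."""
    props = defaultdict(set)                  # subject -> set of properties
    for s, p, o in triples:
        props[s].add(p)
    tables = defaultdict(list)                # property set -> member subjects
    for s, ps in props.items():
        tables[frozenset(ps)].append(s)
    return tables

triples = [(":alice", ":name", "Alice"), (":alice", ":age", 30),
           (":bob", ":name", "Bob"), (":bob", ":age", 41),
           (":x7", ":weird", "irregular triple")]
for schema, subjects in discover_tables(triples).items():
    print(sorted(schema), subjects)           # two groups: {age, name} and {weird}
```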
E-Research Event Data Quality
Weisi Chen (University of New South Wales)
One of the most important data types e-Researchers use to conduct analysis
processes is “event data”, which records information of some timed events in a
particular domain. However, real-world event data is usually of poor quality, and
large amounts of money and labour go into tackling the ensuing problems. Existing
solutions to event data quality are very limited, mostly supporting data
quality in general without facilitating event pattern detection; existing
event processing systems, on the other hand, are very inefficient in dealing with data
quality issues. In this research, we have summarised the criteria to address event
data quality issues and compared possible solutions including knowledge-based
systems and event processing systems. We conclude by proposing an approach that
combines a rule-based system with an event processing system in a novel way.
Indexing and Querying Moving Objects In Indoor Spaces
Sultan Alamri (Monash University)
Spatial database indexes are designed to speed up retrievals where it is usually
assumed that the objects of interest remain constant unless explicitly
updated. Therefore, capturing continuously moving objects in traditional spatial
indexes will require frequent updates of the locations of these objects. This paper
outlines a PhD thesis that addresses the challenges of indexing the moving objects in
indoor spaces. The main goal of this thesis is to develop new indoor index structures
for moving objects focusing on the following four challenges: (1) introducing a
query taxonomy for moving objects to illustrate the query types for the databases
of moving objects; (2) introducing an adjacency index structure for moving objects in
indoor spaces; (3) capturing both spatial and temporal properties in an indoor data
structure; (4) introducing an index structure for moving objects in indoor spaces that
is based on a specific type of movement pattern.
Stock Prediction by Searching Similar Candlestick Charts
Zen-Yu Quan (National Central University, Taiwan)
This research applies the content-based image retrieval (CBIR) technique to stock
prediction. In particular, low-level image features, including wavelet texture and
Canny edges, are extracted from candlestick charts. Then, historical candlestick
charts similar to the query chart under these low-level features are retrieved,
and the 'future' stock movements of the retrieved charts are used to predict
the stock price of the query chart.
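A minimal sketch of the retrieval loop this abstract describes; the gradient-orientation histogram below is a stand-in for the wavelet-texture and Canny-edge features named above, and the function names, k = 3 choice, and synthetic data are illustrative assumptions, not the author's system.

```python
import numpy as np

def chart_features(img):
    # Stand-in feature: an 8-bin histogram of gradient orientations.
    # The system described above would use wavelet texture and Canny edges.
    gy, gx = np.gradient(img.astype(float))
    hist, _ = np.histogram(np.arctan2(gy, gx), bins=8, range=(-np.pi, np.pi))
    return hist / max(hist.sum(), 1)

def predict_movement(query_img, history):
    """history: list of (chart_image, future_return) pairs. Retrieve the
    charts most similar to the query and average their 'future' movements."""
    q = chart_features(query_img)
    scored = sorted(history,
                    key=lambda h: np.linalg.norm(q - chart_features(h[0])))
    nearest = scored[:3]                      # k = 3 most similar past charts
    return sum(r for _, r in nearest) / len(nearest)

rng = np.random.default_rng(0)
past = [(rng.random((64, 64)), rng.normal()) for _ in range(20)]
print(predict_movement(rng.random((64, 64)), past))
```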
News Recommendation Based on Web Usage and Web Content Mining
Husna Sarirah Husin (RMIT University)
In the last decade, online newspapers have become a viable alternative to
conventional hardcopy papers. Many studies have shown that digital media have
increased their share of Internet audience. In this study, we use Web usage and Web
content mining techniques to recommend news articles to users. We are using Web
server logs from a Malaysian newspaper, Berita Harian, which will be combined with
the Web content pages to discover the web users’ navigational patterns. We plan to
improve existing Web usage mining techniques in deriving user profiles and find a
novel way to combine the user profiles with Web content pages.
Making the H-index More Relevant: A Step Towards Standard Classes For Citation
Classification
Mohammad Abdullatif (The University of Auckland)
The H-index is gaining popularity as a way of measuring the research impact of an
academic. However, it has been criticized because it gives all citations equal
weight. Citation classification can address this criticism by categorising citations based
on the purpose or function of the citation. An important element for performing
citation classification is the presence of a standard set of classes (known as a
classification scheme) to enable the comparison between the accuracy of the
different techniques currently used to perform citation classification. Such a standard
scheme is not available, and therefore we aim to fill this gap by generating a citation
classification scheme automatically. The scheme is generated by clustering 4 large
datasets of sentences containing citations using X-means. The main contribution of
this research is adapting the similarity distance between verbs extracted from the
citation sentences using WordNet.
Moderated Discussion
How to Survive as a Ph.D. Student – The Do's and Don'ts
Contributions from Alan Fekete (University of Sydney), Johann Christoph Freytag
(Humboldt University), Gottfried Vossen (University of Münster) and others.
DETAILED PROGRAM FOR TUESDAY 9 APRIL
Tuesday 9 April
Conference Opening
9 - 9:30AM Chair: Beng Chin Ooi (National University of Singapore) Ballroom 1 & 2

Keynote 1
9:30 - 10:30AM Chair: Xiaofang Zhou (The University of Queensland) Ballroom 1 & 2

Re-thinking the Performance of Information Processing Systems
Vishal Sikka (SAP AG)
Abstract: Recent advances in hardware and software technologies have enabled us
to re-think how we architect databases to meet the demands of today’s information
systems. However, this makes existing performance evaluation metrics obsolete. In
this paper, I describe SAP HANA, a novel, powerful database platform that leverages
the availability of large main memory and massively parallel processors. Based on
this, I propose a new, multi-dimensional performance metric that better reflects the
value expected from today’s complex information systems.
Bio: Dr. Vishal Sikka is a member of the Executive Board of SAP AG, heading
technology and innovation for the company. Sikka has responsibility for technology
and platform products, including database, especially the industry breakthrough
in-memory database SAP HANA, as well as analytics, mobile, application platform,
and middleware. He drives emerging technologies and advanced development for
the next-generation technology platform, applications, and tools. He also oversees
key technology partnerships, customer co-innovation, and incubation of emerging
businesses. He has global responsibility for SAP Research, as well as academic and
government relations. Sikka has been Chief Technology Officer of SAP since 2007,
responsible for the overall technology, architecture, and product standards across
the entire SAP product portfolio. He is the creator of the concept of “timeless
software,” which underpins SAP architecture and innovation strategy. Sikka holds
a Doctorate in Computer Science from Stanford University in California, and his
experience includes research in Artificial Intelligence, Programming Models and
Automatic Programming, as well as Information Management and Integration – at
Stanford, at Xerox Palo Alto Labs, and as founder of two startup companies.
Research 1: Main Memory Databases
11AM - 12:30PM Chair: Philippe Cudre-Mauroux (MIT) Ballroom 1
CPU and Cache Efficient Management of Memory-Resident Databases
Holger Pirk (Centrum Wiskunde & Informatica) Florian Funke (Technische Universität
München) Martin Grund (Hasso Plattner Institute) Thomas Neumann (Technische
Universität München) Ulf Leser (Humboldt Universität zu Berlin) Stefan Manegold
(Centrum Wiskunde & Informatica) Alfons Kemper (Technische Universität München)
Martin Kersten (Centrum Wiskunde & Informatica)
Memory-Resident Database Management Systems (MRDBMS) have to be optimized
for two resources: CPU cycles and memory bandwidth. To optimize for bandwidth
in mixed OLTP/OLAP scenarios, the hybrid or Partially Decomposed Storage Model
(PDSM) has been proposed. However, in current implementations, bandwidth
savings achieved by partial decomposition come at increased CPU costs. To achieve
the aspired bandwidth savings without sacrificing CPU efficiency, we combine
partially decomposed storage with Just-in-Time (JiT) compilation of queries, thus
eliminating CPU-inefficient function calls. Since existing cost-based optimization
components are not designed for JiT-compiled query execution, we also develop a
novel approach to cost modeling and subsequent storage layout optimization. Our
evaluation shows that the JiT-based processor maintains the bandwidth savings of
previously presented hybrid query processors but outperforms them by two orders
of magnitude due to increased CPU efficiency.
Identifying Hot and Cold Data in Main-Memory Databases
Justin Levandoski (Microsoft Research) Per-Åke Larson (Microsoft Research) Radu Stoica
(École Polytechnique Fédérale de Lausanne)
Main memories are becoming sufficiently large that most OLTP databases could
be stored entirely in main memory, but this may not be the best solution. OLTP
workloads typically exhibit skewed access patterns where some records are hot
(frequently accessed) but many records are cold (infrequently or never accessed).
It is more economical to store the coldest records on secondary storage such as
flash. As a first step towards managing cold data in main-memory databases we
investigate how to efficiently identify hot and cold data. We propose to log record
accesses, possibly only a sample to reduce overhead, and perform offline analysis to
estimate record access frequencies. We present four estimation algorithms based
on exponential smoothing and experimentally evaluate their efficiency and accuracy.
We find that exponential smoothing provides very accurate estimates and close-to-perfect classification. Our most efficient algorithm is able to analyze a log of 1B
accesses in sub-second time on a workstation-class machine.
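As a rough illustration of the estimation idea (not the authors' code), the sketch below applies exponential smoothing to a sampled access log, decaying each record's estimate over the time slices in which it was not seen. The timeslice granularity, the value of alpha, and the assumption of at most one logged access per record per slice are all illustrative.

```python
from collections import defaultdict

def smoothed_access_frequencies(access_log, alpha=0.05):
    """access_log: iterable of (timeslice, record_id), in time order.
    Classic exponential smoothing, est = alpha*x + (1-alpha)*est, applied
    lazily: idle slices decay the old estimate by a power of (1-alpha)."""
    est = defaultdict(float)
    last = {}                                 # last timeslice a record was seen
    for t, rec in access_log:
        est[rec] = alpha + est[rec] * (1 - alpha) ** (t - last.get(rec, t))
        last[rec] = t
    return est

log = [(0, 'a'), (1, 'a'), (2, 'a'), (2, 'b'), (9, 'c')]
freq = smoothed_access_frequencies(log)
hot = sorted(freq, key=freq.get, reverse=True)[:2]   # top-k records stay in RAM
print(hot)                                           # ['a', ...]
```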
The Adaptive Radix Tree: ARTful Indexing for Main-Memory Databases
Viktor Leis, Alfons Kemper, Thomas Neumann (Technische Universität München)
Main memory capacities have grown up to a point where most databases fit
into RAM. For main-memory database systems, index structure performance is
a critical bottleneck. Traditional in-memory data structures like balanced binary
search trees are not efficient on modern hardware, because they do not optimally
utilize on-CPU caches. Hash tables, also often used for main-memory indexes, are
fast but only support point queries. To overcome these shortcomings, we present
ART, an adaptive radix tree (trie) for efficient indexing in main memory. Its lookup
performance surpasses highly tuned, read-only search trees, while supporting very
efficient insertions and deletions as well. At the same time, ART is very space
efficient and solves the problem of excessive worst-case space consumption,
which plagues most radix trees, by adaptively choosing compact and efficient data
structures for internal nodes. Even though ART’s performance is comparable to hash
tables, it maintains the data in sorted order, which enables additional operations like
range scan and prefix lookup.
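To illustrate the adaptive-node idea: the paper uses four inner-node sizes (4, 16, 48 and 256); the sketch below shows only the two extremes, in Python rather than the authors' C++, with all class and method names as illustrative assumptions.

```python
class Node4:
    """Small inner node: up to four (key byte, child) pairs, scanned linearly."""
    def __init__(self):
        self.keys, self.children = [], []

    def find(self, byte):
        for k, c in zip(self.keys, self.children):
            if k == byte:
                return c
        return None

    def add(self, byte, child):
        if len(self.keys) == 4:              # full: grow into the next node type
            grown = Node256()
            for k, c in zip(self.keys, self.children):
                grown.slots[k] = c
            grown.slots[byte] = child
            return grown                     # caller replaces its pointer
        self.keys.append(byte)
        self.children.append(child)
        return self

class Node256:
    """Largest inner node: one directly indexed slot per possible key byte."""
    def __init__(self):
        self.slots = [None] * 256

    def find(self, byte):
        return self.slots[byte]

    def add(self, byte, child):
        self.slots[byte] = child
        return self

root = Node4()
for b, leaf in [(1, "a"), (7, "b"), (9, "c"), (20, "d"), (42, "e")]:
    root = root.add(b, leaf)                 # the fifth insert triggers the upgrade
print(type(root).__name__, root.find(42))   # Node256 e
```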
Research 2: MapReduce Algorithms
11AM - 12:30PM Chair: Shivnath Babu (Duke University) St Germaine
Finding Connected Components in Map-reduce in Logarithmic Rounds
Vibhor Rastogi (Google Inc.) Ashwin Machanavajjhala (Duke University) Laukik Chitnis,
Anish Das Sarma (Google Inc.)
Given a large graph G = (V, E) with millions of nodes and edges, how do we
compute its connected components efficiently? Recent work addresses this
problem in map-reduce, where a fundamental trade-off exists between the number
of map-reduce rounds and the communication of each round. Denoting d the
diameter of the graph, and n the number of nodes in the largest component, all
prior techniques for map-reduce either require a linear, Θ(d), number of rounds,
or a quadratic, Θ(n|V| + |E|), communication per round. We propose here two
efficient map-reduce algorithms: (i) Hash-Greater-to-Min, which is a randomized
algorithm based on PRAM techniques, requiring O(log n) rounds and O(|V| +
|E|) communication per round, and (ii) Hash-to-Min, which is a novel algorithm,
provably finishing in O(log n) iterations for path graphs. The proof technique used
for Hash-to-Min is novel, but not tight, and it is actually faster than Hash-Greater-to-Min
in practice. We conjecture that it requires 2 log d rounds and 3(|V| + |E|)
communication per round, as demonstrated in our experiments. Using secondary
sorting, a standard map-reduce feature, we scale Hash-to-Min to graphs with very
large connected components. Our techniques for connected components can be
applied to clustering as well. We propose a novel algorithm for agglomerative single
linkage clustering in map-reduce. This is the first map-reduce algorithm for clustering
in at most O(log n) rounds, where n is the size of the largest cluster. We show the
effectiveness of all our algorithms through detailed experiments on large synthetic as
well as real-world datasets.
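For intuition, here is a single-machine simulation (an illustrative sketch, not the authors' code) of the Hash-to-Min round: every node v keeps a cluster C(v), sends the whole cluster to its minimum member, and sends that minimum to every other member; reducers union what they receive.

```python
from collections import defaultdict

def hash_to_min(edges):
    # C(v) starts as v plus its neighbours.
    C = defaultdict(set)
    for u, v in edges:
        C[u] |= {u, v}
        C[v] |= {u, v}
    while True:
        msgs = defaultdict(set)               # simulated map-reduce shuffle
        for v, cluster in C.items():
            m = min(cluster)
            msgs[m] |= cluster                # hash the whole cluster to the min
            for u in cluster:
                msgs[u].add(m)                # hash the min to every member
        new_C = dict(msgs)                    # reduce: union per destination
        if new_C == dict(C):                  # fixpoint reached
            break
        C = defaultdict(set, new_C)
    return {v: min(c) for v, c in C.items()}  # component label = min node id

print(hash_to_min([(1, 2), (2, 3), (5, 6)]))
# {1: 1, 2: 1, 3: 1, 5: 5, 6: 5}
```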
Enumerating Subgraph Instances Using Map-Reduce
Foto N. Afrati, Dimitris Fotakis (National Technical University of Athens) Jeffrey D. Ullman
(Stanford University)
The theme of this paper is how to find all instances of a given "sample" graph in
a larger "data graph", using a single round of map-reduce. For the simplest sample
graph, the triangle, we improve upon the best known such algorithm. We then
examine the general case, considering both the communication cost between
mappers and reducers and the total computation cost at the reducers. To minimize
communication cost, we exploit the techniques of Afrati and Ullman (TKDE 2011)
for computing multiway joins (evaluating conjunctive queries) in a single map-reduce
round. Several methods are shown for translating sample graphs into a union of
conjunctive queries with as few queries as possible. We also address the matter of
optimizing computation cost. Many serial algorithms are shown to be “convertible”,
in the sense that it is possible to partition the data graph, explore each partition in
a separate reducer, and have the total computation cost at the reducers be of the
same order as the computation cost of the serial algorithm.
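For intuition, here is a small single-process simulation of the bucket-replication idea for triangles, in the spirit of the multiway-join technique the paper builds on (the partitioning scheme and names are our illustration, not the paper's exact algorithm):

    from collections import defaultdict
    from itertools import combinations_with_replacement

    def triangles_one_round(edges, b=3):
        """One simulated map-reduce round: vertices are hashed into b
        buckets, one reducer per sorted bucket triple (i <= j <= k).
        Assumes a simple undirected graph without self-loops."""
        h = lambda v: hash(v) % b
        reducers = defaultdict(set)

        # Map: replicate each edge to every reducer whose bucket
        # triple contains the buckets of both endpoints.
        for u, v in edges:
            bu, bv = h(u), h(v)
            for triple in combinations_with_replacement(range(b), 3):
                if bu in triple and bv in triple:
                    reducers[triple].add((min(u, v), max(u, v)))

        # Reduce: find triangles locally; emit a triangle only at the
        # reducer matching its own sorted bucket triple, so each
        # triangle is reported exactly once.
        found = set()
        for triple, local in reducers.items():
            adj = defaultdict(set)
            for u, v in local:
                adj[u].add(v)
                adj[v].add(u)
            for u, v in local:
                for w in adj[u] & adj[v]:
                    tri = tuple(sorted((u, v, w)))
                    if tuple(sorted(map(h, tri))) == triple:
                        found.add(tri)
        return found

The communication cost is governed by how often each edge is replicated, which is what the paper's optimization minimizes.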
Scalable Maximum Clique Computation Using MapReduce
Jingen Xiang, Cong Guo, Ashraf Aboulnaga (University of Waterloo)
We present a scalable and fault-tolerant solution for the maximum clique problem
based on the MapReduce framework. The key contribution that enables us to
effectively use MapReduce is a recursive partitioning method that partitions the
graph into several subgraphs of similar size. After partitioning, the maximum cliques
of the different partitions can be computed independently, and the computation
is sped up using a branch and bound method. Our experiments show that our
approach leads to good scalability, which is unachievable by other partitioning
methods since they result in partitions of different sizes and hence lead to load
imbalance. Our method is more scalable than an MPI algorithm, and is simpler and
more fault tolerant.
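The sequential kernel that runs inside each partition can be pictured as a classic branch-and-bound search; a minimal sketch (ours, not the authors' implementation) follows:

    def max_clique(adj):
        """Branch-and-bound maximum clique. `adj` maps each vertex to
        the set of its neighbors."""
        best = []

        def expand(clique, cand):
            nonlocal best
            if len(clique) > len(best):
                best = list(clique)
            # Bound: even taking every candidate cannot beat `best`.
            if len(clique) + len(cand) <= len(best):
                return
            for v in sorted(cand):
                expand(clique + [v], cand & adj[v])
                cand = cand - {v}  # do not re-enumerate v's branches

        expand([], set(adj))
        return best

In the paper's setting, each reducer runs such a search independently on its partition's subgraph, which is why the recursive partitioning into similar-sized subgraphs matters for load balance.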
Research 3: Time Travel in Databases
11AM - 12:30PM
Chair: Robert Ikeda (Stanford University)
Bastille 1
Ficklebase: Looking into the Future to Erase the Past
Sumeet Bajaj, Radu Sion (Stony Brook University)
It has become apparent that in the digital world, data, once stored, is never truly deleted, even when such an expunction is desired, either as a normal system
function or for regulatory compliance purposes. Forensic Analysis techniques on
systems are often successful at recovering information said to have been deleted
in the past. Efforts aimed at thwarting such forensic analysis of systems have either
focused on (i) identifying the system components where deleted data lingers and
performing a secure delete operation over these remnants, or (ii) designing history
independent data structures that hide information about past operations which
result in the current system state. Yet, new data is constantly derived by processing
existing (input) data which makes it increasingly difficult to remove all traces of
this existing data, e.g., for regulatory compliance purposes. Even after deletion,
significant information can linger in and be recoverable from the side effects the
deleted data records left on the currently available state. In this paper we address
this aspect in the context of a relational database, such that when combined with
(i) & (ii), complete erasure of data and its effects can be achieved (“untraceable
deletion”). We introduce Ficklebase – a relational database wherein once a tuple
has been “expired” – any and all its side effects are removed, thereby eliminating all
its traces, rendering it unrecoverable, and also guaranteeing that the deletion itself is
undetectable. We present the design and evaluation of Ficklebase, and then discuss
several of the fundamental functional implications of un-traceable deletion.
Time Travel in a Scientific Array Database
Emad Soroush, Magdalena Balazinska (University of Washington)
In this paper, we present TimeArr, a new storage manager for an array database.
TimeArr supports the creation of a sequence of versions of each stored array and
their exploration through two types of time travel operations: selection of a specific
version of a (sub)-array and a more general extraction of a (sub)-array history, in
the form of a series of (sub)-array versions. TimeArr contributes a combination
of array-specific storage techniques to efficiently support these operations. To
speed-up array exploration, TimeArr further introduces two additional techniques.
The first is the notion of approximate time travel with two types of operations:
approximate version selection and approximate history. For these operations, users
can tune the degree of approximation tolerable and thus trade-off accuracy and
performance in a principled manner. The second is to lazily create short connections,
called skip links, between the same (sub)-arrays at different versions with similar
data patterns to speed up the selection of a specific version. We implement TimeArr
within the SciDB array processing engine and demonstrate its performance through
experiments on two real datasets from the astronomy and earth sciences domains.
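To picture version selection with skip links, consider this simplified sketch (our illustration; we assume backward deltas between consecutive versions and treat a skip link from version v to w as asserting that the (sub)array is unchanged in between, which abstracts from the paper's similarity-based links):

    def materialize(base, deltas, skip_links, version):
        """Reconstruct a (sub)array at `version`. `base` is the cell ->
        value mapping at version 0; `deltas[i]` holds the cells changed
        between versions i-1 and i; `skip_links[v]` optionally points
        to a later version with identical content."""
        array, v = dict(base), 0
        while v < version:
            nxt = skip_links.get(v)
            if nxt is not None and nxt <= version:
                v = nxt  # jump over a run of unchanged versions
            else:
                v += 1
                array.update(deltas[v])  # apply one delta step
        return array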
Time Travel in Column Stores
Martin Kaufmann, Amin A. Manjili (ETH Zürich / SAP AG) Stefan Hildenbrand, Donald
Kossmann (ETH Zürich) Andreas Tonder (SAP AG)
Recent studies have shown that column stores can outperform row stores
significantly. This paper explores alternative approaches to extend column stores
with versioning, i.e., time travel queries and the maintenance of historic data. On
the one hand, adding versioning can actually simplify the design of a column store
because it provides a solution for the implementation of updates, traditionally
a weak point in the design of column stores. On the other hand, implementing
a versioned column store is challenging because it imposes a two-dimensional
clustering problem: should the data be clustered by row or by version? This paper
devises the details of three memory layouts: clustering by row, clustering by
version, and hybrid clustering. Performance experiments demonstrate that all three
approaches outperform a (traditional) versioned row store. The efficiency of these
three memory layouts depends on the query and update workload. Furthermore,
the performance experiments analyze the time-space tradeoff that can be made in
the implementation of versioned column stores.
Research 4: Top-k Queries in Uncertain Data
11AM - 12:30PM
Chair: Wenjie Zhang (University of New South Wales)
Bastille 2
Top-k Query Processing in Probabilistic Databases with Non-Materialized Views
Maximilian Dylla, Iris Miliaraki (Max Planck Institute for Informatics) Martin Theobald
(University of Antwerp)
We investigate a novel approach of computing confidence bounds for top-k ranking
queries in probabilistic databases with non-materialized views. Unlike related
approaches, we present an exact pruning algorithm for finding the top-ranked
query answers according to their marginal probabilities without the need to first
materialize all answer candidates via the views. Specifically, we consider conjunctive
queries over multiple levels of select-project-join views, the latter of which are
cast into Datalog rules which we ground in a top-down fashion directly at query
processing time. To our knowledge, this work is the first to address integrated data
and confidence computations for intensional query evaluations in the context of
probabilistic databases by considering confidence bounds over first-order lineage
formulas. We extend our query processing techniques by a tool-suite of scheduling
strategies based on selectivity estimation and the expected impact on confidence
bounds. Further extensions to our query processing strategies include improved
top-k bounds in the case when sorted relations are available as input, as well as
the consideration of recursive rules. Experiments with large datasets demonstrate
significant runtime improvements of our approach compared to both exact and
sampling-based top-k methods over probabilistic data.
Cleaning Uncertain Data for Top-k Queries
Luyi Mo, Reynold Cheng, Xiang Li, David W. Cheung, Xuan S. Yang (The University of Hong
Kong)
The information managed in emerging applications, such as sensor networks,
location-based services, and data integration, is inherently imprecise. To handle data
uncertainty, probabilistic databases have been recently developed. In this paper, we
study how to quantify the ambiguity of answers returned by a probabilistic top-k
query. We develop efficient algorithms to compute the quality of this query under
the possible world semantics. We further address the cleaning of a probabilistic
database, in order to improve top-k query quality. Cleaning involves the reduction
of ambiguity associated with the database entities. For example, the uncertainty of a
temperature value acquired from a sensor can be reduced, or cleaned, by requesting
its newest value from the sensor. While this cleaning operation may produce a
better query result, it may involve a cost and may even fail. We investigate the problem of
selecting entities to be cleaned under a limited budget. Particularly, we propose an
optimal solution and several heuristics. Experiments show that the greedy algorithm
is efficient and close to optimal.
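A plausible reading of the greedy heuristic, as a sketch (ours; the paper's quality metric and cost model are more involved):

    def greedy_clean(entities, budget):
        """Pick entities to clean: repeatedly take the entity with the
        best expected quality gain per unit cost until the budget is
        spent. `entities` maps id -> (expected_gain, cost), both
        assumed precomputed from the top-k quality measure."""
        chosen, spent = [], 0.0
        ranked = sorted(entities.items(),
                        key=lambda kv: kv[1][0] / kv[1][1],
                        reverse=True)
        for eid, (gain, cost) in ranked:
            if spent + cost <= budget:
                chosen.append(eid)
                spent += cost
        return chosen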
Top-K Oracle: A New Way to Present Top-K Tuples for Uncertain Data
Chunyao Song, Zheng Li, Tingjian Ge (University of Massachusetts, Lowell)
Managing noisy and uncertain data is needed in a great number of modern
applications. A major difficulty in managing such data is the sheer number of query
result tuples with diverse probabilities. In many cases, users have a preference over
the tuples in a deterministic world, determined by a scoring function. Yet it has been
a challenging problem to return top-k for uncertain data. Various semantics have
been proposed, and they have been shown to give wildly different tuple rankings.
In this paper, we propose a completely different approach. Instead of returning
users k tuples, which constitute merely one point in the complex distribution of top-k
tuple vectors, we provide a so-called top-k oracle and users can arbitrarily query
it. Intuitively, an oracle is a black box that, whenever given an SQL query, returns
its result. Any information we give is based on faithful, best-effort estimates of
the ground-truth top-k tuples. This is especially critical in emergency response
applications and in monitoring top-k applications. Furthermore, we are the first to
provide the nested query capability with the uncertain top-k result being a subquery.
We devise various query processing algorithms for top-k oracles, and verify their
efficiency and accuracy through a systematic evaluation over real-world and
synthetic datasets.
Seminar 1: Machine Learning on Big Data
11AM - 12:30PM
Ballroom 2
Tyson Condie, Paul Mineiro (Microsoft Research, USA) Neoklis Polyzotis (University of
California, Santa Cruz) Markus Weimer (Microsoft Research, USA)
Statistical Machine Learning has undergone a phase transition from a pure academic
endeavor to being one of the main drivers of modern commerce and science.
Even more so, recent results such as those on tera-scale learning and on very large
neural networks suggest that scale is an important ingredient in quality modeling.
This tutorial introduces current applications, techniques and systems with the aim of
cross-fertilizing research between the database and machine learning communities.
The tutorial covers current large-scale applications of Machine Learning, their computational model, and the workflow behind building them. Based on this
foundation, we present the current state-of-the-art in systems support in the bulk
of the tutorial. We also identify critical gaps in the state-of-the-art. This leads to the
closing of the seminar, where we introduce two sets of open research questions:
Better systems support for the already established use cases of Machine Learning
and support for recent advances in Machine Learning research.
Seminar 2: Big Data Integration
11AM - 12:30PM
Odeon
Xin Luna Dong, Divesh Srivastava (AT&T Labs-Research)
The Big Data era is upon us: data is being generated, collected and analyzed at
an unprecedented scale, and data-driven decision making is sweeping through all
aspects of society. Since the value of data explodes when it can be linked and
fused with other data, addressing the big data integration (BDI) challenge is critical
to realizing the promise of Big Data. BDI differs from traditional data integration in
many dimensions:
(i) the number of data sources, even for a single domain, has grown to be in the tens
of thousands,
(ii) many of the data sources are very dynamic, as huge amounts of newly collected
data are continuously made available,
(iii) the data sources are extremely heterogeneous in their structure, with
considerable variety even for substantially similar entities, and
(iv) the data sources are of widely differing qualities, with significant differences in
the coverage, accuracy and timeliness of data provided.
This seminar explores the progress that has been made by the data integration
community on the topics of schema mapping, record linkage and data fusion in
addressing these novel challenges faced by big data integration, and identifies a range
of open problems for the community.
Industry 1
11AM - 12:30PM
Chair: Stelios Paparizos (Microsoft)
Concorde
Invited Talk: Big Data Analytics at Facebook
Ravi Murthy (Facebook)
The data analytics infrastructure at Facebook has evolved rapidly over the last few
years. The amount of data managed by the platform has grown dramatically, as has the need to analyze the data in different ways. Several large-scale
systems have been developed to ingest and crunch petabytes of data, and turn
them into insights and measurements that are used to build products and services
for Facebook's 1 billion active users. This talk will cover the overall architecture and the design of specialized systems used for batch analytics, interactive real-time analysis, and graph analytics. Each of these systems is designed with unique tradeoffs, but they integrate to provide a comprehensive analytics platform. We
will discuss the challenges faced and lessons learnt while growing these systems
to unprecedented scale - 100s of petabytes, thousands of machines. We will also
present current challenges and opportunities that can help drive research and
innovation for the next generation of big data platforms.
Invited Paper: Data Services for E-tailers Leveraging Search Engine Assets
Tao Cheng, Kaushik Chakrabarti, Surajit Chaudhuri, Vivek Narasayya, Manoj Syamala
(Microsoft Research)
Retail is increasingly moving online. There are only a few big e-tailers but there is a
long tail of small-sized e-tailers. The big e-tailers are able to collect significant data
on user activities at their websites. They use these assets to derive insights about
their products and to provide superior experiences for their users. On the other
hand, small e-tailers do not possess such user data and hence cannot match the
rich user experiences offered by big e-tailers. Our key insight is that web search
engines possess significant data on user behaviors that can be used to help smaller
e-tailers mine the same signals that big e-tailers derive from their proprietary user
data assets. These signals can be exposed as data services in the cloud; e-tailers can
leverage them to enable similar user experiences as the big e-tailers. We present
three such data services in the paper: entity synonym data service, query-to-entity data service and entity tagging data service. The entity synonyms service is
an in-production data service that is currently available while the other two are
data services currently in development at Microsoft. Our experiments on product
datasets show (i) these data services have high quality and (ii) they have significant
impact on user experiences on e-tailer websites. To the best of our knowledge, this
is the first paper to explore the potential of using search engine data assets for
e-tailers.
Invited Paper: SAP HANA Distributed In-Memory Database System: Transaction,
Session, and Metadata Management
Juchang Lee, Yong Sik Kwon (SAP Labs, Korea) Franz Färber, Michael Muehle (SAP AG)
Chulwon Lee (SAP Labs, Korea) Christian Bensberg (SAP AG) Joo Yeon Lee (SAP Labs,
Korea) Arthur H. Lee (Claremont McKenna College / SAP Labs, Korea) Wolfgang Lehner
(Dresden University of Technology / SAP AG)
One of the core principles of the SAP HANA database system is the
comprehensive support of a distributed query facility. Supporting scale-out scenarios
was one of the major design principles of the system from the very beginning.
Within this paper, we first give an overview of the overall functionality with respect
to data allocation, metadata caching and query routing. We then dive into some
level of detail for specific topics and explain features and methods not common
in traditional disk-based database systems. In summary, the paper provides a comprehensive overview of distributed query processing in the SAP HANA database to
achieve scalability to handle large databases and heterogeneous types of workloads.
Research 5: Uncertainty in Spatial Data
2 - 3:30PM
Chair: Wei Wang (University of New South Wales)
Ballroom 1
Voronoi-based Nearest Neighbor Search for Multi-Dimensional Uncertain
Databases
Peiwu Zhang, Reynold Cheng, Nikos Mamoulis (The University of Hong Kong) Matthias
Renz, Andreas Züfle (Ludwig-Maximilians-Universität München) Yu Tang (The University
of Hong Kong) Tobias Emrich (Ludwig-Maximilians-Universität München)
In Voronoi-based nearest neighbor search, the Voronoi cell of every point p in a
database can be used to check whether p is the closest to some query point q. We
extend the notion of Voronoi cells to support uncertain objects, whose attribute
values are inexact. Particularly, we propose the Possible Voronoi cell (or PV-cell).
A PV-cell of a multi-dimensional uncertain object o is a region R, such that for any
point p ∈ R, o may be the nearest neighbor of p. If the PV-cells of all objects in a
database S are known, they can be used to identify objects that have a chance to be
the nearest neighbor of q. However, there is no efficient algorithm for computing
an exact PV-cell. We hence study how to derive an axis-parallel hyper-rectangle
(called the Uncertain Bounding Rectangle, or UBR) that tightly contains a PV-cell. We
further develop the PV-index, a structure that stores UBRs, to evaluate probabilistic
nearest neighbor queries over uncertain data. An advantage of the PV-index is that upon updates on S, it can be incrementally updated. Extensive experiments on both synthetic and real datasets are carried out to validate the performance of the PV-index.
Interval Reverse Nearest Neighbor Queries on Uncertain Data with Markov
Correlations
Chuanfei Xu, Yu Gu (Northeastern University) Lei Chen (Hong Kong University of Science
and Technology) Jianzhong Qiao, Ge Yu (Northeastern University)
Nowadays, many applications return to the user a set of results that take the query
as their nearest neighbor, which are commonly expressed through reverse nearest
neighbor (RNN) queries. When considering moving objects, users would like to find
objects that appear in the RNN result set for a period of time in some real-world
applications such as collaboration recommendation and anti-tracking. In this work,
we formally define the problem of interval reverse nearest neighbor (IRNN) queries
over moving objects, which return the objects that maintain nearest neighboring
relations to the moving query object for the longest time in the given interval.
Location uncertainty of moving data objects and moving query objects is inherent
in various domains, and we investigate objects that exhibit Markov correlations,
that is, each object's location is only correlated with its own location at the previous timestamp while being independent of other objects. Answering IRNN queries on uncertain moving objects with Markov correlations poses an efficiency challenge, since we have to retrieve not only all the possible locations of each object at the current time but also its historically possible locations. To speed up the
query processing, we present a general framework for answering IRNN queries on
uncertain moving objects with Markov correlations in two phases. In the first phase,
we apply space pruning and probability pruning techniques, which reduce the search
space significantly. In the second phase, we verify whether each unpruned object is
an IRNN of the query object. During this phase, we propose an approach termed the Probability Decomposition Verification (PDV) algorithm, which avoids computing the probability of any object being an RNN of the query object exactly and thus improves the efficiency of verification. The performance of the proposed algorithm is demonstrated by extensive experiments on synthetic and real datasets, and the experimental results show that our algorithm is more efficient than the Monte Carlo based approximate algorithm.
Efficient Tracking and Querying for Coordinated Uncertain Mobile Objects
Nicholas D. Larusso, Ambuj Singh (University of California Santa Barbara)
Accurately estimating the current positions of moving objects is a challenging task
due to the various forms of data uncertainty (e.g. limited sensor precision, periodic
updates from continuously moving objects). However, in many cases, groups of
objects tend to exhibit similarities in their movement behavior. For example,
vehicles in a convoy or animals in a herd both exhibit tightly coupled movement
behavior within the group. While such statistical dependencies often increase the
computational complexity necessary for capturing this additional structure, they
also provide useful information which can be utilized to provide more accurate
location estimates. In this paper, we propose a novel model for accurately tracking
coordinated groups of mobile uncertain objects. We introduce an exact and more
efficient approximate inference algorithm for updating the current location of
each object upon the arrival of new (uncertain) location observations. Additionally,
we derive probability bounds over the groups in order to process probabilistic
threshold range queries more efficiently. Our experimental evaluation shows that
our proposed model can provide 4X improvements in tracking accuracy over
competing models which do not consider group behavior. We also show that our
bounds enable us to prune up to 50% of the database, resulting in more efficient
processing over a linear scan.
Research 6: Data Extraction
2 - 3:30PM
Chair: Luna Dong (Google Inc.)
St Germaine
Attribute Extraction and Scoring: A Probabilistic Approach
Taesung Lee (Pohang University of Science and Technology / Microsoft Research Asia)
Zhongyuan Wang (Renmin University of China / Microsoft Research Asia) Haixun
Wang (Microsoft Research Asia) Seung-won Hwang (Pohang University of Science and
Technology)
Knowledge bases, which consist of concepts, entities, attributes and relations, are
increasingly important in a wide range of applications. We argue that knowledge
about attributes (of concepts or entities) plays a critical role in inferencing. In this
paper, we propose methods to derive attributes for millions of concepts and we
quantify the typicality of the attributes with regard to their corresponding concepts.
We employ multiple data sources such as web documents, search logs, and existing
knowledge bases, and we derive typicality scores for attributes by aggregating
different distributions derived from different sources using different methods. To the
best of our knowledge, ours is the first approach to integrate concept- and instance-based patterns into probabilistic typicality scores that scale to a broad concept
space. We have conducted extensive experiments to show the effectiveness of our
approach.
TYPifier: Inferring the Type Semantics of Structured Data
Yongtao Ma, Thanh Tran (Karlsruhe Institute of Technology) Veli Bicer (IBM Research
Ireland)
Structured data representing entity descriptions often lacks precise type information.
That is, it is not known to which type an entity belongs, or the type is too
general to be useful. In this work, we propose to deal with this novel problem of
inferring the type semantics of structured data, called typification. We formulate it
as a clustering problem and discuss the features needed to obtain several solutions
based on existing clustering solutions. Because schema features perform best, but
are not abundantly available, we propose an approach to automatically derive them
from data. Optimized for the use of schema features, we present TYPifier, a novel
clustering algorithm that, in experiments, yields better typification results than the
baseline clustering solutions. For entity resolution, which represents one of the
possible use cases, we show that the inferred type information helps to produce
better results.
SUSIE: Search Using Services and Information Extraction
Nicoleta Preda (University of Versailles) Fabian Suchanek (Max Planck Institute for
Informatics) Wenjun Yuan (University of Versailles / The University of Hong Kong) Gerhard
Weikum (Max Planck Institute for Informatics)
The API of a Web service restricts the types of queries that the service can answer.
For example, a Web service might provide a method that returns the songs of a
given singer, but it might not provide a method that returns the singers of a given
song. If the user asks for the singer of some specific song, then the Web service
cannot be called -- even though the underlying database might have the desired
piece of information. This asymmetry is an inherent limitation for systems that aim to
use Web services for service orchestration, query answering, or ontology extension.
In this paper, we propose to use on-the-fly information extraction to collect values
that can be used as parameter bindings for the Web service. As a result, a Web
service that returns songs can be used “backwards” to also find the singers. Our
approach is fully implemented in a prototype called SUSIE. We present experiments
with real-life data and services to demonstrate the practical viability and good
performance of our approach.
Research 7: Trajectory Databases
2 - 3:30PM
Chair: Yin Yang (Advanced Digital Sciences Center)
Bastille 1
Towards Efficient Search for Activity Trajectories
Kai Zheng (The University of Queensland) Shuo Shang (Aalborg University) Nicholas Jing
Yuan (Microsoft Research Asia) Yi Yang (Carnegie Mellon University)
The advances in location positioning and wireless communication technologies have
led to a myriad of spatial trajectories representing the mobility of a variety of moving
objects. While processing trajectory data with the focus of spatio-temporal features
has been widely studied in the last decade, recent proliferation in location-based
web applications (e.g., Foursquare, Facebook) has given rise to large amounts of
trajectories associated with activity information, called activity trajectories. In this paper,
we study the problem of efficient similarity search on activity trajectory database.
Given a sequence of query locations, each associated with a set of desired activities,
an activity trajectory similarity query (ATSQ) returns k trajectories that cover the
query activities and yield the shortest minimum match distance. An order-sensitive
activity trajectory similarity query (OATSQ) is also proposed to take into account
the order of the query locations. To process the queries efficiently, we firstly develop
a novel hybrid grid index, GAT, to organize the trajectory segments and activities
hierarchically, which enables us to prune the search space by location proximity and
activity containment simultaneously. In addition, we propose algorithms for efficient
computation of the minimum match distance and minimum order-sensitive match
distance, respectively. The results of our extensive empirical studies based on real
online check-in datasets demonstrate that our proposed index and methods are
capable of achieving superior performance and good scalability.
On Discovery of Gathering Patterns from Trajectories
Kai Zheng (The University of Queensland) Yu Zheng, Nicholas Jing Yuan (Microsoft
Research Asia) Shuo Shang (Aalborg University)
The increasing pervasiveness of location-acquisition technologies has enabled
collection of huge amount of trajectories for almost any kind of moving objects.
Discovering useful patterns from their movement behaviours can convey valuable
knowledge to a variety of critical applications. In this light, we propose a novel
concept, called gathering, which is a trajectory pattern modelling various group
incidents such as celebrations, parades, protests, traffic jams and so on. A key
observation is that these incidents typically involve large congregations of individuals,
which form durable and stable areas with high density. Since the process of
discovering gathering patterns over large-scale trajectory databases can be quite
lengthy, we further develop a set of well thought out techniques to improve the
performance. These techniques, including effective indexing structures, fast pattern
detection algorithms implemented with bit vectors, and incremental algorithms for
handling new trajectory arrivals, collectively constitute an efficient solution for this
challenging task. Finally, the effectiveness of the proposed concepts and the efficiency
of the approaches are validated by extensive experiments based on a real taxicab
trajectory dataset.
Destination Prediction by Sub-Trajectory Synthesis and Privacy Protection Against
Such Prediction
Andy Yuan Xue, Rui Zhang (The University of Melbourne) Yu Zheng, Xing Xie (Microsoft
Research Asia) Jin Huang, Zhenghua Xu (The University of Melbourne)
Destination prediction is an essential task for many emerging location based
applications such as recommending sightseeing places and targeted advertising
based on destination. A common approach to destination prediction is to derive
the probability of a location being the destination based on historical trajectories.
However, existing techniques using this approach suffer from the “data sparsity problem”, i.e., the available historical trajectories are far from being able to cover
all possible trajectories. This problem considerably limits the number of query
trajectories that can obtain predicted destinations. We propose a novel method
named Sub-Trajectory Synthesis (SubSyn) to address the data sparsity problem. The SubSyn algorithm first decomposes historical trajectories into sub-trajectories comprising two neighbouring locations, and then connects the sub-trajectories into “synthesised” trajectories. The number of query trajectories
that can have predicted destinations is exponentially increased by this means.
Experiments based on real datasets show that SubSyn algorithm can predict
destinations for up to ten times more query trajectories than a baseline algorithm
while the SubSyn prediction algorithm runs over two orders of magnitude faster
than the baseline algorithm. In this paper, we also consider the privacy protection
issue in case an adversary uses SubSyn algorithm to derive sensitive location
information of users. We propose an efficient algorithm to select a minimum
number of locations a user has to hide on her trajectory in order to avoid a privacy leak. Experiments also validate the high efficiency of the privacy protection algorithm.
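The flavor of the synthesis step can be sketched as follows (our simplification: trajectories are sequences of grid cells and we fit a first-order Markov model; the actual SubSyn prediction reasons over destinations in a Bayesian fashion):

    from collections import defaultdict

    def train_transitions(trajectories):
        """Decompose historical trajectories into sub-trajectories of
        two neighbouring cells and estimate transition probabilities."""
        counts = defaultdict(lambda: defaultdict(int))
        for traj in trajectories:
            for a, b in zip(traj, traj[1:]):
                counts[a][b] += 1
        probs = {}
        for a, nbrs in counts.items():
            total = sum(nbrs.values())
            probs[a] = {b: n / total for b, n in nbrs.items()}
        return probs

    def path_probability(probs, path):
        """Probability of a query path under the synthesised model:
        the product of learned cell-to-cell transition probabilities."""
        p = 1.0
        for a, b in zip(path, path[1:]):
            p *= probs.get(a, {}).get(b, 0.0)
        return p

Because any path can be scored by chaining two-cell transitions, coverage no longer depends on having seen the whole trajectory before.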
Research 8: Social Networks
2 - 3:30PM
Chair: Yuanyuan Tian (IBM Almaden)
Bastille 2
Scalable and Parallelizable Processing of Influence Maximization for Large-Scale
Social Networks
Jinha Kim, Seung-Keol Kim, Hwanjo Yu (Pohang University of Science and Technology)
As social network services connect people across the world, influence maximization,
i.e., finding the most influential nodes (or individuals) in the network, is being actively
researched with applications to viral marketing. One crucial challenge in scalable
influence maximization processing is evaluating influence, which is #P-hard and thus
hard to solve in polynomial time. We propose a scalable influence approximation
algorithm, Independent Path Algorithm (IPA) for Independent Cascade (IC) diffusion
model. IPA efficiently approximates influence by considering an independent
influence path as an influence evaluation unit. IPA is also easily parallelized by simply adding a few lines of OpenMP meta-programming expressions. Also, the overhead of maintaining influence paths in memory is relieved by safely throwing away insignificant influence paths. Extensive experiments conducted on large-scale real social networks show that IPA is an order of magnitude faster and uses less memory than the state-of-the-art algorithms. Our experimental results also show that parallel versions of IPA speed up further as the number of CPU cores increases, and more speed-up is achieved for larger datasets. The algorithms have
been implemented in our demo application for influence maximization (available at
http://dm.postech.ac.kr/ipa demo), which efficiently finds the most influential nodes
in a social network.
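The core approximation can be sketched like this (our rendering: influence is the sum of probabilities of independent influence paths, with insignificant paths pruned by a threshold, mirroring the paper's path-discarding step):

    def ipa_influence(graph, seed, theta=0.001):
        """Approximate the IC-model influence of `seed`. `graph[u]`
        maps each neighbor v to the propagation probability p(u, v)."""
        influence = 0.0
        stack = [(seed, 1.0, frozenset([seed]))]
        while stack:
            u, p, seen = stack.pop()
            for v, puv in graph.get(u, {}).items():
                if v in seen:
                    continue  # keep influence paths simple
                q = p * puv
                if q >= theta:  # prune insignificant influence paths
                    influence += q
                    stack.append((v, q, seen | {v}))
        return 1.0 + influence  # the seed itself is always active

Treating paths as independent is exactly the approximation that buys speed, and the threshold theta bounds the memory spent on maintaining paths.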
SociaLite: Datalog Extensions for Efficient Social Network Analysis
Jiwon Seo, Stephen Guo, Monica S. Lam (Stanford University)
With the rise of social networks, large-scale graph analysis becomes increasingly
important. Because SQL lacks the expressiveness and performance needed for
graph algorithms, lower-level, general-purpose languages are often used instead.
For greater ease of use and efficiency, we propose SociaLite, a high-level graph
query language based on Datalog. As a logic programming language, Datalog allows
many graph algorithms to be expressed succinctly. However, its performance has
not been competitive when compared to low-level languages. With SociaLite, users
can provide high-level hints on the data layout and evaluation order; they can also
define recursive aggregate functions which, as long as they are meet operations, can
be evaluated incrementally and efficiently. We evaluated SociaLite by running eight
graph algorithms (shortest paths, PageRank, hubs and authorities, mutual neighbors,
connected components, triangles, clustering coefficients, and betweenness centrality)
on two real-life social graphs, LiveJournal and Last.fm. The optimizations proposed
in this paper speed up almost all the algorithms by 3 to 22 times. SociaLite even
outperforms typical Java implementations by an average of 50% for the graph
algorithms tested. When compared to highly optimized Java implementations,
SociaLite programs are an order of magnitude more succinct and easier to write.
Its performance is competitive, giving up only 16% for the largest benchmark. Most
importantly, being a query language, SociaLite enables many more users who are
not proficient in software engineering to make social network queries easily and
efficiently.
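To see why meet aggregates admit incremental evaluation, consider single-source shortest paths, which in SociaLite style is a recursive rule with a min aggregate; a Python analogue of the incremental, best-first evaluation might look like this (the rule is paraphrased and the code is our sketch, not the system itself):

    # Path(t, d) :- t = source, d = 0.
    # Path(t, d) :- Path(s, d1), Edge(s, t, w), d = min(d1 + w).

    import heapq

    def shortest_paths(edges, source):
        """`edges[s]` is a list of (t, w) pairs. Because min is a meet
        operation, better facts simply replace worse ones and can be
        processed best-first instead of re-deriving everything."""
        dist = {source: 0}
        frontier = [(0, source)]
        while frontier:
            d, s = heapq.heappop(frontier)
            if d > dist.get(s, float("inf")):
                continue  # a better fact already superseded this one
            for t, w in edges.get(s, []):
                nd = d + w
                if nd < dist.get(t, float("inf")):
                    dist[t] = nd
                    heapq.heappush(frontier, (nd, t))
        return dist

Evaluated this way, the recursive rule behaves like Dijkstra's algorithm rather than naive fixpoint iteration.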
LinkProbe: Probabilistic Inference on Large-Scale Social Networks
Haiquan Chen (Valdosta State University) Wei-Shinn Ku (Auburn University) Haixun
Wang (Microsoft Research Asia) Liang Tang (Auburn University) Min-Te Sun (National
Central University, Taiwan)
As one of the most important Semantic Web applications, social network analysis
has attracted more and more interests from researchers due to the rapidly
increasing availability of massive social network data. A desired solution for social
network analysis should address the following issues. First, in many real world
applications, inference rules are partially correct. An ideal solution should be able
to handle partially correct rules. Second, applications in practice often involve large
amounts of data. The inference mechanism should scale up towards large-scale
data. Third, the inference method should take probabilistic evidence data into account because these domains evidently abound with uncertainty. Various solutions
for social network analysis have been around for quite a few years; however,
none of them support all the aforementioned features. In this paper, we design
and implement LinkProbe, a prototype to quantitatively predict existence of links
among nodes in large-scale social networks, which is empowered by Markov Logic
Networks (MLNs). MLN has been proved to be an effective inference model which
can handle complex dependencies and partially correct rules. More importantly,
although MLN has shown acceptable performance in prior works, MLN is also
reported as impractical in handling large-scale data due to its highly demanding
nature in terms of inference time and memory consumption. In order to overcome
these limitations, LinkProbe retrieves the k-backbone graphs and conducts the
MLN inference on both the most globally influencing nodes and most locally related
nodes. Our extensive experiments show that LinkProbe manages to provide a
tunable balance between MLN inference accuracy and inference efficiency.
Seminar 3: Workload Management for Big Data Analytics
2 - 3:30PM
Odeon
Ashraf Aboulnaga (University of Waterloo) Shivnath Babu (Duke University)
Parallel database systems and MapReduce systems (most notably Hadoop) are
essential components of today’s infrastructure for Big Data analytics. These systems
process multiple concurrent workloads consisting of complex user requests, where
each request is associated with an (explicit or implicit) service level objective. For
example, the workload of a particular user or application may have a higher priority
than other workloads. Or a particular workload may have strict deadlines for the
completion of its requests. The research area of Workload Management focuses on
ensuring that the system meets the service level objectives of various requests while
at the same time minimizing the resources required to achieve this goal. At a high
level, workload management can be viewed as looking beyond the performance
of an individual request to the performance of an entire workload consisting
of multiple requests. This tutorial will discuss the fundamentals of workload
management, and present tools and techniques for workload management in parallel
databases and MapReduce.
Industry 2
2 - 3:30PM
Chair: Hakan Hacıgümüş (NEC Laboratories America)
Concorde
Invited Paper: HFMS: Managing the Lifecycle and Complexity of Hybrid Analytic
Data Flows
Alkis Simitsis, Kevin Wilkinson, Umeshwar Dayal, Meichun Hsu (Hewlett-Packard
Laboratories)
To remain competitive, enterprises are evolving their business intelligence systems
to provide dynamic, near real-time views of business activities. To enable this, they
deploy complex workflows of analytic data flows that access multiple storage
repositories and execution engines and that span the enterprise and even reach outside
the enterprise. We call these multi-engine flows hybrid flows. Designing and
optimizing hybrid flows is a challenging task. Managing a workload of hybrid flows
is even more challenging since their execution engines are likely under different
administrative domains and there is no single point of control. To address these
needs, we present a Hybrid Flow Management System (HFMS). It is an independent
software layer over a number of independent execution engines and storage
repositories. It simplifies the design of analytic data flows and includes optimization
and executor modules to produce optimized executable flows that can run across
multiple execution engines. HFMS dispatches flows for execution and monitors
their progress. To meet service level objectives for a workload, it may dynamically
change a flow’s execution plan to avoid processing bottlenecks in the computing
infrastructure. We present the architecture of HFMS and describe its components.
To demonstrate its potential benefit, we describe performance results for running
sample batch workloads with and without HFMS. The ability to monitor multiple
execution engines and to dynamically adjust plans enables HFMS to provide better
service guarantees and better system utilization.
Invited Paper: KuaFu: Closing the Parallelism Gap in Database Replication
Chuntao Hong (Microsoft Research Asia) Dong Zhou (Tsinghua University) Mao Yang
(Microsoft Research Asia) Carbo Kuo (Tsinghua University) Lintao Zhang, Lidong Zhou
(Microsoft Research Asia)
Database systems are nowadays increasingly deployed on multi-core commodity
servers, with replication to guard against failures. On one hand, a database engine
is best designed to scale with the number of cores and to offer a high degree
of parallelism on a modern multi-core architecture. On the other hand, replication
traditionally resorts to a certain form of serialization for data consistency among
replicas. In the widely used primary/backup replication with log shipping, concurrent
executions on the primary and the serialized log replay on a backup creates a
serious parallelism gap. Our experiments with MySQL, a popular open-source
database system, show that on a 16-core configuration the serial replay on a backup can sustain less than one third of the throughput achievable on the primary under a TPC-C-like OLTP workload. This paper proposes KuaFu to close the
parallelism gap on replicated database systems by enabling concurrent replay
of transactions on a backup. KuaFu maintains write consistency on backups by
tracking transaction dependencies. Concurrent replay on a backup does introduce
read inconsistency between the primary and a backup. KuaFu further leverages
multi-version concurrency control to produce snapshots in order to restore the consistency semantics. We have implemented KuaFu with MySQL; our evaluations
show that KuaFu allows a backup to keep up with the primary while preserving
replication consistency.
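The dependency tracking that enables concurrent replay can be sketched as follows (a minimal sketch assuming only write-write dependencies over a commit-ordered log; the system's actual tracking is richer):

    def build_dependencies(log):
        """`log` is a list of (txn_id, write_set) in commit order. A
        transaction depends on the most recent earlier writer of each
        key it writes; transactions with no path between them in this
        graph can be replayed concurrently on the backup."""
        last_writer, deps = {}, {}
        for txn_id, write_set in log:
            deps[txn_id] = {last_writer[k]
                            for k in write_set if k in last_writer}
            for k in write_set:
                last_writer[k] = txn_id
        return deps

A scheduler then replays a transaction as soon as all of its dependencies have committed on the backup.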
Materialization Strategies in the Vertica Analytic Database: Lessons Learned
Lakshmikant Shrinivas, Sreenath Bodagala, Ramakrishna Varadarajan, Ariel Cary, Vivek
Bharathan, Chuck Bear (Vertica Systems, an HP Company)
Column store databases allow for various tuple reconstruction strategies (also called
materialization strategies). Early materialization is easy to implement but generally
performs worse than late materialization. Late materialization is more complex to
implement, and usually performs much better than early materialization, although
there are situations where it is worse. We identify these situations, which essentially
revolve around joins where neither input fits in memory (also called spilling joins).
Sideways information passing techniques provide a viable solution to get the best of
both worlds. We demonstrate how early materialization combined with sideways
information passing allows us to get the benefits of late materialization, without the
bookkeeping complexity or worse performance for spilling joins. It also provides
some other benefits to query processing in Vertica due to positive interaction with
compression and sort orders of the data. In this paper, we report our experiences
with late and early materialization, highlight their strengths and weaknesses,
and present the details of our sideways information passing implementation. We
show experimental results of comparing these materialization strategies, which
highlight the significant performance improvements provided by our implementation
of sideways information passing (up to 72% on some TPC-H queries).
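The essence of the technique, reduced to a toy hash join (our sketch; Vertica's implementation pushes the filter into the scan operator itself, for instance as a compact key set or similar structure):

    def sip_hash_join(build_rows, probe_rows, key):
        """Early materialization plus sideways information passing:
        build the hash table, then use its key set to drop probe-side
        tuples before they reach (and possibly spill) the join."""
        build = {}
        for row in build_rows:
            build.setdefault(row[key], []).append(row)
        keys = set(build)  # the information passed "sideways"
        for row in probe_rows:
            if row[key] not in keys:
                continue  # filtered before the join
            for match in build[row[key]]:
                yield {**match, **row}

Filtering early keeps the benefit of late materialization (few wasted tuple reconstructions) without its bookkeeping, which is the paper's point.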
Demo Groups 1 & 2
2 - 3:30PM
Ballroom 2
Twitter+: Build Personalized Newspaper For Twitter
Chen Liu, Anthony K. H. Tung (National University of Singapore)
Nowadays, microblogging services such as Twitter play important roles in people's everyday lives. They enable users to publish and read text-based posts, known as “tweets”, and to interact with each other through re-tweeting or commenting. In the literature, many efforts have been devoted to exploiting the social property of Twitter. However, beyond the social component, Twitter itself has become an indispensable source for users to acquire useful information. To maximize its value, we expect to pay more attention to the media property of Twitter. Good media should, first of all, present its news effectively so that users can read it easily. Currently, all tweets from followed accounts are presented to users, usually organized by publication timeline or source. Such limited dimensions for presenting tweets hinder users from conveniently finding the information that interests them. In this demo, we present “Twitter+”, which aims to enrich users' reading experiences in Twitter by providing multiple ways to explore tweets, such as keyword presentation and topic finding. It offers users an alternative interface to browse tweets more effectively.
A Generic Database Benchmarking Service
Martin Kaufmann (ETH Zürich / SAP AG) Peter M. Fischer (Albert-Ludwigs-Universität
Freiburg) Donald Kossmann (ETH Zürich) Norman May (SAP AG)
Benchmarks are widely applied for the development and optimization of database
systems. Standard benchmarks such as TPC-C and TPC-H provide a way of
comparing the performance of different systems. In addition, micro benchmarks can
be exploited to test a specific behavior of a system. Yet, despite all the benefits that
can be derived from benchmark results, the effort of implementing and executing
benchmarks remains prohibitive: Database systems need to be set up, a large
number of artifacts such as data generators and queries need to be managed and
complex, time-consuming operations have to be orchestrated. In this demo, we
introduce a generic benchmarking service that combines a rich meta model, low
marginal cost and ease of use, which drastically reduces the time and cost to define,
adapt and run a benchmark.
Aeolus: An Optimizer for Distributed Intra-Node-Parallel Streaming Systems
Matthias J. Sax (Humboldt-Universität zu Berlin) Malu Castellanos, Qiming Chen,
Meichun Hsu (Hewlett-Packard Laboratories)
Aeolus is a prototype implementation of a topology optimizer on top of the
distributed streaming system Storm. Aeolus extends Storm with a batching
layer which can increase the topology’s throughput by more than one order
of magnitude. Furthermore, Aeolus implements an optimization algorithm that
computes the optimal batch size and degree of parallelism for each node in the
topology automatically. Even though Aeolus is built on top of Storm, the developed concepts are not limited to Storm and can be applied to any distributed intra-node-parallel streaming system. We propose to demo Aeolus using an interactive
Web UI. One part of the Web UI is a topology builder allowing the user to interact
with the system. Topologies can be created from scratch and their structure and/or
parameters can be modified. Furthermore, the user is able to observe the impact
of the changes on the optimization decisions and runtime behavior. Additionally, the
Web UI gives a deep insight in the optimization process by visualizing it. The user
can interactively step through the optimization process while the UI shows the
optimizer’s state, computations, and decisions. The Web UI is also able to monitor
the execution of a non-optimized and optimized topology simultaneously showing
the advantage of using Aeolus.
Crowd-Answering System via Microblogging
Xianke Zhou, Ke Chen, Sai Wu, Bingbing Zhang (Zhejiang University)
Most crowdsourcing systems leverage public platforms, such as Amazon
Mechanical Turk (AMT), to publish their jobs and collect the results. They are
charged for using the platform’s service and they are also required to pay the
workers for each successful job. Although the average wage of the online human
worker is not high, for a 24x7 running service, the crowdsourcing system becomes
very expensive to maintain. We observe that there are, in fact, many sources that
can provide free online human volunteers. Microblogging systems are among the most promising of these resources. In this paper, we present our CrowdAnswer
system, which is built on top of Weibo, the largest microblogging system in China.
CrowdAnswer is a question-answering system, which distributes various questions
to different groups of microblogging users adaptively. The answers are then collected
from those users’ tweets and visualized for the question originator. CrowdAnswer
maintains a virtual credit system. The users need credits to publish questions and
they can gain credits by answering the questions. A novel algorithm is proposed to
route the questions to the interested users, which tries to maximize the probability
of successfully answering a question.
With a Little Help from My Friends
Arnab Nandi (The Ohio State University) Stelios Paparizos, John C. Shafer, Rakesh Agrawal
(Microsoft Research)
A typical person has numerous online friends that, according to studies, the person
often consults for opinions and advice. However, publicly broadcasting a question to all friends risks social capital when repeated too often, is not tolerant of topic
sensitivity, and can result in no response, as the message is lost in a myriad of status
updates. Direct messaging is more personal and avoids these pitfalls, but requires
manual selection of friends to contact, which can be time-consuming and challenging.
A user may have difficulty guessing which of their numerous online friends can
provide a high quality and timely response. We demonstrate a working system
that addresses these issues by returning an ordered subset of friends predicting (a)
near-term availability, (b) willingness to respond and (c) topical knowledge, given a
query. The combination of these three aspects is unique to our solution, and all are
critical to the problem of obtaining timely and relevant responses. Our system acts
as a decision aid -- we give insight into why each friend was recommended and let
the user decide whom to contact.
Peeking into the Optimization of Data Flow Programs with MapReduce-style
UDFs
Fabian Hueske (Technische Universität Berlin) Mathias Peters (Humboldt Universität zu
Berlin) Aljoscha Krettek, Matthias Ringwald, Kostas Tzoumas, Volker Markl (Technische
Universität Berlin) Johann-Christoph Freytag (Humboldt Universität zu Berlin)
Data flows are a popular abstraction to define data-intensive processing tasks. In
order to support a wide range of use cases, many data processing systems feature
MapReduce-style user-defined functions (UDFs). In contrast to UDFs as known
from relational DBMS, MapReduce-style UDFs have less strict templates. These
templates do not alone provide all the information needed to decide whether they
can be reordered with relational operators and other UDFs. However, it is well-known that reordering operators such as filters, joins, and aggregations can yield
runtime improvements by orders of magnitude. We demonstrate an optimizer for
data flows that is able to reorder operators with MapReduce-style UDFs written
in an imperative language. Our approach leverages static code analysis to extract
information from UDFs which is used to reason about the reorderbility of UDF
operators. This information is sufficient to enumerate a large fraction of the search
space covered by conventional RDBMS optimizers including filter and aggregation
push-down, bushy join orders, and choice of physical execution strategies based on
interesting properties. We demonstrate our optimizer and a job submission client
that allows users to peek step-by-step into each phase of the optimization process:
the static code analysis of UDFs, the enumeration of reordered candidate data
flows, the generation of physical execution plans, and their parallel execution. For
the demonstration, we provide a selection of relational and non-relational data flow
programs which highlight the salient features of our approach.
Very Fast Estimation for Result and Accuracy of Big Data Analytics: the EARL
System
Nikolay Laptev, Kai Zeng, Carlo Zaniolo (University of California, Los Angeles)
Approximate results based on samples often provide the only way in which
advanced analytical applications on very massive data sets (a.k.a. ‘big data’) can
satisfy their time and resource constraints. Unfortunately, methods and tools
for the computation of accurate early results are currently not supported in big
data systems (e.g., Hadoop). Therefore, we propose a nonparametric accuracy
estimation method and system to speed-up big data analytics. Our framework is
called EARL (Early Accurate Result Library) and it works by predicting the learning
curve and choosing the appropriate sample size for achieving the desired error
bound specified by the user. The error estimates are based on a technique called
bootstrapping that has been widely used and validated by statisticians, and can
be applied to arbitrary functions and data distributions. Therefore, this demo will
elucidate (a) the functionality of EARL and its intuitive GUI interface whereby
first-time users can appreciate the accuracy obtainable from increasing sample sizes
by simply viewing the learning curve displayed by EARL, (b) the usability of EARL,
whereby conference participants can interact with the system to quickly estimate
the sample sizes needed to obtain the desired accuracies or response times, and
then compare them against the accuracies and response times obtained in the actual
computations.
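The statistical core is the bootstrap; a compact sketch of how an error bar for an approximate answer might be computed (ours, not EARL's code):

    import random
    import statistics

    def bootstrap_error(sample, estimator, resamples=200):
        """Resample with replacement, re-run the estimator, and use the
        spread of the estimates as the error of the approximate answer."""
        estimates = []
        for _ in range(resamples):
            boot = [random.choice(sample) for _ in sample]
            estimates.append(estimator(boot))
        return statistics.mean(estimates), statistics.stdev(estimates)

    # EARL-style loop: grow the sample until the estimated error falls
    # below the user's bound, e.g.:
    # mean, err = bootstrap_error(sample, statistics.mean)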
Road Network Mix-zones for Anonymous Location Based Services
Balaji Palanisamy, Sindhuja Ravichandran, Ling Liu, Binh Han, Kisung Lee, Calton Pu
(Georgia Institute of Technology)
We present MobiMix, a road network based mix-zone framework to protect
location privacy of mobile users traveling on road networks. An alternative and
complementary approach to spatial cloaking based location privacy protection is
to break the continuity of location exposure by introducing techniques, such as
mix-zones, where no applications can trace user movements. However, existing
mix-zone proposals fail to provide effective mix-zone construction and placement
algorithms that are resilient to timing and transition attacks. In MobiMix, mix-zones
are constructed and placed by carefully taking into consideration multiple factors,
such as the geometry of the zones, the statistical behavior of the user population,
the spatial constraints on movement patterns of the users, and the temporal
and spatial resolution of the location exposure. In this demonstration, we first
introduce a visualization of the location privacy risks of mobile users traveling on
road networks and show how mix-zone based anonymization breaks the continuity
of location exposure to protect user location privacy. We demonstrate a suite of
road network mix-zone construction and placement methods that provide higher
level of resilience to timing and transition attacks on road networks. We show the
effectiveness of the MobiMix approach through detailed visualization using traces
produced by GTMobiSim on different scales of geographic maps.
Query Time Scaling of Attribute Values in Interval Timestamped Databases
Anton Dignös, Michael Böhlen (University of Zurich) Johann Gamper (Free University of
Bozen-Bolzano)
In valid-time databases with interval timestamping each tuple is associated with
a time interval over which the recorded fact is true in the modeled reality. The
adjustment of these intervals is an essential part of processing interval timestamped
data. Some attribute values remain valid if the associated interval changes, whereas
others have to be scaled along with the time interval. For example, attributes that
record total (cumulative) quantities over time, such as project budgets, total sales or
total costs, often must be scaled if the timestamp is adjusted. The goal of this demo
is to show how to support the scaling of attribute values in SQL at query time.
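For intuition, scaling a cumulative attribute to an adjusted interval amounts to prorating by overlap; a tiny sketch (ours) with half-open [start, end) intervals:

    def scale(value, stored, requested):
        """Prorate a cumulative attribute (e.g. a budget) recorded over
        `stored` = (start, end) to the overlapping part of `requested`."""
        s1, e1 = stored
        s2, e2 = requested
        overlap = max(0, min(e1, e2) - max(s1, s2))
        return value * overlap / (e1 - s1)

    # scale(1200, (0, 12), (0, 3)) -> 300.0: a quarter of the year,
    # a quarter of the annual budget.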
Extracting Interesting Related Context-dependent Concepts from Social Media
Streams using Temporal Distributions
Craig P. Sayers, Meichun Hsu (Hewlett-Packard Labs)
To enable the interactive exploration of large social media datasets we exploit the
temporal distributions of word n-grams within the message stream to discover
“interesting” concepts, determine “relatedness” between concepts, and find
representative examples for display. We present a new algorithm for context-dependent “interestingness” using the coefficient of variation of the temporal
distribution, apply the well-known technique of Pearson’s Correlation to tweets
using equi-height histogramming to determine correlation, and employ an
asymmetric variant for computing “relatedness” to encourage exploration. We
further introduce techniques using interestingness, correlation, and relatedness to
automatically discover concepts and select preferred word N-grams for display.
These techniques are demonstrated on an 800,000 tweet dataset from the
Academy Awards.
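To give a concrete sense of the interestingness measure, here is a minimal Python sketch (our simplification; the paper's measure is context-dependent and more involved) computing the coefficient of variation of a term's temporal distribution:

import statistics

def interestingness(hourly_counts):
    """Coefficient of variation (stdev / mean) of a term's temporal counts.

    A bursty term (e.g., a winner's name during one award) varies strongly
    over time and scores high; a uniformly common word scores low.
    """
    mean = statistics.mean(hourly_counts)
    if mean == 0:
        return 0.0
    return statistics.stdev(hourly_counts) / mean

bursty = [0, 0, 50, 3, 0, 0]     # spikes once during the ceremony
common = [9, 10, 11, 10, 9, 11]  # roughly uniform background chatter
print(interestingness(bursty) > interestingness(common))  # True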
VERDICT: Privacy-Preserving Authentication of Range Queries in Location-based
Services
Haibo Hu, Qian Chen, Jianliang Xu (Hong Kong Baptist University)
We demonstrate VERDICT, a location-based range query service featuring the
privacy-preserving authentication capability. VERDICT adopts the common
data-as-a-service (DaaS) model, which consists of the data owner (a location registry
or a mobile operator) who provides the querying data, the service provider who
executes the query, and the querying users. The system features a privacy-preserving
query authentication module that enables the user to verify the correctness of
results while still protecting the data privacy. This feature is crucial in many
location-based services where the querying data are user locations. To achieve this, VERDICT
employs an MR-tree based privacy-preserving authentication scheme proposed in
our earlier work. The use case study shows that VERDICT provides an efficient and
smooth user experience for authenticating location-based range queries.
ExpFinder: Finding Experts by Graph Pattern Matching
Wenfei Fan (The University of Edinburgh / Beihang University) Xin Wang (The University
of Edinburgh) Yinghui Wu (University of California Santa Barbara)
We present ExpFinder, a system for finding experts in social networks based on
graph pattern matching. We demonstrate (1) how ExpFinder identifies top-K
experts in a social network by supporting bounded simulation of graph patterns,
and by ranking the matches based on a metric for social impact; (2) how it copes
with the sheer size of real-life social graphs by supporting incremental query
evaluation and query preserving graph compression, and (3) how the GUI of
ExpFinder interacts with users to help them construct queries and inspect matches.
Tajo: A Distributed Data Warehouse System on Large Clusters
Hyunsik Choi, Jihoon Son, Haemi Yang, Hyoseok Ryu, Byungnam Lim, Soohyung Kim, Yon
Dohn Chung (Korea University)
The increasing volume of relational data calls for alternative ways of coping with
it. Recently, several hybrid approaches (e.g., HadoopDB and Hive) between
parallel databases and Hadoop have been introduced to the database community.
Although these hybrid approaches have gained wide popularity, they cannot avoid
the choice of suboptimal execution strategies. We believe that this problem is
caused by the inherent limits of their architectures. In this demo, we present Tajo,
a relational, distributed data warehouse system on shared-nothing clusters. It uses
Hadoop Distributed File System (HDFS) as the storage layer and has its own query
execution engine that we have developed instead of the MapReduce framework. A
Tajo cluster consists of one master node and a number of workers across cluster
nodes. The master is mainly responsible for query planning and the coordinator
for workers. The master divides a query into small tasks and disseminates them
to workers. Each worker has a local query engine that executes a directed acyclic
graph of physical operators. A DAG of operators can take two or more input
sources and be pipelined within the local query engine. In addition, Tajo can control
distributed data flow more flexibly than MapReduce and supports indexing
techniques. By combining these features, Tajo can employ more optimized and
efficient query processing, including the existing methods that have been studied
in the traditional database research areas. To give a deep understanding of the
Tajo architecture and behavior during query processing, the demonstration will
allow users to submit TPC-H queries to 32 Tajo cluster nodes. The web-based user
interface will show (1) how the submitted queries are planned, (2) how the queries
are distributed across nodes, (3) the cluster and node status, and (4) the details of
relations and their physical information. Also, we provide the performance evaluation
of Tajo compared with Hive.
Research 9: Indexing Structures
4 - 5:30PM
Chair: Rui Zhang (University of Melbourne)
Ballroom 1
The Bw-Tree: A B-tree for New Hardware Platforms
Justin Levandoski, David B. Lomet, Sudipta Sengupta (Microsoft Research)
The emergence of new hardware and platforms has led to reconsideration of how
data management systems are designed. However, certain basic functions such as
key indexed access to records remain essential. While we exploit the common
architectural layering of prior systems, we make radically new design decisions
about each layer. Our new form of B-tree, called the Bw-tree, achieves its very
high performance via a latch-free approach that effectively exploits the processor
caches of modern multi-core chips. Our storage manager uses a unique form of log
structuring that blurs the distinction between a page and a record store and works
well with flash storage. This paper describes the architecture and algorithms for
the Bw-tree, focusing on the main memory aspects. The paper includes results of
our experiments that demonstrate that this fresh approach produces outstanding
performance.
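A toy rendering of the latch-free idea, for intuition only (the class below is our illustration, not the Bw-tree implementation): updates are prepended to a page as immutable delta records, installed by a compare-and-swap (CAS) on a mapping table, so no thread ever latches or modifies a page in place.

import threading

class MappingTable:
    """Toy mapping table: page id -> current page state.

    Python lacks a hardware CAS instruction, so a lock emulates one; the
    retry-loop structure of the update is the point of the sketch.
    """
    def __init__(self):
        self.slots = {}
        self._cas_lock = threading.Lock()

    def cas(self, pid, expected, new):
        with self._cas_lock:  # stands in for an atomic compare-and-swap
            if self.slots.get(pid) is expected:
                self.slots[pid] = new
                return True
            return False

def prepend_delta(table, pid, delta):
    while True:  # retry loop typical of latch-free designs
        old = table.slots.get(pid)
        if table.cas(pid, old, ("delta", delta, old)):
            return

table = MappingTable()
table.slots[7] = ("base_page", [("k1", "v1")])
prepend_delta(table, 7, ("insert", "k2", "v2"))
print(table.slots[7][0])  # 'delta' now sits in front of the base page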
Secure and Efficient Range Queries on Outsourced Databases Using ˆR-Trees
Peng Wang, Chinya V. Ravishankar (University of California-Riverside)
We show how to execute range queries securely and efficiently on encrypted
databases in the cloud. Current methods provide either security or efficiency, but
not both. Many schemes even reveal the ordering of encrypted tuples, which, as we
show, allows adversaries to estimate plaintext values accurately. We present the
ˆR-tree, a hierarchical encrypted index that may be securely placed in the cloud and
searched efficiently. It is based on a mechanism we design for encrypted halfspace
range queries in R^d, using Asymmetric Scalar-product Preserving Encryption. Data
owners can tune the ˆR-tree's parameters to achieve desired security-efficiency
tradeoffs. We also present extensive experiments to evaluate ˆR-tree performance.
Our results show that ˆR-tree queries are efficient on encrypted databases, and
reveal far less information than competing methods.
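The building block named in the abstract, Asymmetric Scalar-product Preserving Encryption (ASPE), rests on a simple identity that a few lines of Python make concrete (the core identity only, not the full ˆR-tree scheme):

import numpy as np

rng = np.random.default_rng(0)
d = 4
M = rng.normal(size=(d, d))   # secret invertible key matrix
M_inv = np.linalg.inv(M)

def encrypt_tuple(x):         # what the untrusted server stores
    return M.T @ x

def encrypt_query(q):         # what the trusted owner issues
    return M_inv @ q

x, q = rng.normal(size=d), rng.normal(size=d)
# Scalar products survive encryption: (M^T x) . (M^{-1} q) = x^T M M^{-1} q = x . q,
# so the server can compare tuples against query halfspaces without plaintexts.
print(np.allclose(encrypt_tuple(x) @ encrypt_query(q), x @ q))  # True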
An Efficient and Compact Indexing Scheme for Large-scale Data Store
Peng Lu (National University of Singapore) Sai Wu, Lidan Shou (Zhejiang University)
Kian-Lee Tan (National University of Singapore)
The amount of data managed in today’s Cloud systems has reached an
unprecedented scale. In order to speed up query processing, an effective mechanism
is to build indexes on attributes that are used in query predicates. However,
conventional indexing schemes fail to provide a scalable service: as the sizes of these
indexes are proportional to the data size, it is not space-efficient to build many
indexes. As such, it becomes more crucial to develop effective indexes that provide
scalable database services in the Cloud. In this paper, we propose a compact bitmap
indexing scheme for a large-scale data store. The bitmap indexing scheme combines
state-of-the-art bitmap compression techniques, such as WAH encoding and bit-sliced
encoding. To further reduce the index cost, a novel and query-efficient partial
indexing technique is adopted, which dynamically refreshes the index to handle
updates and process queries. The intuition of our indexing approach is to maximize
the number of indexed attributes, so that a wider range of queries, including range
and join queries, can be efficiently supported. Our indexing scheme is light-weight
and its creation can be seamlessly grafted onto the MapReduce processing engine
without incurring significant running cost. Moreover, the compactness allows us to
maintain the bitmap indexes in memory so that performance overhead of index
access is minimal. We implement our indexing scheme on top of the underlying
Distributed File System (DFS) and evaluate its performance on an in-house cluster.
We compare our index-based query processing with HadoopDB to show its
superior performance. Our experimental results confirm the effectiveness, efficiency
and scalability of the indexing scheme.
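For a concrete feel of the compression named above, here is a simplified word-aligned hybrid (WAH) encoder in Python (illustrative only; real WAH packs fills and literals into machine words):

def wah_encode(bits, w=31):
    """Simplified WAH: split the bitmap into w-bit groups; maximal runs of
    identical all-0 or all-1 groups collapse into one fill word
    ('F', bit, run_length); other groups remain literal words ('L', group)."""
    groups = [tuple(bits[i:i + w]) for i in range(0, len(bits), w)]
    words, i = [], 0
    while i < len(groups):
        g = groups[i]
        if len(g) == w and len(set(g)) == 1:       # candidate fill group
            j = i
            while j < len(groups) and groups[j] == g:
                j += 1
            words.append(("F", g[0], j - i))       # run of identical groups
            i = j
        else:
            words.append(("L", g))
            i += 1
    return words

bitmap = [0] * 93 + [1, 0, 1] + [0] * 28
print(wah_encode(bitmap))  # [('F', 0, 3), ('L', (1, 0, 1, 0, ...))]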
Research 10: Main Memory Query Processing
4 - 5:30PM
Chair: Paul Larson (Microsoft)
St Germaine
Recycling in Pipelined Query Evaluation
Fabian Nagel (The University of Edinburgh) Peter Boncz (Centrum Wiskunde &
Informatica) Stratis Viglas (The University of Edinburgh)
Database systems typically execute queries in isolation. Sharing recurring
intermediate and final results between successive query invocations is ignored
or only exploited by caching final query results. The DBA is kept in the loop to
make explicit sharing decisions by identifying and/or defining materialized views.
Thus decisions are made only after a long time and sharing opportunities may be
missed. Recycling intermediate results has been proposed as a method to make
database query engines profit from opportunities to reuse fine-grained partial
query results; the approach is fully autonomous and able to continuously adapt to changes
in the workload. The technique was recently revisited in the context of MonetDB, a
system that by default materializes all intermediate results. Materializing intermediate
results can consume significant system resources, therefore most other database
systems avoid this where possible, following a pipelined query architecture instead.
The novelty of this paper is to show how recycling can successfully be applied
in pipelined query executors, by tracking the benefit of materializing possible
intermediate results and then choosing the ones making best use of a limited
intermediate result cache. We present ways to maximize the potential of recycling
by leveraging subsumption and proactive query rewriting. We have implemented
our approach in the Vectorwise database engine and have experimentally evaluated
its potential using both synthetic and real-world datasets. Our results show that
intermediate result recycling significantly improves performance.
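The cache-admission question at the heart of this approach can be sketched as follows (the greedy benefit-density policy below is our illustration, not the paper's or Vectorwise's actual algorithm):

def choose_recyclables(candidates, budget):
    """Pick intermediate results to keep in a bounded recycling cache.

    candidates: (name, size, benefit) triples, where benefit estimates the
    execution cost saved if later queries reuse the materialized result.
    Greedy by benefit per unit of cache space.
    """
    chosen, used = [], 0
    for name, size, benefit in sorted(candidates,
                                      key=lambda c: c[2] / c[1], reverse=True):
        if used + size <= budget:
            chosen.append(name)
            used += size
    return chosen

cands = [("scan_filter_orders", 40, 120),   # hypothetical intermediates
         ("join_cust_orders", 70, 150),
         ("agg_by_month", 10, 90)]
print(choose_recyclables(cands, budget=60))  # ['agg_by_month', 'scan_filter_orders']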
Efficient Many-Core Query Execution in Main Memory Column-Stores
Jonathan Dees (SAP AG / Karlsruhe Institute of Technology) Peter Sanders (Karlsruhe
Institute of Technology)
We use the full query set of the TPC-H Benchmark as a case study for the
efficient implementation of decision support queries on main memory column-store
databases. Instead of splitting a query into separate independent operators,
we consider the query as a whole and translate the execution plan into a single
function performing the query. This allows highly efficient CPU utilization, minimal
materialization, and execution in a single pass over the data for most queries. The
single pass is performed in parallel and scales near-linearly with the number of
cores. The resulting query plans for most of the 22 queries are remarkably simple
and are suited for automatic generation and fast compilation. Using a data-parallel,
NUMA-aware many-core implementation with block summaries, inverted index
data structures, and efficient aggregation algorithms, we achieve one to two orders
of magnitude better performance than the current record holders of the TPC-H
Benchmark.
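The "query as a single function" style the abstract describes can be sketched for a TPC-H Q1-like aggregation (our simplified schema; the paper generates and compiles such functions rather than interpreting operators):

from collections import defaultdict

def q1_single_pass(lineitem):
    """Filter, arithmetic, and grouped aggregation fused into one loop, so
    the whole query runs in a single pass over the data with no
    materialized intermediates."""
    acc = defaultdict(lambda: [0.0, 0.0, 0])  # (flag, status) -> [qty, revenue, count]
    for flag, status, qty, price, discount, shipdate in lineitem:
        if shipdate <= "1998-09-02":          # the only scan of the table
            a = acc[(flag, status)]
            a[0] += qty
            a[1] += price * (1 - discount)
            a[2] += 1
    return dict(acc)

rows = [("N", "O", 17, 100.0, 0.05, "1998-01-10"),
        ("N", "O", 3, 50.0, 0.00, "1998-12-01"),   # filtered out
        ("R", "F", 8, 20.0, 0.10, "1997-06-30")]
print(q1_single_pass(rows))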
Main-Memory Hash Joins on Multi-Core CPUs: Tuning to the Underlying
Hardware
Cagri Balkesen, Jens Teubner, Gustavo Alonso (ETH Zürich) M. Tamer Özsu (University of
Waterloo)
The architectural changes introduced with multi-core CPUs have triggered a
redesign of main-memory join algorithms. In the last few years, two diverging
views have appeared. One approach advocates careful tailoring of the algorithm
to the architectural parameters (cache sizes, TLB, and memory bandwidth). The
other approach argues that modern hardware is good enough at hiding cache and
TLB miss latencies and, consequently, the careful tailoring can be omitted without
sacrificing performance. In this paper we demonstrate through experimental analysis
of different algorithms and architectures that hardware still matters. Join algorithms
that are hardware conscious perform better than hardware-oblivious approaches.
The analysis and comparisons in the paper show that many of the claims regarding
the behavior of join algorithms that have appeared in literature are due to selection
effects (relative table sizes, tuple sizes, the underlying architecture, using sorted
data, etc.) and are not supported by experiments run under different parameter
settings. Through the analysis, we shed light on how modern hardware affects the
implementation of data operators and provide the fastest implementation of radix
join to date, reaching close to 200 million tuples per second.
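The hardware-conscious approach the paper champions can be caricatured in a few lines of Python (a sketch of the partitioning logic only; the real implementation is tuned to cache, TLB, and NUMA parameters):

def radix_hash_join(R, S, bits=6):
    """Radix join sketch: partition both inputs on the low-order key bits so
    each partition's hash table fits in cache, then join co-partitions
    independently. R and S are lists of (key, payload) pairs."""
    n = 1 << bits
    parts_R = [[] for _ in range(n)]
    parts_S = [[] for _ in range(n)]
    for k, v in R:
        parts_R[k & (n - 1)].append((k, v))
    for k, v in S:
        parts_S[k & (n - 1)].append((k, v))
    out = []
    for pr, ps in zip(parts_R, parts_S):      # each pair is cache-sized
        table = {}
        for k, v in pr:
            table.setdefault(k, []).append(v)
        for k, v in ps:
            for u in table.get(k, ()):
                out.append((k, u, v))
    return out

print(radix_hash_join([(1, "a"), (65, "b")], [(1, "x"), (2, "y")]))  # [(1, 'a', 'x')]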
Research 11: Data Mining I
4 - 5:30PM
Chair: Gautam Das (Qatar Computing Research Institute)
Bastille 1
Coupled Clustering Ensemble: Incorporating Coupling Relationships Both
between Base Clusterings and Objects
Can Wang, Zhong She, Longbing Cao (University of Technology Sydney)
Clustering ensemble is a powerful approach for improving the accuracy and
stability of individual (base) clustering algorithms. Most of the existing clustering
ensemble methods obtain the final solutions by assuming that base clusterings
perform independently of one another and that all objects are independent too.
However, in real-world data sources, objects are more or less associated in terms
of certain coupling relationships. Base clusterings trained on the source data
are complementary to one another since each of them may capture only a
specific aspect rather than the full picture of the data. In this paper, we discuss the problem
of explicating the dependency between base clusterings and between objects in
clustering ensembles, and propose a framework for coupled clustering ensembles
(CCE). CCE not only considers but also integrates the coupling relationships
between base clusterings and between objects. Specifically, we involve both the
intra-coupling within one base clustering (i.e., cluster label frequency distribution)
and the inter-coupling between different base clusterings (i.e., cluster label
co-occurrence dependency). Furthermore, we engage both the intra-coupling between
two objects in terms of the base clustering aggregation and the inter-coupling
among other objects in terms of neighborhood relationship. This is the first work
which explicitly addresses the dependency between base clusterings and between
objects, verified by the application of such couplings in three types of consensus
functions: clustering-based, object-based and cluster-based. Substantial experiments
on synthetic and UCI data sets demonstrate that the CCE framework can effectively
capture the interactions embedded in base clusterings and objects with higher
clustering accuracy and stability compared to several state-of-the-art techniques,
which is also supported by statistical analysis.
Focused Matrix Factorization For Audience Selection in Display Advertising
Bhargav Kanagal, Amr Ahmed (Google Inc.) Sandeep Pandey (Twitter Inc.) Vanja Josifovski,
Lluis Garcia-Pueyo (Google Inc.) Jeff Yuan (Yahoo! Research)
Audience selection is a key problem in display advertising systems in which we need
to select a list of users who are interested (i.e., most likely to buy) in an advertising
campaign. The users' past feedback on this campaign can be leveraged to construct
such a list using collaborative filtering techniques such as matrix factorization.
However, the user-campaign interaction is typically extremely sparse, hence the
conventional matrix factorization does not perform well. Moreover, simply combining
the users' feedback from all campaigns does not address this, since it dilutes the
focus on the target campaign under consideration. To resolve these issues, we propose a
novel focused matrix factorization model (FMF) which learns users' preferences
towards the specific campaign products, while also exploiting the information
about related products. We exploit the product taxonomy to discover related
campaigns, and design models to discriminate between the users' interest towards
campaign products and non-campaign products. We develop a parallel multi-core
implementation of the FMF model and evaluate its performance over a real-world
advertising dataset spanning more than a million products. Our experiments
demonstrate the benefits of using our models over existing approaches.
Graph Stream Classification using Labeled and Unlabeled Graphs
Shirui Pan, Xingquan Zhu, Chengqi Zhang (University of Technology Sydney) Philip S. Yu
(University of Illinois at Chicago)
Graph classification is becoming increasingly popular due to the rapidly rising
applications involving data with structural dependency. The widespread use of graph
applications and the inherently complex relationships between graph objects have
made the labels of the graph data expensive and/or difficult to obtain, especially
for applications involving dynamic changing graph records. While labeled graphs
are limited, copious amounts of unlabeled graphs are often easy to obtain
with trivial effort. In this paper, we propose a framework to build a stream-based
graph classification model by combining both labeled and unlabeled graphs. Our
method, called gSLU, employs an ensemble-based framework to partition graph
streams into a number of graph chunks each containing some labeled and unlabeled
graphs. For each individual chunk, we propose a minimum-redundancy subgraph
feature selection module to select a set of informative subgraph features to build a
classifier. To tackle the concept drifting in graph streams, an instance level weighting
mechanism is used to dynamically adjust the instance weight, through which the
subgraph feature selection can emphasize difficult graph samples. The classifiers
built from different graph chunks form an ensemble for graph stream classification.
Experiments on real-world graph streams demonstrate clear benefits of using
minimum-redundancy subgraph features to build accurate classifiers. By employing
instance level weighting, our graph ensemble model can effectively adapt to the
concept drifting in the graph stream for classification.
Research 12: Moving Objects
4 - 5:30PM
Chair: Shuo Shang (Aalborg University)
Bastille 2
T-Share: A Large-Scale Dynamic Taxi Ridesharing Service
Shuo Ma (University of Illinois at Chicago / Microsoft Research Asia) Yu Zheng (Microsoft
Research Asia) Ouri Wolfson (University of Illinois at Chicago / Microsoft Research Asia)
Taxi ridesharing can be of significant social and environmental benefit, e.g. by saving
energy consumption and satisfying more people’s commute needs in peak hours.
Despite the great potential, taxi ridesharing, especially with dynamic queries, is not
well studied. In this paper, we formally define the dynamic ridesharing problem and
propose a large-scale taxi ridesharing service. It efficiently serves real-time requests
sent by taxi users and generates ridesharing schedules that reduce the total travel
distance significantly. In our method, we first propose a taxi searching algorithm using
a spatio-temporal index to quickly retrieve candidate taxis that are likely to satisfy
a user query. A scheduling algorithm is then proposed. It checks each candidate taxi
and inserts the query’s trip into the schedule of the taxi which satisfies the query
with minimum additional incurred travel distance. To tackle the heavy computational
load, a lazy shortest path calculation strategy is devised to speed up the scheduling
algorithm. We evaluated our service using a GPS trajectory dataset generated by
over 33,000 taxis during a period of 3 months. By learning the spatio-temporal
distributions and the stochastic process of real user queries from this dataset, we
built an experimental platform that can simulate user behaviours in taking a taxi in
the real world. Tested on this platform with extensive experiments, our approach
demonstrated its efficiency, effectiveness, and scalability. For example, our proposed
service can serve 25% additional taxi users while saving 13% travel distance
compared with no-ridesharing (when the ratio of the number of queries to the
number of taxis is 6).
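The scheduling step described above, inserting a query's trip where it adds the least travel distance, can be sketched as follows (our illustration; capacity and time-window checks, and the lazy shortest-path strategy, are omitted):

def best_insertion(schedule, pickup, dropoff, dist):
    """Try every way to insert (pickup, dropoff) into a taxi's waypoint
    schedule (keeping the current position first and pickup before dropoff)
    and return the cheapest by additional travel distance. dist(a, b) stands
    for a road-network distance oracle."""
    def length(path):
        return sum(dist(a, b) for a, b in zip(path, path[1:]))

    base, best = length(schedule), None
    for i in range(1, len(schedule) + 1):           # pickup position
        for j in range(i, len(schedule) + 1):       # dropoff no earlier
            cand = (schedule[:i] + [pickup] + schedule[i:j]
                    + [dropoff] + schedule[j:])
            extra = length(cand) - base
            if best is None or extra < best[0]:
                best = (extra, cand)
    return best

euclid = lambda a, b: ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
extra, plan = best_insertion([(0, 0), (10, 0)], (4, 1), (6, 1), euclid)
print(round(extra, 2), plan)  # 0.25 extra distance for the detour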
Efficient Notification of Meeting Points for Moving Groups via Independent Safe
Regions
Jing Li (The University of Hong Kong) Man Lung Yiu (Hong Kong Polytechnic University)
Nikos Mamoulis (The University of Hong Kong)
In applications like social networking services and online games, multiple moving
users form a group and wish to be continuously notified with the best meeting
point from their locations. A promising technique for reducing the communication
frequency of the application server is to apply safe regions, which capture the validity
of query results with respect to the users’ locations. Unfortunately, the safe regions
in our problem exhibit characteristics such as irregular shapes and dependency
among multiple safe regions. These unique characteristics render existing safe region
methods that focus on a single safe region inapplicable to our problem. To tackle
these challenges, we first examine the shapes of safe regions in our problem context
and propose feasible approximations for them. We design efficient algorithms
for computing these safe regions, as well as develop compression techniques for
representing safe regions in a compact manner. Experiments with both real and
synthetic data demonstrate the efficiency of our proposal in terms of computation
and communication costs.
Efficient Distance-Aware Query Evaluation on Indoor Moving Objects
Xike Xie, Hua Lu, Torben Bach Pedersen (Aalborg University)
Indoor spaces accommodate large parts of people’s life. The increasing availability of
indoor positioning, driven by technologies like Wi-Fi, RFID, and Bluetooth, enables
a variety of indoor location-based services (LBSs). Efficient indoor distance-aware
queries on indoor moving objects play an important role in supporting and boosting
such LBSs. However, the distance-aware query evaluation on indoor moving objects
is challenging because: (1) indoor spaces are characterized by many special entities
and thus render distance calculation very complex; (2) the limitations of indoor
positioning technologies create inherent uncertainties in indoor moving objects data.
In this paper, we propose a complete set of techniques for efficient distance-aware
queries on indoor moving objects. We define and categorize the indoor distances in
relation to indoor uncertain objects, and derive different distance bounds that can
facilitate query evaluation. Existing works often assume indoor floor plans are static,
and require extensive pre-computation on indoor topologies. In contrast, we design
a composite index scheme that integrates indoor geometries, indoor topologies, as
well as indoor uncertain objects, and thus supports indoor distance-aware queries
efficiently without time-consuming and volatile distance computation. We design
algorithms for range query and k nearest neighbor query on indoor moving objects.
The results of extensive experimental studies demonstrate that our proposals
are efficient and scalable in evaluating distance-aware queries over indoor moving
objects.
Seminar 4: Knowledge Harvesting from Text and Web Sources
4 - 5:30PM
Odeon
Fabian Suchanek, Gerhard Weikum (Max Planck Institute for Informatics)
The proliferation of knowledge-sharing communities such as Wikipedia and the
progress in scalable information extraction from Web and text sources have enabled
the automatic construction of very large knowledge bases. Recent endeavors
of this kind include academic research projects such as DBpedia, KnowItAll,
Probase, ReadTheWeb, and YAGO, as well as industrial ones such as Freebase and
Trueknowledge. These projects provide automatically constructed knowledge bases
of facts about named entities, their semantic classes, and their mutual relationships.
Such world knowledge in turn enables cognitive applications and knowledge-centric
services like disambiguating natural-language text, deep question answering, and
semantic search for entities and relations in Web and enterprise data. Prominent
examples of how knowledge bases can be harnessed include the Google Knowledge
Graph and the IBM Watson question answering system. This tutorial presents
state-of-the-art methods, recent advances, research opportunities, and open challenges
along this avenue of knowledge harvesting and its applications.
Industry 3
4 - 5:30PM
Chair: Vibhor Rastogi (Google Inc.)
Concorde
Pipe Failure Prediction: A Data Mining Method
Rui Wang (University of Science & Technology of China) Weishan Dong, Yu Wang (IBM
Research – China) Ke Tang (University of Science & Technology of China) Xin Yao
(University of Science & Technology of China / The University of Birmingham)
Pipe breaks in urban water distribution network lead to significant economical and
social costs, putting the service quality as well as the profit of water utilities at risk.
To cope with such a situation, scheduled preventive maintenance is desired, which
aims to predict and fix potential pipe breaks proactively. Physical models developed
for understanding and predicting the failure of pipes are usually expensive, thus can
only be used on a limited number of trunk pipes. As an alternative, statistical models
that try to predict pipe breaks based on historical data are far less expensive,
and therefore have attracted a lot of interests from water utilities recently. In this
paper, we report a novel data mining prediction system that has been built for a
water utility in a big Chinese city. Various aspects of how to build such a system are
described, including problem formulation, data cleaning, model construction, as well
as evaluating the importance of attributes according to the requirements of end
users in water utilities. Satisfactory results have been achieved by our prediction
system. For example, with the system trained on the available dataset at the end of
2010, the water utility would avoid 50% of pipe breaks in 2011 by examining only
6.98% of its pipes in advance. During the construction of the system, we find that
the extremely skewed distribution of break and non-break pipes, interestingly, is not an
obstacle. This lesson could serve as a practical reference for both academic studies
on imbalanced learning as well as future explorations on pipe failure prediction
problems.
SASH: Enabling Continuous Incremental Analytic Workflows on Hadoop
Manish Sethi, Narendran Sachindran, Sriram Raghavan (IBM India Research Lab)
There is an emerging class of enterprise applications in areas such as log data
analysis, information discovery, and social media marketing that involve analytics
over large volumes of unstructured and semi-structured data. These applications
are leveraging new analytics platforms based on the MapReduce framework and its
open source Hadoop implementation. While this trend has engendered work on
high-level data analysis languages, NoSQL data stores, workflow engines etc., there
has been very little attention to the challenges of deploying analytic workflows
into production for continuous operation. In this paper, we argue that an essential
platform component for enabling continuous production analytic workflows is an
analytics store. We highlight five key requirements that impact the design of such a
store: (i) efficient incremental operations, (ii) flexible storage model for hierarchical
data, (iii) snapshot isolation, (iv) object-level incremental updates, and (v) support
for handling change sets. We describe the design of SASH, a scalable analytics store
that we have developed on top of HBase to address these requirements. Using the
workload from a production workflow that powers search within IBM’s intranet and
extranet, we demonstrate orders of magnitude improvement in IO performance
using SASH.
Automating Pattern Discovery for Rule Based Data Standardization Systems
Snigdha Chaturvedi, Hima Prasad K, Tanveer A. Faruquie, Bhupesh S. Chawda, L Venkata
Subramaniam, Raghuram Krishnapuram (IBM Research-India)
Data quality is a perennial problem for many enterprise data assets. To improve data
quality, businesses often employ rule based data standardization systems in which
domain experts code rules for handling important and prevalent patterns. Finding
these patterns is laborious and time consuming, particularly for noisy or highly
specialized data sets. The outcome also depends on the subjective judgment of the persons determining these patterns.
In this paper we present a tool to automatically mine patterns that can help in
improving the efficiency and effectiveness of these data standardization systems. The
automatically extracted patterns are used by the domain and knowledge experts
for rule writing. We use a greedy algorithm to extract patterns that result in a
maximal coverage of data. We further group the extracted patterns such that each
group represents patterns that capture similar domain knowledge. We propose a
similarity measure that uses input pattern semantics to group these patterns. We
demonstrate the effectiveness of our method for standardization tasks on three real
world datasets.
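The greedy maximal-coverage step can be pictured with a toy token-class signature (our sketch, not the paper's exact algorithm or similarity measure):

from collections import Counter

def greedy_patterns(records, extract, k=5):
    """Greedily pick the k patterns covering the most records; extract maps
    a record to its pattern, e.g., a token-class signature."""
    uncovered = list(records)
    picked = []
    for _ in range(k):
        if not uncovered:
            break
        counts = Counter(extract(r) for r in uncovered)
        pattern, _ = counts.most_common(1)[0]   # largest remaining coverage
        picked.append(pattern)
        uncovered = [r for r in uncovered if extract(r) != pattern]
    return picked

sig = lambda s: "-".join("NUM" if t.isdigit() else "WORD" for t in s.split())
addrs = ["42 Main St", "7 High St", "Main 9", "12 Low Rd", "Oak 3"]
print(greedy_patterns(addrs, sig, k=2))  # ['NUM-WORD-WORD', 'WORD-NUM']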
Demo Groups 1 & 2
4 - 5:30PM
Ballroom 2
See Demo Groups 1 & 2 (p. 49) for demonstration details.
DETAILED PROGRAM FOR WEDNESDAY 10 APRIL
Wednesday 10 April
Keynote 2
9 - 10AM
Chair: Chris Jermaine (Rice University)
Ballroom Le Grand
Recent Advances on Structured Data and the Web
Alon Halevy (Google Inc.)
Abstract: The World-Wide Web contains vast quantities of structured data on a
variety of domains, such as hobbies, products and reference data. Moreover, the Web
provides a platform that can encourage publishing more data sets from governments
and other public organizations and support new data management opportunities,
such as effective crisis response, data journalism and crowd-sourcing data sets. For
the first time since the emergence of the Web, structured data is being used widely
by search engines and is being collected via a concerted effort. I will describe some
of the efforts we are conducting at Google to collect structured data, filter the
high-quality content, and serve it to our users. These efforts include providing Google
Fusion Tables, a service for easily ingesting, visualizing and integrating data, mining the
Web for high-quality HTML tables, and contributing these data assets to Google’s
other services.
Bio: Alon Halevy heads the Structured Data Management Research group at
Google. Prior to that, he was a professor of Computer Science at the University of
Washington in Seattle, where he founded the database group. In 1999, Dr. Halevy
co-founded Nimble Technology, one of the first companies in the Enterprise
Information Integration space, and in 2004, Dr. Halevy founded Transformic, a
company that created search engines for the deep web, and was acquired by
Google. Dr. Halevy is a Fellow of the Association for Computing Machinery, received
the Presidential Early Career Award for Scientists and Engineers (PECASE) in 2000,
and was a Sloan Fellow (1999-2000). He received his Ph.D in Computer Science
from Stanford University in 1993 and his Bachelors from the Hebrew University
in Jerusalem. Halevy is also a coffee culturalist and the author of “The Infinite
Emotions of Coffee” (2011), as well as a co-author of “Principles of
Data Integration” (2012).
Research 13: Data Cleaning
10:30 - 12PM
Chair: Raghav Kaushik (Microsoft)
St Germaine
HANDS: A Heuristically Arranged Non-Backup In-line Deduplication System
Avani Wildani, Ethan L. Miller (University of California Santa Cruz) Ohad Rodeh (IBM
Almaden Research Center)
Deduplicating in-line data on primary storage is hampered by the disk bottleneck
problem, an issue which results from the need to keep an index mapping portions
of data to hash values in memory in order to detect duplicate data without paying
the performance penalty of disk paging. The index size is proportional to the volume
of unique data, so placing the entire index into RAM is not cost effective with a
deduplication ratio below 45%. HANDS reduces the amount of in-memory index
storage required by up to 99% while still achieving between 30% and 90% of the
deduplication a full memory-resident index provides, making primary deduplication
cost effective in workloads with deduplication rates as low as 8%. HANDS is a
framework that dynamically pre-fetches fingerprints from disk into memory cache
according to working sets statistically derived from access patterns. We use a simple
neighborhood grouping as our statistical technique to demonstrate the effectiveness
of our approach. HANDS is modular and requires only spatio-temporal data, making
it suitable for a wide range of storage systems without the need to modify host file
systems.
Holistic Data Cleaning: Putting Violations Into Context
Xu Chu (University of Waterloo) Ihab Ilyas, Paolo Papotti (Qatar Computing Research
Institute)
Data cleaning is an important problem and data quality rules are the most
promising way to face it with a declarative approach. Previous work has focused on
specific formalisms, such as functional dependencies (FDs), conditional functional
dependencies (CFDs), and matching dependencies (MDs), and those have always
been studied in isolation. Moreover, such techniques are usually applied in a
pipeline or interleaved. In this work we tackle the problem in a novel, unified
framework. First, we let users specify quality rules using denial constraints with
ad-hoc predicates. This language subsumes existing formalisms and can express rules
involving numerical values, with predicates such as “greater than” and “less than”.
More importantly, we exploit the interaction of the heterogeneous constraints by
encoding them in a conflict hypergraph. Such holistic view of the conflicts is the
starting point for a novel definition of "repair context" which allows us to compute
automatically repairs of better quality w.r.t. previous approaches in the literature.
Experimental results on real datasets show that the holistic approach outperforms
previous algorithms in terms of quality and efficiency of the repair.
Inferring Data Currency and Consistency for Conflict Resolution
Wenfei Fan (The University of Edinburgh / Beihang University) Floris Geerts (University of
Antwerp) Nan Tang (Qatar Computing Research Institute) Wenyuan Yu (The University of
Edinburgh)
This paper introduces a new approach for conflict resolution: given a set of tuples
pertaining to the same entity, it is to identify a single tuple in which each attribute
has the latest and consistent value in the set. This problem is important in data
integration, data cleaning and query answering. It is, however, challenging since in
practice, reliable timestamps are often absent, among other things. We propose
a model for conflict resolution, by specifying data currency in terms of partial
currency orders and currency constraints, and by enforcing data consistency
with constant conditional functional dependencies. We show that identifying data
currency orders helps us repair inconsistent data, and vice versa. We investigate a
number of fundamental problems associated with conflict resolution, and establish
their complexity. In addition, we introduce a framework and develop algorithms for
conflict resolution, by integrating data currency and consistency inferences into a
single process, and by interacting with users. We experimentally verify the accuracy
and efficiency of our methods using real-life and synthetic data.
Research 14: Social Media I
10:30 - 12PM
Chair: Kevin Chang (University of Illinois at Urbana-Champaign)
Bastille 1
LSII: An Indexing Structure for Exact Real-Time Search on Microblogs
Lingkun Wu (Nanyang Technological University / A*STAR Singapore) Wenqing Lin, Xiaokui
Xiao (Nanyang Technological University) Yabo Xu (Sun Yat-Sen University)
Indexing microblogs for real-time search is challenging given the efficiency issue
caused by the tremendous speed at which new microblogs are created by users.
Existing approaches address this efficiency issue at the cost of query accuracy, as
they either (i) exclude a significant portion of microblogs from the index to reduce
update cost or (ii) rank microblogs mostly by their timestamps (without sufficient
consideration of their relevance to the queries) to enable append-only index
insertion. As a consequence, the search results returned by the existing approaches
do not satisfy the users who demand timely and high-quality search results. To
remedy this deficiency, we propose the Log-Structured Inverted Indices (LSII), a
structure for exact real-time search on microblogs. The core of LSII is a sequence of
inverted indices with exponentially increasing sizes, such that new microblogs are
(i) first inserted into the smallest index and (ii) later merged with the larger indices
in a batch manner. The batch insertion mechanism ensures a small amortized update
cost for each new microblog, without significantly degrading query performance. We
present a comprehensive study on LSII, exploring various design options to strike
a good balance between query and update performance. In addition, we propose
extensions of LSII to support personalized search and to exploit multi-threading for
performance improvement. Extensive experiments on real data demonstrate the
efficiency of LSII.
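The exponential level structure at the core of LSII can be mimicked in miniature (our toy sketch; the real system also maintains ranking information and supports concurrent queries):

class LSMInvertedIndex:
    """Toy log-structured inverted index: new postings enter the smallest
    level; a full level is batch-merged into the next, exponentially larger
    one, keeping per-document update cost small and amortized."""
    def __init__(self, base=4, levels=4):
        self.caps = [base * (2 ** i) for i in range(levels)]  # 4, 8, 16, 32 docs
        self.levels = [{} for _ in range(levels)]             # term -> doc ids
        self.sizes = [0] * levels

    def insert(self, doc_id, terms):
        for t in terms:
            self.levels[0].setdefault(t, []).append(doc_id)
        self.sizes[0] += 1
        lv = 0                                                # cascade merges
        while lv + 1 < len(self.levels) and self.sizes[lv] >= self.caps[lv]:
            for t, ids in self.levels[lv].items():
                self.levels[lv + 1].setdefault(t, []).extend(ids)
            self.sizes[lv + 1] += self.sizes[lv]
            self.levels[lv], self.sizes[lv] = {}, 0
            lv += 1

    def search(self, term):                                   # consult all levels
        return [d for lvl in self.levels for d in lvl.get(term, [])]

idx = LSMInvertedIndex()
for i in range(10):
    idx.insert(i, ["icde", "tweet%d" % i])
print(sorted(idx.search("icde")))  # every microblog remains searchable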
Utilizing Users’ Tipping Points in E-Commerce Recommender Systems
Kailun Hu, Wynne Hsu, Mong Li Lee (National University of Singapore)
Existing recommendation algorithms assume that users make their purchase
decisions solely based on individual preferences, without regard to the purchase
behavior of other users. Yet, extensive studies have shown that there are two types
of users: innovators and imitators. Innovators tend to make purchase decisions
based solely on their own preferences; whereas imitators’ purchase decisions are
often influenced by social pressure from other users. In this paper, we propose a
framework that seamlessly incorporates the influence of social pressure into existing
recommendation algorithms. We utilize the Bass model to classify each user as either
an innovator or imitator according to his/her previous purchase behavior. In addition,
we introduce the concept of pressure point of a user to capture the user’s reaction
to varying degrees of social pressure when making a purchase decision. We then
refine two widely-adopted recommendation algorithms to incorporate the effect
of social pressure in relation to the user’s pressure point. Experiment results on a
real-world dataset obtained from an E-commerce website show that the proposed
approach outperforms existing algorithms.
Presenting Diverse Location Views with Real-time Near-duplicate Photo
Elimination
Jiajun Liu, Zi Huang (The University of Queensland) Hong Cheng (The Chinese Univerity
of Hong Kong) Yueguo Chen (Renmin University of China) Heng Tao Shen (The University
of Queensland) Yanchun Zhang (Victoria University)
Supported by the technical advances and the commercial success of GPS-enabled
mobile devices, geo-tagged photos have drawn plenteous attention in the research
community. The explosive growth of geo-tagged photos enables many large-scale
applications, such as location-based photo browsing, landmark recognition, etc.
Meanwhile, as the number of geo-tagged photos continues to climb, new challenges
are brought to various applications. The existence of massive near-duplicate
geo-tagged photos jeopardizes the effective presentation for the above applications. A
new dimension in the search and presentation of geo-tagged photos is urgently
demanded. In this paper, we devise a location visualization framework to efficiently
retrieve and present diverse views captured within a local proximity. Novel photos,
in terms of capture locations and visual content, are identified and returned in
response to a query location for diverse visualization. For real-time response
and good scalability, a new Hybrid Index structure which integrates R-tree and
Geographic Grid is proposed to quickly identify the Maximal Near-duplicate Photo
Groups (MNPG) in the query proximity. The most novel photos from different
groups are then returned to generate diverse views on the location. Extensive
experiments on synthetic and real-life photo datasets prove the novelty and
efficiency of our methods.
Research 15: Data Trust
10:30 - 12PM
Chair: Stefano Paraboschi (University of Bergamo)
Bastille 2
Publicly Verifiable Grouped Aggregation Queries on Outsourced Data Streams
Suman Nath, Ramarathnam Venkatesan (Microsoft Research)
Outsourcing data streams and desired computations to a third party such as the
cloud is a desirable option to many companies. However, data outsourcing and
remote computations intrinsically raise issues of trust, making it crucial to verify
results returned by third parties. In this context, we propose a novel solution to
verify outsourced grouped aggregation queries (e.g., histogram or SQL Group-by
queries) that are common in many business applications. We consider a setting
where a data owner employs an untrusted remote server to run continuous
grouped aggregation queries on a data stream it forwards to the server. Untrusted
clients then query the server for results and efficiently verify correctness of the
results by using a small and easy-to-compute signature provided by the data owner.
Our work complements previous works on authenticating remote computation of
selection and aggregation queries. The most important aspect of our solution is that
it is publicly verifiable---unlike most prior works, we support untrusted clients (who
can collude with other clients or with the server). Experimental results on real and
synthetic data show that our solution is practical and efficient.
Trustworthy Data from Untrusted Databases
Rohit Jain, Sunil Prabhakar (Purdue University)
Ensuring the trustworthiness of data retrieved from a database is of utmost
importance to users. The correctness of data stored in a database is defined by the
faithful execution of only valid (authorized) transactions. In this paper we address
the question of whether it is necessary to trust a database server in order to trust
the data retrieved from it. The lack of trust arises naturally if the database server is
owned by a third party, as in the case of cloud computing. It also arises if the server
may have been compromised, or there is a malicious insider. In particular, we reduce
the level of trust necessary in order to establish the authenticity and integrity of data
at an untrusted server. Earlier work on this problem is limited to situations where
there are no updates to the database, or all updates are authorized and vetted
by a central trusted entity. This is an unreasonable assumption for a truly dynamic
database, as would be expected in many business applications, where multiple clients
can update data without having to check with a central server that approves of
their changes. We identify the problem of ensuring trustworthiness of data at an
untrusted server in the presence of transactional updates that run directly on the
database, and develop the first solutions to this problem. Our solutions also provide
indemnity for an honest server and assured provenance for all updates to the data.
We implement our solution in a prototype system built on top of Oracle with no
modifications to the database internals. We also provide an empirical evaluation of
the proposed solutions and establish their feasibility.
On the Relative Trust between Inconsistent Data and Inaccurate Constraints
George Beskales, Ihab Ilyas (Qatar Computing Research Institute) Lukasz Golab, Artur
Galiullin (University of Waterloo)
Functional dependencies (FDs) specify the intended data semantics while violations
of FDs indicate deviation from these semantics. In this paper, we study a data
cleaning problem in which the FDs may not be completely correct, e.g., due to data
evolution or incomplete knowledge of the data semantics. We argue that the notion
of relative trust is a crucial aspect of this problem: if the FDs are outdated, we
should modify them to fit the data, but if we suspect that there are problems with
the data, we should modify the data to fit the FDs. In practice, it is usually unclear
how much to trust the data versus the FDs. To address this problem, we propose an
algorithm for generating non-redundant solutions (i.e., simultaneous modifications
of the data and the FDs) corresponding to various levels of relative trust. This can
help users determine the best way to modify their data and/or FDs to achieve
consistency.
Research 16: Data on the Cloud
10:30 - 12PM
Chair: Karl Aberer (EPFL)
Concorde
Catch the Wind: Graph Workload Balancing on Cloud
Zechao Shang, Jeffrey Xu Yu (The Chinese University of Hong Kong)
Graph partitioning is a key issue in graph database processing systems for achieving
high efficiency on the Cloud. However, balanced graph partitioning itself is difficult
because it is known to be NP-complete. In addition, a static graph partitioning
cannot keep all graph algorithms efficient in parallel on the Cloud for long,
because the workload balance may differ across iterations and across graph
algorithms. In this paper, we investigate graph behaviors by exploring
the working window (we call it wind) changes, where a working window is a set of
active vertices that a graph algorithm really needs to access in parallel computing.
We investigate nine classic graph algorithms using real datasets, and propose simple
yet effective policies that can achieve both high graph workload balancing and
efficient partitioning on the Cloud.
EAGRE: Towards Scalable I/O Efficient SPARQL Query Evaluation on the Cloud
Xiaofei Zhang, Lei Chen, Yongxin Tong (Hong Kong University of Science and Technology)
Min Wang (HP Labs China)
To benefit from the Cloud platform's unlimited resources, effective management
and query evaluation over huge volumes of RDF data in a scalable manner attract
intensive research efforts. Progress has been made on evaluating SPARQL
queries with either high-level declarative programming languages, like Pig and
Sward, or simple MapReduce jobs, both of which tend to answer the query with
multiple joins. However, due to the simplicity of Cloud storage and the coarse
organization of RDF data in existing solutions, multiple join operations bring
significant I/O traffic that severely degrades the system performance. In this work,
we first propose EAGRE, an Entity Aware Graph compREssion technique to form a
new representation of RDF data on Cloud platforms. Then, based on a novel cost
model, we propose an I/O efficient strategy to evaluate SPARQL queries as quickly
as possible, especially queries with solution modifiers specified, e.g., PROJECTION,
ORDER BY, etc. We implement a prototype system and conduct extensive
experiments over both real and synthetic data sets on an in-house cluster. The
experimental results show that our solution can achieve over an order of magnitude
of time savings for SPARQL query evaluation compared to the state-of-the-art
MapReduce-based solutions.
C-Cube: Elastic Continuous Clustering in the Cloud
Zhenjie Zhang (Advanced Digital Sciences Center) Hu Shu, Zhihong Chong (Southeast
University, China) Hua Lu (Aalborg University) Yin Yang (Advanced Digital Sciences
Center)
Continuous clustering analysis over a data stream reports clustering results
incrementally as updates arrive. Such analysis has a wide spectrum of applications,
including traffic monitoring and topic discovery on microblogs. A common
characteristic of streaming applications is that the amount of workload fluctuates,
often in an unpredictable manner. On the other hand, most existing solutions for
continuous clustering assume either a central server, or a distributed setting with
a fixed number of dedicated servers. In other words, they are not elastic, meaning
that they cannot dynamically adapt the amount of computational resources to
the fluctuating workload. Consequently, they incur considerable waste of resources,
as the servers are under-utilized when the amount of workload is low. This paper
proposes C-Cube, the first elastic approach to continuous streaming clustering.
Similar to popular cloud-based paradigms such as MapReduce, C-Cube routes each
new record to a processing unit, e.g., a virtual machine, based on its hash value.
Each processing unit performs the required computations, and sends its results to a
lightweight aggregator. This design enables dynamically adding/removing processing units,
as well as replacing faulty ones and re-running their tasks. In addition to elasticity,
C-Cube is also effective (in that it provides quality guarantees on the clustering
results), efficient (it minimizes the computational workload at all times), and generally
applicable to a large class of clustering criteria. We implemented C-Cube in a real
system based on Twitter Storm, and evaluated it using real and synthetic datasets.
Extensive experimental results confirm our performance claims.
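The hash-routing backbone of such a design fits in a few lines (our sketch; C-Cube's actual contribution lies in the clustering computation, the quality guarantees, and the state handling during scaling, all omitted here):

import hashlib

def route(record_key, units):
    """Route a stream record to one of the active processing units by hash;
    elasticity comes from growing or shrinking the unit list under load."""
    h = int(hashlib.md5(record_key.encode()).hexdigest(), 16)
    return units[h % len(units)]

units = ["vm-0", "vm-1", "vm-2"]
print(route("tweet:12345", units))   # deterministic choice among 3 units
units.append("vm-3")                 # scale out: add a processing unit
print(route("tweet:12345", units))   # the record may now land elsewhere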
Seminar 5: Sorting in Space: Multidimensional, Spatial, and Metric
Data Structures for Applications in Spatial Databases, Geographic
Information Systems (GIS), and Location-based Services
10:30 - 12PM
Odeon
Hanan Samet (University of Maryland)
Techniques for representing multidimensional, spatial, and metric data for applications
in spatial databases, geographic information systems (GIS), and location-based
services are reviewed.
This includes both geometric and textual representations of spatial data.
SAP Business Lunch & ICDE Award Presentations
12 - 2PM
Chair: Rao Kotagiri (University of Melbourne)
Ballroom Le Grand
Keynote 4: 10 Year Most Influential Papers
2 - 3PM
Chair: Rao Kotagiri (University of Melbourne)
Ballroom Le Grand
Schema Mediation in Peer Data Management Systems [ICDE 2003]
Alon Y. Halevy, Zachary G. Ives, Dan Suciu, Igor Tatarinov (University of Washington)
Intuitively, data management and data integration tools should be well-suited for
exchanging information in a semantically meaningful way. Unfortunately, they suffer
from two significant problems: they typically require a comprehensive schema
design before they can be used to store or share information, and they are difficult
to extend because schema evolution is heavyweight and may break backwards
compatibility. As a result, many small-scale data sharing tasks are more easily
facilitated by non-database-oriented tools that have little support for semantics.
The goal of the peer data management system (PDMS) is to address this need:
we propose the use of a decentralized, easily extensible data management
architecture in which any user can contribute new data, schema information, or even
mappings between other peers’ schemas. PDMSs represent a natural step beyond
data integration systems, replacing their single logical schema with an interlinked
collection of semantic mappings between peers’ individual schemas. This paper
considers the problem of schema mediation in a PDMS. Our first contribution is
a flexible language for mediating between peer schemas, which extends known
data integration formalisms to our more complex architecture. We precisely
characterize the complexity of query answering for our language. Next, we describe
a reformulation algorithm for our language that generalizes both global-as-view and
local-as-view query answering algorithms. Finally, we describe several methods for
optimizing the reformulation algorithm, and an initial set of experiments studying its
performance.
Similarity Flooding: A Versatile Graph Matching Algorithm and its Application to
Schema Matching [ICDE 2002]
Sergey Melnik, Hector Garcia-Molina (Stanford University) Erhard Rahm (University of
Leipzig)
Matching elements of two data schemas or two data instances plays a key role in
data warehousing, e-business, or even biochemical applications. In this paper we
present a matching algorithm based on a fixpoint computation that is usable across
different scenarios. The algorithm takes two graphs (schemas, catalogs, or other data
structures) as input, and produces as output a mapping between corresponding
nodes of the graphs. Depending on the matching goal, a subset of the mapping is
chosen using filters. After our algorithm runs, we expect a human to check and if
necessary adjust the results. As a matter of fact, we evaluate the ‘accuracy’ of the
algorithm by counting the number of needed adjustments. We conducted a user
study, in which our accuracy metric was used to estimate the labor savings that the
users could obtain by utilizing our algorithm to obtain an initial matching. Finally,
we illustrate how our matching algorithm is deployed as one of several high-level
operators in an implemented testbed for managing information models and
mappings.
Research 17: Similarity Ranking
3:30 - 5PM
Chair: Ihab Ilyas (Qatar Computing Research Institute)
St Germaine
Efficient Search Algorithm for SimRank
Yasuhiro Fujiwara (NTT Software Innovation Center) Makoto Nakatsuji (NTT Service
Evolution Laboratories) Hiroaki Shiokawa, Makoto Onizuka (NTT Software Innovation
Center)
Graphs are a fundamental data structure and have been employed to model
objects as well as their relationships. The similarity of objects on the web (e.g.,
webpages, photos, music, micro-blogs, and social networking service users) is the
key to identifying relevant objects in many recent applications. SimRank, proposed
by Jeh and Widom, provides a good similarity score and has been successfully used
in many applications such as web spam detection, collaborative tagging analysis,
link prediction, and so on. SimRank computes similarities iteratively, and it needs
O(N^4 T) time and O(N^2) space for similarity computation, where N and T are the
number of nodes and iterations, respectively. Unfortunately, this iterative approach
is computationally expensive. The goal of this work is to process top-k and range
search efficiently for a given node. Our solution, SimMat, is based on two ideas:
(1) It computes the approximate similarity of a selected node pair efficiently in
non-iterative style based on the Sylvester equation, and (2) It prunes unnecessary
approximate similarity computations when searching for the high similarity nodes
by exploiting estimations based on the Cauchy-Schwarz inequality. These two ideas
reduce the time and space complexities of the proposed approach to O(Nn)
where n is the target rank of the low-rank approximation (n << N in practice). Our
experiments show that our approach is much faster, by several orders of magnitude,
than previous approaches in finding the high similarity nodes.
Towards Efficient SimRank Computation on Large Networks
Weiren Yu (University of New South Wales/NICTA) Xuemin Lin (East China Normal
Universit /University of New South Wales) Wenjie Zhang (University of New South Wales)
SimRank has been a powerful model for assessing the similarity of pairs of vertices
in a graph. It is based on the concept that two vertices are similar if they are
referenced by similar vertices. Due to its self-referentiality, fast SimRank computation
on large graphs poses significant challenges. The state-of-the-art work exploits partial
sums memorization for computing SimRank in O(Kmn) time on a graph with n
vertices and m edges, where K is the number of iterations. Partial sums memorizing
can reduce repeated calculations by caching part of similarity summations for later
reuse. However, we observe that computations among different partial sums may
have redundancy. Besides, for a desired accuracy ε, the existing SimRank model
requires K = ⌈log_C ε⌉ iterations, where C is a damping factor. Nevertheless, such a
geometric rate of convergence is slow in practice if a high accuracy is desirable. In
this paper, we address these gaps. (1) We propose an adaptive clustering strategy
to eliminate partial sums redundancy (i.e., duplicated computations occurring in
partial sums), and devise an efficient algorithm for speeding up the computation of
SimRank to O(Kd′n²) time, where d′ is typically much smaller than the average
in-degree of a graph. (2) We also present a new notion of SimRank that is based on
a differential equation and can be represented as an exponential sum of transition
matrices, as opposed to the geometric sum of the conventional counterpart. This
leads to a further speedup in the convergence rate of SimRank iterations. (3) Using
real and synthetic data, we empirically verify that our approach of partial sums
sharing outperforms the best known algorithm by up to one order of magnitude,
and that our revised notion of SimRank further achieves a 5X speedup on large
graphs while fairly preserving the relative order of original SimRank scores.
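The iteration count quoted above follows from the geometric convergence of the SimRank fixpoint; a standard back-of-the-envelope derivation (constants omitted) is:

```latex
% The error after K iterations decays geometrically with the damping factor C,
% so requiring it to fall below the accuracy target eps fixes K:
\[
  \lvert s_K(a,b) - s(a,b) \rvert \;\lesssim\; C^{K} \;\le\; \varepsilon
  \quad\Longrightarrow\quad
  K \;=\; \lceil \log_{C} \varepsilon \rceil .
\]
% Example: C = 0.6 and eps = 0.001 give K = 14 iterations, which illustrates
% why a geometric rate is slow when high accuracy is desired.
```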
RoundTripRank: Graph-based Proximity with Importance and Specificity
Yuan Fang, Kevin Chen-Chuan Chang (University of Illinois at Urbana-Champaign /
Advanced Digital Sciences Center) Hady W. Lauw (Singapore Management University)
Graph-based proximity has many applications with different ranking needs. However,
most previous works only stress the sense of importance by finding “popular” results
for a query. Oftentimes important results are overly general without being well-tailored to the query, lacking a sense of specificity, a notion that has emerged only recently.
Even then, the two senses are treated independently, and only combined empirically.
In this paper, we generalize the well-studied importance-based random walk into
a round trip and develop RoundTripRank, seamlessly integrating specificity and
importance in one coherent process. We also recognize the need for a flexible
trade-off between the two senses, and further develop RoundTripRank+ based on
a scheme of hybrid random surfers. For efficient computation, we start with a basic
model that decomposes RoundTripRank into smaller units. For each unit, we apply
a novel two-stage bounds updating framework, enabling an online top-K algorithm
2SBound. Finally, our experiments show that RoundTripRank and RoundTripRank+
are robust over various ranking tasks, and 2SBound enables scalable online
processing.
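For context, the importance-only baseline that RoundTripRank generalizes is typically instantiated as a random walk with restart (personalized PageRank). The sketch below shows that baseline only; the transition matrix, restart probability, and toy graph are illustrative assumptions.

```python
import numpy as np

def random_walk_with_restart(P, q, alpha=0.15, iters=100):
    """Importance-style proximity: power iteration on r = alpha*q + (1-alpha)*P r,
    where P is column-stochastic and q is the restart (query) distribution."""
    r = q.copy()
    for _ in range(iters):
        r = alpha * q + (1 - alpha) * P @ r
    return r                                  # r[v]: proximity of node v to the query

# Toy 3-node path graph 0 - 1 - 2; column j holds node j's out-distribution.
P = np.array([[0.0, 0.5, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 0.5, 0.0]])
q = np.array([1.0, 0.0, 0.0])                 # querying from node 0
print(random_walk_with_restart(P, q))
```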
Research 18: Spatial Databases
3:30 - 5PM
Chair: Mourad Ouzzani (Qatar Computing Research Institute)
Bastille 1
Finding Distance-Preserving Subgraphs in Large Road Networks
Da Yan (Hong Kong University of Science and Technology) James Cheng (The Chinese
University of Hong Kong) Wilfred Ng, Steven Liu (Hong Kong University of Science and
Technology)
Given two sets of points, S and T, in a road network, G, a distance-preserving
subgraph (DPS) query returns a subgraph of G that preserves the shortest path
from any point in S to any point in T. DPS queries are important in many real-world
applications, such as route recommendation systems, logistics planning, and all kinds
of shortest-path-related applications that run on resource-limited mobile devices.
In this paper, we study efficient algorithms for processing DPS queries in large road
networks. Four algorithms are proposed with different tradeoffs in terms of DPS
quality and query processing time, and the best one is a graph-partitioning based
index, called RoadPart, that finds a high quality DPS with short response time.
Extensive experiments on large road networks demonstrate the merits of our
algorithms, and verify the efficiency of RoadPart for finding a high-quality DPS.
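To make the DPS semantics concrete, here is a naive baseline (not the paper's RoadPart index) that trivially preserves every S-to-T shortest-path distance by taking the union of one shortest path per pair; it assumes the networkx library and an edge-length attribute named "length".

```python
import networkx as nx

def naive_dps(G, S, T, weight="length"):
    """Baseline distance-preserving subgraph: the union of shortest paths.
    Exact but slow for large S and T; RoadPart targets the same guarantee
    with far less query-time work."""
    H = nx.Graph()
    for s in S:
        paths = nx.single_source_dijkstra_path(G, s, weight=weight)
        for t in T:
            p = paths[t]
            for u, v in zip(p, p[1:]):
                H.add_edge(u, v, **G[u][v])   # copy edge attributes into the DPS
    return H                                  # s-t distances in H equal those in G

G = nx.Graph()
G.add_weighted_edges_from([(0, 1, 2.0), (1, 2, 2.0), (0, 2, 5.0)], weight="length")
print(sorted(naive_dps(G, [0], [2]).edges()))  # [(0, 1), (1, 2)]; the detour edge is dropped
```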
Maximum Visibility Queries in Spatial Databases
Sarah Masud, Farhana Murtaza Choudhury, Mohammed Eunus Ali (Bangladesh
University of Engineering and Technology) Sarana Nutanong (University of Maryland)
Many real-world problems, such as the placement of surveillance cameras and the
pricing of hotel rooms with a view, require the ability to determine the visibility of a given
target object from different locations. Advances in large-scale 3D modeling (e.g., 3D
virtual cities) provide us with data that can be used to solve these problems with
high accuracy. In this paper, we investigate the problem of finding the location which
provides the best view of a target object with visual obstacles in 2D or 3D space,
for example, finding the location that provides the best view of fireworks in a city
with tall buildings. To solve this problem, we first define the quality measure of a
view (i.e., visibility measure) as the visible angular size of the target object. Then, we
propose a new query type called the k-Maximum Visibility (kMV) query, which finds
k locations from a set of locations that maximize the visibility of the target object.
Our objective in this paper is to design a query solution which is capable of handling
large-scale city models. This objective precludes the use of approaches that rely on
constructing a visibility graph of the entire data space. As a result, we propose three
approaches that incrementally consider relevant obstacles in order to determine
the visibility of a target object from a given set of locations. These approaches
differ in the order of obstacle retrieval, namely: query centric distance based, query
centric visible region based, and target centric distance based approaches. We
have conducted an extensive experimental study on real 2D and 3D datasets to
demonstrate the efficiency and effectiveness of our solutions.
Memory-Efficient Algorithms for Spatial Network Queries
Sarana Nutanong, Hanan Samet (University of Maryland)
Incrementally finding the k nearest neighbors (kNN) in a spatial network is an
important problem in location-based services. One method (INE) simply applies
Dijkstra’s algorithm. Another method (IER) computes the k nearest neighbors using
Euclidean distance followed by computing their corresponding network distances,
and then incrementally finds the next nearest neighbors in order of increasing
Euclidean distance until finding one whose Euclidean distance is greater than
the network distance of the current kth nearest neighbor. The LBC method
improves on INE by avoiding the visit of nodes that cannot possibly lead to the k
nearest neighbors by using a Euclidean heuristic estimator, and on IER by avoiding
the repeated visits to nodes in the spatial network that appear on the shortest
paths to different members of the k nearest neighbors by performing multiple
instances of heuristic search using a Euclidean heuristic estimator on candidate
objects around the query point. LBC’s drawback is that the maintenance of multiple
instances of heuristic search (called wavefronts) requires k priority queues and the
queue operations required to maintain them incur a high in-memory processing
cost. A method (SWH) is proposed that utilizes a novel heuristic function which
considers objects surrounding the query point together as a single unit, instead of
as one destination at a time as in LBC, thereby eliminating the need for multiple
wavefronts and needs just one priority queue. This results in a significant reduction
in the in-memory processing cost components while having the same reduced cost
of the access to the spatial network as LBC. SWH is also extended to support
the incremental distance semi-join (IDSJ) query, which is a multiple query point
generalization of the kNN query. In addition, SWH is shown to support landmark-based heuristic functions, thereby enabling it to be applied to non-spatial networks/
graphs such as social networks. Comparisons of experiments on SWH for kNN
queries with INE, the best single-wavefront method, show that SWH is 2.5 times
faster, and with LBC, the best existing heuristic search method, show that SWH is
3.5 times faster. For IDSJ queries, SWH-IDSJ is 5 times faster than INE-IDSJ, and 4
times faster than LBC-IDSJ.
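A minimal sketch of the INE baseline described above, which simply expands the network from the query point with Dijkstra's algorithm and reports objects in increasing network distance; the adjacency-list format and object set are illustrative assumptions.

```python
import heapq

def ine_knn(adj, objects, q, k):
    """Incremental network expansion: adj[u] -> [(v, edge_length), ...];
    objects is the set of nodes hosting objects of interest."""
    dist, heap, found = {q: 0.0}, [(0.0, q)], []
    while heap and len(found) < k:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue                          # stale queue entry
        if u in objects:
            found.append((u, d))              # objects pop in distance order
        for v, w in adj.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return found                              # [(object_node, network_distance), ...]
```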
Research 19: Social Media II
3:30 - 5PM
Chair: Tao Cheng (Microsoft)
Bastille 2
A Unified Model for Stable and Temporal Topic Detection from Social Media Data
Hongzhi Yin, Bin Cui (Peking University) Hua Lu (Aalborg University) Yuxin Huang, Junjie
Yao (Peking University)
Web 2.0 users generate and spread huge amounts of messages in online social
media. Such user-generated contents are a mixture of temporal topics (e.g., breaking
events) and stable topics (e.g., user interests). Due to their different natures, it is
important and useful to distinguish temporal topics from stable topics in social
media. However, such a discrimination is very challenging because the user-generated
texts in social media are very short in length and thus lack useful linguistic features
for precise analysis using traditional approaches. In this paper, we propose a novel
solution to detect both stable and temporal topics simultaneously from social media
data. Specifically, a unified user-temporal mixture model is proposed to distinguish
temporal topics from stable topics. To improve this model’s performance, we
design a regularization framework that exploits prior spatial information in a social
network, as well as a burst-weighted smoothing scheme that exploits temporal prior
information in the time dimension. We conduct extensive experiments to evaluate
our proposals on two real data sets obtained from Del.icio.us and Twitter. The
experimental results verify that our mixture model is able to distinguish temporal
topics from stable topics in a single detection process. Our mixture model enhanced
with the spatial regularization and the burst-weighted smoothing scheme significantly
outperforms competitor approaches, in terms of topic detection accuracy and
discrimination between stable and temporal topics.
Crowdsourced Enumeration Queries
Beth Trushkowsky, Tim Kraska, Michael Franklin, Purnamrita Sarkar (University of
California Berkeley)
Hybrid human/computer database systems promise to greatly expand the usefulness
of query processing by incorporating the crowd for data gathering and other tasks.
Such systems raise many implementation questions. Perhaps the most fundamental
issue is that the closed-world assumption underlying relational query semantics
does not hold in such systems. As a consequence, the meaning of even simple
queries can be called into question. Furthermore, query progress monitoring
becomes difficult due to non-uniformities in the arrival of crowdsourced data and
peculiarities of how people work in crowdsourcing systems. To address these issues,
we develop statistical tools that enable users and systems developers to reason
about query completeness. These tools can also help drive query execution and
crowdsourcing strategies. We evaluate our techniques using experiments on a
popular crowdsourcing platform.
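To illustrate the flavor of such completeness statistics (our illustrative assumption, not necessarily the authors' exact estimator), species-richness estimators such as the classical Chao1 predict the total number of distinct answers from the frequencies of the answers gathered so far:

```python
from collections import Counter

def chao1(answers):
    """Estimate total distinct items from sampled crowd answers: items seen
    once (f1) and twice (f2) indicate how much of the set remains unseen."""
    freq = Counter(Counter(answers).values())   # freq[i] = #items seen exactly i times
    d = sum(freq.values())                      # distinct items observed so far
    f1, f2 = freq.get(1, 0), freq.get(2, 0)
    return d + f1 * f1 / (2 * f2) if f2 else d + f1 * (f1 - 1) / 2

# 4 distinct answers seen; the singleton rate suggests ~6 exist in total.
print(chao1(["NY", "LA", "NY", "SF", "LA", "NY", "CHI"]))   # 6.0
```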
On Incentive-based Tagging
Xuan S. Yang, Reynold Cheng, Luyi Mo, Ben Kao, David W. Cheung (The University of Hong
Kong)
A social tagging system, such as del.icio.us and Flickr, allows users to annotate
resources (e.g., web pages and photos) with text descriptions called tags. Tags
have proven to be invaluable information for searching, mining, and recommending
resources. In practice, however, not all resources receive the same attention from
users. As a result, while some highly popular resources are over-tagged, most of the
resources are under-tagged. Incomplete tagging on resources severely affects the
effectiveness of all tag-based techniques and applications. We address an interesting
question: if users are paid to tag specific resources, how can we allocate incentives
to resources in a crowd-sourcing environment so as to maximize the tagging quality
of resources? We address this question by observing that the tagging quality of
a resource becomes stable after it has been tagged a sufficient number of times.
We formalize the concepts of tagging quality (TQ) and tagging stability (TS) in
measuring the quality of a resource’s tag description. We propose a theoretically
optimal algorithm given a fixed “budget” (i.e., the amount of money paid for tagging
resources). This solution decides the amount of rewards that should be invested
on each resource in order to maximize tagging stability. We further propose a few
simple, practical, and efficient incentive allocation strategies. On a dataset from
del.icio.us, our best strategy provides resources with a close-to-optimal gain in tagging
stability.
Research 20: Trees and XML
3:30 - 5PM
Chair: Chengfei Liu (Swinburne University of Technology)
Concorde
Ontology-based Subgraph Querying
Yinghui Wu, Shengqi Yang, Xifeng Yan (University of California Santa Barbara)
Subgraph querying has been applied in a variety of emerging applications. Traditional
subgraph querying based on subgraph isomorphism requires identical label matching,
which is often too restrictive to capture the matches that are semantically close
to the query graphs. This paper extends subgraph querying to identify semantically
related matches by leveraging ontology information. (1) We introduce ontology-based subgraph querying, which revises subgraph isomorphism by mapping a query
to semantically related subgraphs in terms of a given ontology graph. We introduce
a metric to measure the closeness of the matches. Based on the metric, we further
introduce an optimization problem to find the top-K closest matches. (2) We provide a
filtering-and-verification framework to identify (top-K) matches for ontology-based
subgraph queries. The framework efficiently extracts a small subgraph of the data
graph from an ontology index, and further computes the matches by only accessing
the extracted subgraph. (3) In addition, we show that the ontology index can be
efficiently updated upon the changes to the data graphs, enabling the framework to
cope with dynamic data graphs. (4) We experimentally verify the effectiveness and
efficiency of our framework using both synthetic and real life graphs, comparing with
traditional subgraph querying methods.
Stratification Driven Placement of Complex Data: A Framework for Distributed
Data Analytics
Ye Wang, Srinivasan Parthasarathy, P. Sadayappan (The Ohio State University)
With the increasing popularity of XML data stores, social networks and Web 2.0 and
3.0 applications, complex data formats such as trees and graphs are becoming
ubiquitous. Managing and processing such large and complex data stores, on
modern computational eco-systems, to realize actionable information efficiently is an
important challenge. It is our hypothesis that a critical element at the heart of this
challenge relates to the placement, storage and access of such tera- and
peta- scale data. In this work we seek to develop a generic distributed framework to
ease the burden on the programmer and propose an agile and intelligent placement
service layer as a flexible yet unified means to address this challenge. Central to
our framework is the notion of stratification, which first attempts to identify groups
of data items that are structurally (or semantically) related. Subsequently, strata are
partitioned within this eco-system according to the needs of the application to
maximize locality, balance load, or minimize data skew. Results on several real-world applications confirm the efficacy and efficiency of our approach.
Optimizing Approximations of DNF Query Lineage in Probabilistic XML
Asma Souihli, Pierre Senellart (Télécom ParisTech, CNRS LTCI)
Probabilistic XML is a probabilistic model for uncertain tree-structured data, with
applications to data integration, information extraction, or uncertain version control.
We explore in this work efficient algorithms for evaluating tree-pattern queries
with joins over probabilistic XML or, more specifically, for listing the answers to a
query along with their computed or approximated probability. The approach relies
on, first, producing the lineage query by evaluating it over the probabilistic XML
document, and, second, looking for an optimal strategy to compute the probability
of the lineage formula. This latter part relies on a query-optimizer–like approach:
exploring different evaluation plans for different parts of the formula and estimating
the cost of each plan, using a cost model for the various evaluation algorithms. We
demonstrate the efficiency of this approach on datasets used in previous research
on probabilistic XML querying, as well as on synthetic data. We also compare the
performance of our query engine with EvalDP, Trio, and MayBMS/SPROUT.
Seminar 6: Triples in the clouds
3:30 - 5PM
Odeon
Zoi Kaoudi, Ioana Manolescu (Inria Saclay - Île de France / Université Paris-Sud)
The W3C’s Resource Description Framework (or RDF, in short) is a promising
candidate which may deliver many of the original semi-structured data promises:
flexible structure, optional schema, and rich, flexible URIs as a basis for information
sharing. Moreover, RDF is uniquely positioned to benefit from the efforts of scientific
communities studying databases, knowledge representation, and Web technologies.
Many RDF data collections are being published, ranging from scientific data to general-purpose ontologies to open government data, in particular in the Linked Data
movement. Managing such large volumes of RDF data is challenging, due to the
sheer size, the heterogeneity, and the further complexity brought by RDF reasoning.
To tackle the size challenge, distributed storage architectures are required. Cloud
computing is an emerging paradigm massively adopted in many applications for the
scalability, fault-tolerance and elasticity features it provides. This tutorial discusses the
problems involved in efficiently handling massive amounts of RDF data in a cloud
environment. We provide the necessary background, analyze and classify existing
solutions, and discuss open problems and perspectives.
Thursday 11 April
Keynote 3:
9 - 10AM
Chair: Christian Jensen (Aarhus University)
Ballroom 1 & 2
Hardware Killed the Software Star
Gustavo Alonso (ETH Zürich)
Abstract: Until relatively recently, the development of data processing applications
took place largely ignoring the underlying hardware. Only in niche applications
(supercomputing, embedded systems) or in special software (operating systems,
database internals, language runtimes) did (some) programmers have to pay
attention to the actual hardware where the software would run. In most cases,
working atop the abstractions provided by either the operating system or by system
libraries was good enough. The constant improvements in processor speed did
the rest. The new millennium has radically changed the picture. Driven by multiple
needs (e.g., scale, physical constraints, energy limitations, virtualization, business
models), hardware architectures are changing at a speed and in ways that current
development practices for data processing cannot accommodate. From now on,
software will have to be built paying close attention to the underlying hardware
and following strict performance engineering principles. In this talk, several aspects
of the ongoing hardware revolution and its impact on data processing are analyzed,
pointing to the need for new strategies to tackle the challenges ahead.
Bio: Gustavo Alonso is a professor at the Department of Computer Science at
ETH Zurich in Switzerland, where he has been since 1995. At ETHZ, he is part of
the Systems Group and the Enterprise Computing Center. Gustavo has a degree
in electrical engineering from the Madrid Technical University in Spain and an
M.S. and Ph.D. in Computer Science from UC Santa Barbara. Before joining ETH,
he worked at the IBM Almaden Research Center. Gustavo’s research interests
encompass almost all aspects of systems, from design to run time. Most of his
research these days is related to multi-core architectures, large clusters, FPGAs, and
cloud computing, with an emphasis on adapting traditional system software (OS,
database, middleware) to these new hardware platforms. Gustavo is a Fellow of the
ACM and Senior Member of the IEEE. He has been awarded the AOSD 2012 Most
Influential Paper Award, the VLDB 2010 Ten Year Best Paper Award, and the ICDCS
2009 Best Paper Award for work on Remote Direct Memory Access. He has served
in the VLDB Endowment, the ACM/IFIP/IEEE Middleware Steering Committee, as
an associate editor of the VLDB Journal, as Chair of EuroSys, and as general chair
or PC-chair/vice-chair in numerous conferences (VLDB, ICDE, Middleware, BPM,
ICDCS, IEEE MDM).
Research 21: Security and privacy
10:30 - 12PM
Chair: Graham Cormode (AT&T Labs-Research)
St Germaine
Secure Nearest Neighbor Revisited
Bin Yao (Shanghai Jiao Tong University) Feifei Li (University of Utah) Xiaokui Xiao
(Nanyang Technological University)
In this paper, we investigate the secure nearest neighbor (SNN) problem, in
which a client issues an encrypted query point E(q) to a server and asks for an
encrypted data point in E(D) that is closest to the query point, without allowing
the server to learn the plaintexts of the data or the query (and its result). We
show that efficient attacks exist for existing SNN methods, even though they were
claimed to be secure in standard security models (such as indistinguishability under
chosen plaintext or ciphertext attacks). We also establish a relationship between
the SNN problem and the order-preserving encryption (OPE) problem from
the cryptography field, and we show that SNN is at least as hard as OPE. Since
it is impossible to construct secure OPE schemes in standard security models,
our results imply that one cannot expect to find the exact (encrypted) nearest
neighbor based on only E(q) and E(D). Given this hardness result, we design new
SNN methods by asking the server, given only E(q) and E(D), to return a relevant
(encrypted) partition E(G) from E(D) (i.e., G ⊆ D), such that E(G) is guaranteed
to contain the answer for the SNN query. Our methods provide a customizable
tradeoff between efficiency and communication cost, and they are as secure as the
encryption scheme E used to encrypt the query and the database, where E can be
any well-established encryption schemes.
Accurate and Efficient Private Release of Datacubes and Contingency Tables
Grigory Yaroslavtsev (Pennsylvania State University) Graham Cormode, Cecilia M.
Procopiuc, Divesh Srivastava (AT&T Labs - Research)
A central problem in releasing aggregate information about sensitive data is to
do so accurately while providing a privacy guarantee on the output. Recent work
focuses on the class of linear queries, which include basic counting queries, data
cubes, and contingency tables. The goal is to maximize the utility of their output,
while giving a rigorous privacy guarantee. Most results follow a common template:
pick a ``strategy’’ set of linear queries to apply to the data, then use the noisy
answers to these queries to reconstruct the queries of interest. This entails either
picking a strategy set that is hoped to be good for the queries, or performing a
costly search over the space of all possible strategies. However, once the strategy is
fixed, its evaluation can be done efficiently, using standard linear algebraic methods.
In this paper, we propose a new approach that balances accuracy and efficiency: we
show how to optimize the accuracy of a given strategy by answering some strategy
queries more accurately than others, based on the target queries. This leads to an
efficient optimal noise allocation for many popular strategies, including wavelets,
hierarchies, Fourier coefficients and more. For the important case of marginal
queries (equivalently, subsets of the data cube), we show that this strictly improves
on previous methods, both analytically and empirically. Our results also extend to
ensuring that the returned query answers are consistent with an (unknown) data set
at minimal extra cost in terms of time and noise.
Differentially Private Grids for Geospatial Data
Wahbeh Qardaji, Weining Yang, Ninghui Li (Purdue University)
In this paper, we tackle the problem of constructing a differentially private synopsis
for two-dimensional datasets such as geospatial datasets. The current state-of-the-art
methods work by performing recursive binary partitioning of the data domains, and
constructing a hierarchy of partitions. We show that the key challenge in partition-based synopsis methods lies in choosing the right partition granularity to balance
the noise error and the non-uniformity error. We study the uniform-grid approach,
which applies an equi-width grid of a certain size over the data domain and then
issues independent count queries on the grid cells. This method has received
no attention in the literature, probably due to the fact that no good method for
choosing a grid size was known. Based on an analysis of the two kinds of errors,
we propose a method for choosing the grid size. Experimental results validate our
method, and show that this approach performs as well as, and oftentimes better
than, the state-of-the-art methods. We further introduce a novel adaptive-grid
method. The adaptive grid method lays a coarse-grained grid over the dataset,
and then further partitions each cell according to its noisy count. Both levels of
partitions are then used in answering queries over the dataset. This method exploits
the need to have finer granularity partitioning over dense regions and, at the same
time, coarse partitioning over sparse regions. Through extensive experiments
on real-world datasets, we show that this approach consistently and significantly
outperforms the uniform-grid method and other state-of-the-art methods.
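As a minimal sketch of the uniform-grid method described above: each of the m × m cell counts is released with Laplace(1/ε) noise, which is ε-differentially private because adding or removing one record changes exactly one count by one. The grid size m is left as a parameter here; the paper's contribution is precisely a principled rule for choosing it.

```python
import numpy as np

def uniform_grid_synopsis(points, domain, m, eps):
    """Lay an m x m equi-width grid over a 2-D domain and release noisy counts."""
    (x0, x1), (y0, y1) = domain
    counts = np.zeros((m, m))
    for x, y in points:
        i = min(int((x - x0) / (x1 - x0) * m), m - 1)
        j = min(int((y - y0) / (y1 - y0) * m), m - 1)
        counts[i, j] += 1
    return counts + np.random.laplace(scale=1.0 / eps, size=(m, m))

noisy = uniform_grid_synopsis([(0.1, 0.2), (0.8, 0.9)], ((0, 1), (0, 1)), m=4, eps=0.5)
# A range query is answered by summing the noisy cells it covers: noise error
# grows with m, while non-uniformity error shrinks with m.
```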
Research 22: Randomized Algorithms for Graphs
10:30 - 12PM
Chair: Yinghui Wu (University of California Santa Barbara)
Bastille 1
Faster Random Walks By Rewiring Online Social Networks On-The-Fly
Zhuojie Zhou, Nan Zhang (George Washington University) Zhiguo Gong (University
of Macau) Gautam Das (University of Texas at Arlington / Qatar Computing Research
Institute)
Many online social networks feature restrictive web interfaces which only allow the
query of a user’s local neighborhood through the interface. To enable analytics over
such an online social network through its restrictive web interface, many recent
efforts reuse the existing Markov Chain Monte Carlo methods such as random
walks to sample the social network and support analytics based on the samples.
The problem with such an approach, however, is the large amount of queries
often required (i.e., a long “mixing time”) for a random walk to reach a desired
(stationary) sampling distribution. In this paper, we consider a novel problem of
enabling a faster random walk over online social networks by "rewiring" the social
network on-the-fly. Specifically, we develop Modified TOpology (MTO)-Sampler
which, by using only information exposed by the restrictive web interface, constructs
a ``virtual’’ overlay topology of the social network while performing a random walk,
and ensures that the random walk follows the modified overlay topology rather
than the original one. We show that MTO-Sampler not only provably enhances the
efficiency of sampling, but also achieves significant savings on query cost over real-world online social networks such as Google Plus and Epinions.
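For context, the classical Metropolis-Hastings random walk is the kind of MCMC sampler such work builds on; the sketch below (a standard baseline, not MTO-Sampler) uses only a node's neighbor list, exactly what a restrictive web interface exposes, and reweights moves so the stationary distribution is uniform rather than degree-biased.

```python
import random

def mh_random_walk(neighbors, start, steps):
    """neighbors(u) -> list of u's neighbors (one interface query per call).
    Accepting u -> v with probability min(1, deg(u)/deg(v)) yields a uniform
    stationary distribution over nodes."""
    u, samples = start, []
    for _ in range(steps):
        v = random.choice(neighbors(u))
        if random.random() < min(1.0, len(neighbors(u)) / len(neighbors(v))):
            u = v                             # accept the move
        samples.append(u)                     # else re-sample the current node
    return samples
```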
Sampling Node Pairs Over Large Graphs
Pinghui Wang (The Chinese University of Hong Kong) Junzhou Zhao (Xi'an Jiaotong
University) John C.S. Lui (The Chinese University of Hong Kong) Don Towsley (University of
Massachusetts Amherst) Xiaohong Guan (Xi'an Jiaotong University / Tsinghua University)
Characterizing user pair relationships is important for applications such as friend
recommendation and interest targeting in online social networks (OSNs). Due to
the large-scale nature of such networks, it is infeasible to enumerate all user pairs
and so sampling is used. In this paper, we show that it is a great challenge even
for OSN service providers to characterize user pair relationships even when they
possess the complete graph topology. The reason is that when sampling techniques
(i.e., uniform vertex sampling (UVS) and random walk (RW)) are naively applied,
they can introduce large biases, in particular, for estimating similarity distribution of
user pairs with constraints such as existence of mutual neighbors, which is important
for applications such as identifying network homophily. Estimating statistics of user
pairs is more challenging in the absence of the complete topology information, since
an unbiased sampling technique such as UVS is usually not allowed, and exploring
the OSN graph topology is expensive. To address these challenges, we present
asymptotically unbiased sampling methods to characterize user pair properties
based on UVS and RW techniques respectively. We carry out an evaluation of our
methods to show their accuracy and efficiency. Finally, we apply our methods to two
Chinese OSNs, Douban and Xiami, and discover that significant homophily is present in
these two networks.
Link Prediction across Networks by Biased Cross-Network Sampling
Guo-Jun Qi (University of Illinois at Urbana-Champaign) Charu C. Aggarwal (IBM T.J.
Watson Research Center) Thomas Huang (University of Illinois at Urbana-Champaign)
The problem of link inference has been widely studied in a variety of social
networking scenarios. In this problem, we wish to predict future links in a growing
network with the use of the existing network structure. However, most of the
existing methods work well only if a significant number of links are already available
in the network for the inference process. In many scenarios, the existing network
may be too sparse, and may have too few links to enable meaningful learning
mechanisms. This paucity of linkage information can be challenging for the link
inference problem. However, in many cases, other (more densely linked) networks
may be available which show similar linkage structure in terms of underlying
attribute information in the nodes. The linkage information in the existing networks
can be used in conjunction with the node attribute information in both networks
in order to make meaningful link recommendations. Thus, this paper introduces the
use of transfer learning methods for performing cross-network link inference. We
present experimental results illustrating the effectiveness of the approach.
Research 23: Distributed Data Processing
10:30 - 12PM
Chair: Tyson Condie (Microsoft)
Bastille 2
Interval Indexing and Querying on Key-Value Cloud Stores
George Sfakianakis, Ioannis Patlakas, Nikos Ntarmos, Peter Triantafillou (University of
Patras)
Cloud key-value stores are becoming increasingly more important. Challenging
applications, requiring efficient and scalable access to massive data, arise every
day. We focus on supporting interval queries (which are prevalent in several data
intensive applications, such as temporal querying for temporal analytics), an efficient
solution for which is lacking. We contribute a compound interval index structure,
comprised of two tiers: (i) the MRSegmentTree (MRST), a key-value representation
of the Segment Tree, and (ii) the Endpoints Index (EPI), a column family index that
stores information for interval endpoints. In addition to the above, our contributions
include: (i) algorithms for efficiently constructing and populating our indices using
MapReduce jobs, (ii) techniques for efficient and scalable index maintenance, and (iii)
algorithms for processing interval queries. We have implemented all algorithms using
HBase and Hadoop, and conducted a detailed performance evaluation. We quantify
the costs associated with the construction of the indices, and evaluate our query
processing algorithms using queries on real data sets. We compare the performance
of our approach to two alternatives: the native support for interval queries provided
in HBase, and the execution of such queries using the Hive query execution tool.
Our results show a significant speedup, far outperforming the state of the art.
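To illustrate the segment-tree idea behind MRST, here is a minimal in-memory stabbing-query sketch; intervals are registered on O(log n) canonical tree nodes, so a query touches one root-to-leaf path. (MRST's contribution, serializing these nodes as key-value rows and building them with MapReduce, is not shown; the interval data is illustrative.)

```python
class SegmentTree:
    def __init__(self, intervals):
        # Leaves are the sorted, distinct interval endpoints.
        self.xs = sorted({x for iv in intervals for x in iv})
        self.tree = [[] for _ in range(4 * len(self.xs))]
        for iv in intervals:
            self._insert(1, 0, len(self.xs) - 1, iv)

    def _insert(self, node, lo, hi, iv):
        if iv[0] <= self.xs[lo] and self.xs[hi] <= iv[1]:
            self.tree[node].append(iv)        # node's range is fully covered
            return
        if lo == hi or iv[1] < self.xs[lo] or self.xs[hi] < iv[0]:
            return                            # leaf reached or disjoint
        mid = (lo + hi) // 2
        self._insert(2 * node, lo, mid, iv)
        self._insert(2 * node + 1, mid + 1, hi, iv)

    def stab(self, q):
        """Return all intervals containing q by walking one root-to-leaf path."""
        out, node, lo, hi = [], 1, 0, len(self.xs) - 1
        while True:
            out += [iv for iv in self.tree[node] if iv[0] <= q <= iv[1]]
            if lo == hi:
                return out
            mid = (lo + hi) // 2
            node, lo, hi = (2 * node, lo, mid) if q <= self.xs[mid] \
                           else (2 * node + 1, mid + 1, hi)

st = SegmentTree([(1, 5), (4, 9), (7, 8)])
print(st.stab(4.5))                           # [(1, 5), (4, 9)]
```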
Robust Distributed Stream Processing
Chuan Lei, Elke A. Rundensteiner, Joshua D. Guttman (Worcester Polytechnic Institute)
Distributed stream processing systems must function efficiently for data streams that
fluctuate in their arrival rates and data distributions. Yet repeated and prohibitively
expensive load re-allocation across machines may make these systems ineffective,
potentially resulting in data loss or even system failure. To overcome this problem,
we instead propose a robust load distribution (RLD) strategy that is resilient to data
fluctuations. RLD provides ε-optimal query performance under load fluctuations
without suffering from the performance penalty caused by load migration. RLD is
based on three key strategies. First, we model robust distributed stream processing
as a parametric query optimization problem. The notions of robust logical and
robust physical plans then are overlays of this parameter space. Second, our Early-terminated Robust Partitioning (ERP) finds a set of robust logical plans, covering the
parameter space, while minimizing the number of prohibitively expensive optimizer
calls with a probabilistic bound on the space coverage. Third, our OptPrune
algorithm maps the space-covering logical solution to a single robust physical plan
tolerant to deviations in data statistics that maximizes the parameter space coverage
at runtime. Our experimental study using stock market and sensor networks
streams demonstrates that our RLD methodology consistently outperforms state-of-the-art solutions in terms of efficiency and effectiveness in highly fluctuating data
stream environments.
Research 24: Data Mining II
10:30 - 12PM
Chair: Jian Pei (Simon Fraser University)
Concorde
Learning to Rank from Distant Supervision: Exploiting Noisy Redundancy for
Relational Entity Search
Mianwei Zhou, Hongning Wang, Kevin Chen-Chuan Chang (University of Illinois at
Urbana-Champaign / Advanced Digital Sciences Center)
In this paper, we propose to study the task of relational entity search which aims
at automatically learning an entity ranking function for a desired relation. To rank
the entities, we exploit the redundancy buried in their snippets; however, such
redundancy is noisy as not all the snippets represent information relevant to the
desired relation. To explore useful information from such noisy redundancy, we
abstract the task as a distantly supervised ranking problem: based on coarse
entity-level annotations, deriving a relation-specific ranking function for online
searching purpose. As the key challenge, without detailed snippet-level annotations,
we have to filter noisy snippets for estimating an accurate ranking function;
furthermore, the ranking function should also be online executable. We develop
Pattern-based Filter Network (PFNet), a novel probabilistic graphical model, as our
solution. To balance accuracy and efficiency requirements, PFNet selects a limited
number of indicative patterns to filter noisy snippets, and the inverted index is utilized to
retrieve required features. Experiments on the large-scale ClueWeb09 data set for
six different relations confirm the effectiveness of the proposed PFNet model, which
outperforms five state-of-the-art relational entity ranking methods.
AFFINITY: Efficiently Querying Statistical Measures on Time-Series Data
Saket Sathe, Karl Aberer (École Polytechnique Fédérale de Lausanne)
Computing statistical measures for large databases of time series is a fundamental
primitive for querying and mining time-series data. This primitive is gaining
importance with the increasing number and rapid growth of time series databases.
In this paper, we introduce a framework for efficient computation of statistical
measures by exploiting the concept of affine relationships. Affine relationships
can be used to infer statistical measures for time series, from other related time
series, instead of computing them directly; thus, reducing the overall computational
cost significantly. The resulting methods exhibit at least one order of magnitude
improvement over the best known methods. To the best of our knowledge, this is
the first work that presents a unified approach for computing and querying several
statistical measures at once. Our approach exploits affine relationships using three
key components. First, the AFCLST algorithm clusters the time-series data, such
that high-quality affine relationships could be easily found. Second, the SYMEX
algorithm uses the clustered time series and efficiently computes the desired affine
relationships. Third, the SCAPE index structure produces a many-fold improvement
in the performance of processing several statistical queries by seamlessly indexing
the affine relationships. Finally, we establish the effectiveness of our approaches by
performing comprehensive experimental evaluation on real datasets.
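The identity that makes affine relationships useful is elementary: if one series is an affine transform of another, its statistical measures follow in constant time from measures already computed (a standard fact, shown here for the common cases):

```latex
% If y_t = a x_t + b for all t, then, without ever scanning y:
\[
  \mu_y = a\,\mu_x + b, \qquad
  \sigma_y^{2} = a^{2}\,\sigma_x^{2}, \qquad
  \operatorname{corr}(y, z) = \operatorname{sign}(a)\,\operatorname{corr}(x, z).
\]
% Example: if x has mean 3 and variance 2 and y_t = 2 x_t + 1,
% then y has mean 7 and variance 8.
```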
Forecasting the Data Cube: A Model Configuration Advisor for Multi-Dimensional
Data Sets
Ulrike Fischer, Christopher Schildt, Claudio Hartmann, Wolfgang Lehner (Dresden
University of Technology)
Forecasting time series data is crucial in a number of domains such as supply chain
management and display advertisement. In these areas, the time series to forecast is
typically organized along multiple dimensions leading to a high number of time series
that need to be forecasted. Most current approaches focus only on selecting and
optimizing a forecast model for a single time series. In this paper, we explore how
we can utilize time series at different dimensions to increase forecast accuracy and,
optionally, reduce model maintenance overhead. Solving this problem is challenging
due to the large space of possibilities and potentially high model creation costs. We
propose a model configuration advisor that automatically determines the best
set of models, a model configuration, for a given multi-dimensional data set. Our
approach is based on a general process that iteratively examines more and more
models and simultaneously controls the search space depending on the data set,
model type and available hardware. The final model configuration is integrated into
F2DB, an extension of PostgreSQL, that processes forecast queries and maintains
the configuration as new data arrives. We comprehensively evaluated our approach
on real and synthetic data sets. The evaluation shows that our approach significantly
increases forecast query accuracy while ensuring low model costs.
Seminar 7: Querying Encrypted Data
10:30 - 12PM
Odeon
Arvind Arasu, Ken Eguro, Raghav Kaushik, Ravi Ramamurthy (Microsoft Research)
Data security is a serious concern when we migrate data to a cloud DBMS.
Database encryption, where sensitive columns are encrypted before they are stored
in the cloud, has been proposed as a mechanism to address such data security
concerns. The intuitive expectation is that an adversary cannot “learn” anything
about the encrypted columns, since she does not have access to the encryption key.
However, query processing becomes a challenge since it needs to “look inside” the
data. This tutorial explores the space of designs studied in prior work on processing
queries over encrypted data. We cover approaches based on the classic client-server model as well as those involving the use of a trusted hardware module where data can be
securely decrypted. We discuss the privacy challenges that arise in both approaches
and how they may be addressed. Briefly, supporting the full complexity of a modern
DBMS including complex queries, transactions and stored procedures leads to
significant challenges that we survey.
Panel: Big Data for the Public
10:30 - 12PM
Moderator: Dimitrios Georgakopoulos (CSIRO, Australia)
Ballroom 1-2
Panelists: Karl Aberer (École Polytechnique Fédérale de Lausanne), Ashraf Aboulnaga (University of Waterloo), Kevin Chang (University of Illinois at Urbana-Champaign), Xin Luna Dong (Google Inc.)
While data are now being produced and collected on unprecedented scales, most
of the “big data” remain inaccessible or difficult to use by the public. For example,
companies fervently guard the data they collect despite the potential for greater
public good. Lots of government data are supposedly public, but they are hard to
access or analyze. Even if data are readily accessible (such as the Web), obtaining reliable
information from noisy, high-volume, and heterogeneous data sources remains a
daunting task for the majority of the public. This panel is about the challenges in
fully realizing big data’s potential impact on the public. Topics of interest include
data quality, integration, privacy, as well as infrastructure, platform, and application
support for making access and analysis easier. Panelists will also discuss issues in
the practice of big-data research: how limited access to real data and workloads
affects reproducibility and robustness of research results, and how we can measure
research success and impact on the public.
Demo Groups 3 & 4
10:30 - 12PM
Ballroom 3
Πgora: An Integration System for Probabilistic Data
Dan Olteanu, Lampros Papageorgiou, Sebastiaan J. van Schaik (University of Oxford)
Πgora is an integration system for probabilistic data modelled using different
formalisms such as pc-tables, Bayesian networks, and stochastic automata. User
queries are expressed over a global relational layer and are evaluated by Πgora
using a range of strategies, including data conversion into one probabilistic formalism
followed by evaluation using an engine for that formalism, and hybrid plans, where
subqueries are evaluated using engines for different formalisms. This demonstration
allows users to experience Πgora on real-world heterogeneous data sources from
the medical domain.
Complex Pattern Matching in Complex Structures: the XSeq Approach
Kai Zeng, Mohan Yang (University of California, Los Angeles) Barzan Mozafari
(Massachusetts Institute of Technology) Carlo Zaniolo (University of California, Los
Angeles)
There is much current interest in applications of complex event processing over
data streams and of complex pattern matching over stored sequences. While
some applications use streams of flat records, XML and various semi-structured
information formats are preferred by many others—in particular, applications that
deal with domain science, social networks, RSS feeds, and finance. XSeq and its
system improve complex pattern matching technology significantly, both in terms of
expressive power and efficient implementation. XSeq achieves higher expressiveness
through an extension of XPath based on Kleene-* pattern constructs, and achieves
very efficient execution, on both stored and streaming data, using Visibly Pushdown
Automata (VPA). In our demo, we will (i) show examples of XSeq in different
application domains, (ii) explain its compilation/query optimization techniques and
show the speed-ups they deliver, and (iii) demonstrate how powerful and efficient
application-specific languages were implemented by superimposing simple “skins” on
XSeq and its system.
T-Music: A Melody Composer based on Frequent Pattern Mining
Cheng Long, Raymond Chi-Wing Wong, Raymond Ka Wai Sze (The Hong Kong University
of Science and Technology)
There is a bulk of studies proposing algorithms for composing the melody of
a song automatically, which is known as algorithmic composition.
To the best of our knowledge, none of them took the lyric into consideration for
melody composition. However, according to some recent studies, within a song,
there usually exists a certain extent of correlation between its melody and its lyric.
In this demonstration, we propose to utilize this type of correlation information
for melody composition. Based on this idea, we design a new melody composition
algorithm and develop a melody composer called T-Music which employs this
composition algorithm.
SHARE: Secure Information Sharing Framework for Emergency Management
Barbara Carminati, Elena Ferrari, Michele Guglielmi (University of Insubria)
9/11, Katrina, Fukushima and other recent emergencies demonstrate the need
for effective information sharing across government agencies as well as non-governmental and private organizations to assess emergency situations, and generate
proper response plans. In this demo, we present a system to enforce timely and
controlled information sharing in emergency situations. The framework is able to
detect emergencies, enforce temporary access control policies and obligations to
be activated during emergencies, simulate emergency situations for demonstrational
purposes and show statistical results related to emergency activation/deactivation
and consequent access control policies triggering.
KORS: Keyword-aware Optimal Route Search System
Xin Cao, Lisi Chen, Gao Cong (Nanyang Technological University) Jihong Guan (Tongji
University) Nhan-Tue Phan, Xiaokui Xiao (Nanyang Technological University)
We present the Keyword-aware Optimal Route Search System (KORS), which
efficiently answers the KOR queries. A KOR query is to find a route such that it
covers a set of user-specified keywords, a specified budget constraint is satisfied, and
an objective score of the route is optimized. Consider a tourist who wants to spend
a day exploring a city. The user may issue the following KOR query: “a most popular
route such that it passes by shopping mall, restaurant, and pub, and the travel time
to and from her hotel is within 4 hours.” KORS provides browser-based interfaces
for desktop and laptop computers and provides a client application for mobile
devices as well. The interfaces and the client enable users to formulate queries and
view the query results on a map. Queries are then sent to the server for processing
via the HTTP POST operation. Since answering a KOR query is NP-hard, we devise
two approximation algorithms with provable performance bounds and one greedy
algorithm to process the KOR queries in our KORS prototype. We use two real-world
datasets to demonstrate the functionality and performance of this system.
CrowdPlanr: Planning Made Easy with Crowd
Ilia Lotosh, Tova Milo, Slava Novgorodov (Tel-Aviv University)
Recent research has shown that crowd sourcing can be used effectively to solve
problems that are difficult for computers, e.g., optical character recognition and
identification of the structural configuration of natural proteins. In this demo we
propose to use the power of the crowd to address yet another difficult problem
that frequently occurs in daily life: planning a sequence of actions when the goal
is hard to formalize. For example, planning the sequence of places/attractions to
visit in the course of a vacation, where the goal is to enjoy the resulting vacation
the most, or planning the sequence of courses to take in academic schedule
planning, where the goal is to obtain solid knowledge of a given subject domain.
Such goals may be easily understandable by humans, but hard or even impossible to
formalize for a computer. We present a novel algorithm for efficiently harnessing the
crowd to assist in solving such planning problems. The algorithm builds the desired
plans incrementally, optimally choosing at each step the “best” questions so that the
overall number of questions that need to be asked is minimized. We demonstrate
the effectiveness of our solution in CrowdPlanr, a system for vacation travel planning.
Given a destination, dates, preferred activities and other constraints CrowdPlanr
employs the crowd to build a vacation plan (sequence of places to visit) that is
expected to maximize the “enjoyment” of the vacation.
ASVTDETECTOR: A Practical Near Duplicate Video Retrieval System
Xiangmin Zhou (CSIRO ICT Center) Lei Chen (Hong Kong University of Science and
Technology)
In this paper, we present a system, named ASVTDETECTOR, to retrieve near-duplicate videos with large variations based on a 3D structure tensor model,
named ASVT series, over the local descriptors of video segments. Different from
the traditional global feature-based video detection systems that incur severe
information loss, the ASVT model is built over the local descriptor set of each video
segment, keeping the robustness of local descriptors. Meanwhile, unlike the
traditional local feature-based methods that suffer from the high cost of pairwise descriptor comparison, the ASVT model describes a video segment as a 3D
structure tensor that is actually a 3×3 matrix, obtaining high retrieval efficiency. In
this demonstration, we show that, given a clip, our ASVTDETECTOR system can
effectively find the near-duplicates with large variations from a large collection in real
time.
YumiInt - A Deep Web Integration System for Local Search Engines for Geo-referenced Objects
Eduard Dragut (Purdue University) B. P. Beirne, A. Neyestani, B. Atassi, Clement Yu, Bhaskar
DasGupta (University of Illinois at Chicago) Weiyi Meng (Binghamton University)
We present YumiInt, a deep Web integration system for local search engines for
Geo-referenced objects. YumiInt consists of two systems: YumiDev and YumiMeta.
YumiDev is an off-line integration system that builds the key components (e.g., query
translation and entity resolution) of YumiMeta. YumiMeta is the Web application to
which users post queries. It translates queries to multiple sources and gets back
aggregated lists of results. We present the two systems in this paper.
A Demonstration of the G* Graph Database System
Sean R. Spillane, Jeremy Birnbaum, Daniel Bokser, Daniel Kemp, Alan Labouseur, Paul W.
Olsen Jr., Jayadevan Vijayan, Jeong-Hyon Hwang (University at Albany - State University of
New York), Jun-Weon Yoon (KISTI Supercomputing Center)
The world is full of evolving networks, many of which can be represented by a
series of large graphs. Neither the current graph processing systems nor database
systems can efficiently store and query these graphs due to their lack of support
for managing multiple graphs and lack of essential graph querying capabilities. We
propose to demonstrate our system, G*, that meets the new challenges of managing
multiple graphs and supporting fundamental graph querying capabilities. G* can
store graphs on a large number of servers while compressing these graphs based
on their commonalities. G* also allows users to easily express queries on graphs
and efficiently executes these queries by sharing computations across graphs. During
our demonstrations, conference attendees will run various analytic queries on
large, practical data sets. These demonstrations will highlight the convenience and
performance benefits of G* over existing database and graph processing systems,
the effectiveness of sharing in graph data storage and processing, as well as G*’s
scalability.
RECODS: Replica Consistency-On-Demand Store
Yuqing Zhu (Tsinghua University) Philip S. Yu (University of Illinois at Chicago) Jianmin
Wang (Tsinghua University)
Replication is critical to the scalability, availability and reliability of large-scale
systems. The trade-off of replica consistency vs. response latency has been widely
understood for large-scale stores with replication. The weak consistency guaranteed
by existing large-scale stores complicates application development, while the strong
consistency hurts application performance. It is desirable that the best consistency
be guaranteed for a tolerable response latency, but none of the existing large-scale
stores supports maximized replica consistency within a given latency constraint.
In this demonstration, we showcase RECODS (REplica Consistency-On-Demand
Store), a NoSQL store implementation that can finely control the trade-off on an
operation basis and thus facilitate application development with on-demand replica
consistency. With RECODS, developers can specify the tolerable latency for each
read/write operation. Within the specified latency constraint, a response will be
returned and the replica consistency maximized. The RECODS implementation is
based on Cassandra, an open source NoSQL store, but with a different operation
execution process, replication process and in-memory storage hierarchy.
SODIT: An Innovative System for Outlier Detection using Multiple Localized
Thresholding and Interactive Feedback
Ji Zhang, Hua Wang, Xiaohui Tao, Lili Sun (University of Southern Queensland)
Outlier detection is an important long-standing research problem in data mining
and has enjoyed a wide range of applications in business, engineering,
biology, security, etc. However, traditional outlier detection methods
inevitably need to use different parameters for detection such as those used to
specify the distance or density cutoff for distinguishing outliers from normal data
points. Using a trial-and-error approach, traditional outlier detection methods
are rather tedious in parameter tuning. In this demo proposal, we introduce an
innovative outlier detection system, called SODIT, that uses localized thresholding
to assist the value specification of the thresholds that reflect closely the local
data distribution. In addition, easy-to-use user feedback is employed to further
facilitate the determination of optimal parameter values. SODIT is able to make
outlier detection much easier to operate and produce more accurate, intuitive and
informative results than before.
COLA: A Cloud-based System for Online Aggregation
Yantao Gan, Xiaofeng Meng, Yingjie Shi (Renmin University of China)
Online aggregation is a promising solution to achieving fast early responses for
interactive ad-hoc queries that compute aggregates on massive data. To process
large datasets on large-scale computing clusters, MapReduce has been introduced
as a popular paradigm into many data analysis applications. However, typical
MapReduce implementations are not well-suited to analytic tasks, since they are
geared towards batch processing. With the increasing popularity of ad-hoc analytic
query processing over enormous datasets, processing aggregate queries using
MapReduce in an online fashion is therefore an emerging important application
need. We present a MapReduce-based online aggregation system called COLA,
which provides progressive approximate aggregate answers for both single table
and multiple joined tables. COLA provides an online aggregation execution engine
with novel sampling techniques to support incremental and continuous computing
of aggregation, and minimize the waiting time before an acceptably precise
estimate is available. In addition, user-friendly SQL queries are supported in COLA.
Furthermore, COLA can implicitly convert non-OLA jobs into online versions so that
users don’t have to write any special-purpose code to make estimates.
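As a toy illustration of the online-aggregation contract (a running estimate whose confidence interval narrows as tuples stream in), the sketch below maintains a running AVG with a CLT-based interval; it shows the idea only and is not COLA's MapReduce sampling machinery.

```python
import math, random

def online_avg(stream, z=1.96):
    """Yield (n, estimate, half_width): estimate +/- half_width is an
    approximate 95% confidence interval via Welford's running variance."""
    n = mean = m2 = 0.0
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n > 1:
            yield n, mean, z * math.sqrt(m2 / (n - 1) / n)

for n, est, hw in online_avg(random.gauss(100, 15) for _ in range(10000)):
    if n % 2500 == 0:
        print(f"n={int(n):5d}  AVG ~= {est:.2f} +/- {hw:.2f}")
```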
RoadAlarm: A Spatial Alarm System on Road Networks
Kisung Lee, Emre Yigitoglu, Ling Liu, Binh Han, Balaji Palanisamy, Calton Pu (Georgia
Institute of Technology)
Spatial alarms are a fundamental functionality for many location-based services (LBSs). We argue that
spatial alarms should be road network aware as mobile objects travel on spatially
constrained road networks or walk paths. In this software system demonstration, we
will present the first prototype system of RoadAlarm - a spatial alarm processing
system for moving objects on road networks. The demonstration system of
RoadAlarm focuses on the three unique features of RoadAlarm system design. First,
we will show that the road network distance-based spatial alarm is best modeled
using road network distance such as segment length-based and travel time-based
distance. Thus, a road network spatial alarm is a star-like subgraph centered at the
alarm target. Second, we will show the suite of RoadAlarm optimization techniques
to scale spatial alarm processing by taking into account spatial constraints on road
networks and mobility patterns of mobile subscribers. Third, we will show that by
equipping the RoadAlarm system with an activity monitoring-based control panel,
we are able to enable the system administrator and the end users to visualize road
network-based spatial alarms, mobility traces of moving objects and dynamically
make selection or customization of the RoadAlarm techniques for spatial alarm
processing through graphical user interface. We show that the RoadAlarm system
provides both the general system architecture and the essential building blocks for
location-based advertisements and location-based reminders.
Real-time Abnormality Detection System for Intensive Care Management
Guangyan Huang, Jing He (Victoria University) Jie Cao (Nanjing University of Finance and
Economics) Zhi Qiao (Chinese Academy of Sciences / Victoria University) Michael Steyn
(Royal Brisbane and Women's Hospital / Victoria University) Kersi Taraporewalla (Royal
Brisbane and Women's Hospital)
Detecting abnormalities from multiple correlated time series is valuable to those
applications where a credible real-time event prediction system will minimize
economic losses (e.g. stock market crash) and save lives (e.g. medical surveillance
in the operating theatre). For example, in an intensive care scenario, anesthetists
play a vital role in monitoring the patient and adjusting the flow and type of
anesthetics during an operation. An early awareness of possible
complications is vital for an anesthetist to correctly react to a given situation. In
this demonstration, we provide a comprehensive medical surveillance system to
effectively detect abnormalities from multiple physiological data streams for assisting
online intensive care management. In particular, a novel online support vector
regression (OSVR) algorithm is developed to discover abnormalities from multiple
correlated time series with accuracy and real-time efficiency. We also utilize
historical data streams to optimize the precision of the OSVR algorithm. Moreover,
the system provides a friendly user interface that integrates multiple
physiological data streams and visualizes abnormality alarms.
Research 25: Lineage and Provenance
1:30 - 3PM
Chair: Zach Ives (University of Pennsylvania)
St Germaine
SubZero: a Fine-Grained Lineage System for Scientific Databases
Eugene Wu, Samuel Madden, Michael Stonebraker (Massachusetts Institute of
Technology)
Data lineage is a key component of provenance that helps scientists track and query
relationships between input and output data. While current systems readily support
lineage relationships at the file or data array level, finer-grained support at an
array-cell level is impractical due to the lack of support for user-defined
operators and the high runtime and storage overhead of storing such lineage. We
interviewed scientists in several domains to identify a set of common semantics
that can be leveraged to efficiently store fine-grained lineage. We use these
insights to define lineage
representations that efficiently capture common locality properties in the lineage
data, and a set of APIs so operator developers can easily export lineage information
from user defined operators. Finally, we introduce two benchmarks derived from
astronomy and genomics, and show that our techniques can reduce lineage query
costs by up to 10× while incurring substantially less impact on workflow runtime
and storage.
Logical Provenance in Data-Oriented Workflows
Robert Ikeda, Akash Das Sarma, Jennifer Widom (Stanford University)
We consider the problem of defining, generating, and tracing provenance in
data-oriented workflows, in which input data sets are processed by a graph of
transformations to produce output results. We first give a new general definition
of provenance for general transformations, introducing the notions of correctness,
precision, and minimality. We then determine when properties such as correctness
and minimality carry over from the individual transformations’ provenance to
the workflow provenance. We describe a simple logical-provenance specification
language consisting of attribute mappings and filters. We provide an algorithm for
provenance tracing in workflows where logical provenance for each transformation
is specified using our language. We consider logical provenance in the relational
setting, observing that for a class of Select-Project-Join (SPJ) transformations, logical
provenance specifications encode minimal provenance. We have built a prototype
system supporting the features and algorithms presented in the paper, and we
report a few preliminary experimental results.
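As a toy rendering of the attribute-mapping-plus-filter idea (the syntax below is hypothetical, not the paper's specification language), tracing an output tuple keeps exactly the input tuples that pass the filter and agree on every mapped attribute:

```python
def trace(output_tuple, inputs, mapping, keep=lambda t: True):
    """Provenance of output_tuple: input tuples that pass the filter `keep`
    and agree on every mapped attribute. mapping: {output_attr: input_attr}."""
    return [t for t in inputs
            if keep(t) and all(t[i_attr] == output_tuple[o_attr]
                               for o_attr, i_attr in mapping.items())]

# Example: an output row produced by grouping on 'uid' traces back to the
# input rows sharing that uid.
rows = [{'uid': 1, 'v': 'a'}, {'uid': 2, 'v': 'b'}]
print(trace({'uid': 1, 'total': 10}, rows, {'uid': 'uid'}))  # -> [{'uid': 1, 'v': 'a'}]
```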
Revision Provenance in Text Documents of Asynchronous Collaboration
Jing Zhang (Twitter) H.V. Jagadish (University of Michigan)
Many text documents today are collaboratively edited, often with multiple small
changes. The problem we consider in this paper is how to find provenance for a
specific part of interest in the document. A full revision history, represented as a
version tree, can tell us about all updates made to the document, but most of these
updates may apply to other parts of the document, and hence not be relevant to
answer the provenance question at hand. In this paper, we propose the notion of a
revision unit as a flexible unit to capture the necessary provenance. We demonstrate
through experiments the capability of the revision units in keeping only relevant
updates in the provenance representation and the flexibility of the revision units in
adjusting to updates reflected in the version tree.
Research 26: Similarity Search
1:30 - 3PM
Chair: Tingjian Ge (University of Massachusetts at Lowell)
Bastille 1
Inverted Linear Quadtree: Efficient Top K Spatial Keyword Search
Chengyuan Zhang, Ying Zhang, Wenjie Zhang (University of New South Wales) Xuemin
Lin (University of New South Wales / East China Normal University)
With advances in geo-positioning technologies and geo-location services, there is a
rapidly growing amount of spatio-textual objects collected in many applications such
as web search, location based service and social network service, in which an object
is described by its spatial location and a set of keywords (terms). Consequently,
the study of spatial keyword search which explores both location and textual
description of the objects has attracted great attention from the commercial
organizations and research communities. In this paper, we study the problem of top
k spatial keyword search (TOPK-SK), which is fundamental in the spatial keyword
queries. Given a set of spatio-textual objects, a query location and a set of keywords,
the top k spatial keyword search retrieves the closest k objects each of which
contains all keywords in the query. Based on the inverted index and linear quadtree
techniques, we propose a novel index structure, called inverted linear
quadtree (IL-Quadtree), which is carefully designed to facilitate both spatial and
keyword based filtering to reduce the search space. An efficient algorithm is then
developed to tackle top k spatial keyword search. In addition, we show that the
IL-Quadtree technique can also be applied to improve the performance of other
spatial keyword queries such as the direction-aware top k spatial keyword search
and the spatio-textual ranking query. Comprehensive experiments on real and
synthetic data clearly demonstrate the efficiency of our methods.
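The linear-quadtree ingredient can be illustrated in a few lines: each quadtree cell is identified by the Morton code obtained by bit-interleaving its x/y coordinates, so spatially close cells get nearby codes and each keyword's objects can be kept in code order. This sketch shows only that building block, not the combined inverted-index filtering of IL-Quadtree:

```python
def morton_code(x, y, bits=16):
    """Interleave the bits of (x, y) into a single linear-quadtree cell key."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)        # even bit positions take x
        code |= ((y >> i) & 1) << (2 * i + 1)    # odd bit positions take y
    return code

# A per-keyword list sorted by Morton code keeps spatially close objects close:
# index[w] = sorted((morton_code(ox, oy), obj_id) for each object containing w)
```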
Similarity Query Processing for Probabilistic Sets
Ming Gao, Cheqing Jin (East China Normal University) Wei Wang (University of New
South Wales) Xuemin Lin (East China Normal University / University of New South
Wales) Aoying Zhou (East China Normal University)
Evaluating similarity between sets is a fundamental task in computer science.
However, there are many applications in which elements in a set may be uncertain
due to various reasons. Existing work on modeling probabilistic sets and computing
their similarities suffers from exponentially large model sizes or prohibitively costly
similarity computation, and hence is only applicable to tiny probabilistic sets. In this
paper, we propose a simple yet expressive model that supports many applications
where one probabilistic set may have thousands of, or hundreds of thousands
of elements. We define two types of similarities between two probabilistic sets
using the possible world semantics; they complement each other in capturing the
similarity distributions in the cross product of possible worlds. We design efficient
dynamic programming-based algorithms to calculate both types of similarities. Novel
individual and batch pruning techniques based on upper bounding the similarity
values are also proposed. To accommodate extremely large probabilistic sets, we
also design sampling-based approximate query processing methods with strong
probabilistic guarantees. We have conducted extensive experiments using both
synthetic and real datasets, and demonstrated the effectiveness and efficiency of our
proposed methods.
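Under the simple model where each element belongs to a set independently with a given probability, the possible-world semantics can be illustrated with a sampling estimator (the paper's exact dynamic-programming algorithms and pruning bounds go well beyond this sketch):

```python
import random

def expected_jaccard(A, B, samples=10_000):
    """A, B: {element: membership probability}. Monte-Carlo estimate of the
    expected Jaccard similarity over possible worlds (the Jaccard of two
    empty sets is taken as 1)."""
    total = 0.0
    for _ in range(samples):
        wa = {e for e, p in A.items() if random.random() < p}   # a world of A
        wb = {e for e, p in B.items() if random.random() < p}   # a world of B
        union = wa | wb
        total += len(wa & wb) / len(union) if union else 1.0
    return total / samples
```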
Top-k String Similarity Search with Edit-Distance Constraints
Dong Deng, Guoliang Li, Jianhua Feng (Tsinghua University) Wen-Syan Li (SAP Labs
Shanghai)
String similarity search is a fundamental operation in many areas, such as data
cleaning, information retrieval, and bioinformatics. In this paper we study the
problem of top-k string similarity search with edit-distance constraints, which, given
a collection of strings and a query string, returns the top-k strings with the smallest
edit distances to the query string. Existing methods usually try different edit-distance
thresholds and select an appropriate threshold to find top-k answers. However, it
is rather expensive to select an appropriate threshold. To address this problem,
we propose a progressive framework that improves the traditional
dynamic-programming algorithm for computing edit distance. We prune unnecessary
entries in the dynamic-programming matrix and compute only the pivotal entries. We
extend our techniques to support top-k similarity search. We develop a range-based
method that groups the pivotal entries to avoid duplicated computation.
Experimental results show that our method achieves high performance, and
significantly outperforms state-of-the-art approaches on real-world datasets.
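For context, the threshold-based dynamic program that such methods build on can be sketched as follows: fill only the diagonal band of width 2*tau+1 and abandon a candidate as soon as a whole row exceeds tau. This is the classical banded algorithm, not the paper's progressive pivotal-entry technique:

```python
def edit_distance_within(s, t, tau):
    """Return ed(s, t) if it is <= tau, else None (early-terminating banded DP)."""
    m, n = len(s), len(t)
    if abs(m - n) > tau:
        return None                     # length difference alone exceeds tau
    prev = list(range(n + 1))           # row 0: distance to the empty prefix
    for i in range(1, m + 1):
        cur = [i] + [tau + 1] * n       # cells outside the band stay capped
        for j in range(max(1, i - tau), min(n, i + tau) + 1):
            cur[j] = min(prev[j] + 1,                           # delete s[i-1]
                         cur[j - 1] + 1,                        # insert t[j-1]
                         prev[j - 1] + (s[i - 1] != t[j - 1]))  # (mis)match
        if min(cur) > tau:
            return None                 # whole row exceeds tau: abandon early
        prev = cur
    return prev[n] if prev[n] <= tau else None
```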
Research 27: Shortest and Direct Query
1:30 - 3PM
Chair: Gao Cong (NTU)
Bastille 2
On Shortest Unique Substring Queries
Jian Pei (Simon Fraser University) Wush Chi-Hsuan Wu, Mi-Yen Yeh (Academia Sinica
Taiwan)
In this paper, we tackle a novel type of interesting queries — shortest unique
substring queries. Given a (long) string S and a query point q in the string, can we
find a shortest substring containing q that is unique in S? We illustrate that shortest
unique substring queries have many potential applications, such as information
retrieval, bioinformatics, and event context analysis. We develop efficient algorithms
for online query answering. First, we present an algorithm to answer a shortest
unique substring query in O(n) time using a suffix tree index, where n is the length
of string S. Second, we show that, using O(n · h) time and O(n) space, we can
compute a shortest unique substring for every position in a given string, where h
is variable theoretically in O(n) but on real data sets often much smaller than n
and can be treated as a constant. Once the shortest unique substrings are precomputed, shortest unique substring queries can be answered online in constant
time. In addition to the solid algorithmic results, we empirically demonstrate the
effectiveness and efficiency of shortest unique substring queries on real data sets.
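The query semantics can be pinned down with a tiny brute-force baseline (the paper's suffix-tree algorithm answers the same query in O(n)): among substrings covering position q, return a shortest one that occurs exactly once in S.

```python
def occurrences(S, t):
    """Count (possibly overlapping) occurrences of t in S."""
    count, i = 0, S.find(t)
    while i != -1:
        count, i = count + 1, S.find(t, i + 1)
    return count

def shortest_unique_substring(S, q):
    """Shortest substring of S that covers index q and occurs exactly once."""
    n = len(S)
    for length in range(1, n + 1):                       # shortest first
        for start in range(max(0, q - length + 1), min(q, n - length) + 1):
            cand = S[start:start + length]
            if occurrences(S, cand) == 1:
                return cand
    return None
```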
Engineering Generalized Shortest Path Queries
Michael N. Rice, Vassilis J. Tsotras (University of California, Riverside)
Generalized Shortest Path (GSP) queries represent a variant of constrained shortest
path queries in which a solution path of minimum total cost must visit at least one
location from each of a set of specified location categories (e.g., gas stations, grocery
stores) in a specified order. This problem type has many practical applications in
logistics and personalized location-based services, and is closely related to the
NP-hard Generalized Traveling Salesman Path Problem (GTSPP). In this work, we
present a new dynamic programming formulation to highlight the structure of this
problem. Using this formulation as our foundation, we progressively engineer a fast
and scalable GSP query algorithm for use on large, real-world road networks. Our
approach incorporates concepts from Contraction Hierarchies, a well-known graph
indexing technique for static shortest path queries. To demonstrate the practicality
of our algorithm we experimented on the North American road network (with
over 50 million edges) where we achieved up to several orders of magnitude speed
improvements over the previous-best algorithm, depending on the relative sizes of
the location categories.
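The dynamic-programming view can be sketched without the Contraction Hierarchies machinery: process the required categories in order, and after each stage keep, for every vertex, the cheapest cost of a path that has already visited one location from every category so far. A minimal sketch, assuming a dictionary-based adjacency list:

```python
import heapq

def dijkstra(graph, sources):
    """graph: {v: [(w, cost), ...]}; sources: {v: initial_cost}."""
    dist, pq = dict(sources), [(c, v) for v, c in sources.items()]
    heapq.heapify(pq)
    while pq:
        c, v = heapq.heappop(pq)
        if c > dist.get(v, float('inf')):
            continue                              # stale queue entry
        for w, cost in graph.get(v, ()):
            if c + cost < dist.get(w, float('inf')):
                dist[w] = c + cost
                heapq.heappush(pq, (dist[w], w))
    return dist

def gsp_cost(graph, source, target, categories):
    """categories: ordered list of vertex sets; visit one member of each."""
    layer = dijkstra(graph, {source: 0.0})        # stage 0: plain shortest paths
    for cat in categories:                        # stage i: category i visited
        seeds = {v: layer[v] for v in cat if v in layer}
        layer = dijkstra(graph, seeds)
    return layer.get(target)                      # None if infeasible
```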
Efficient Direct Search on Compressed Genomic Data
Xiaochun Yang, Bin Wang (Northeastern University) Chen Li (University of California,
Irvine) Jiaying Wang (Northeastern University) Xiaohui Xie (University of California, Irvine)
The explosive growth in the amount of data produced by next-generation
sequencing poses significant computational challenges on how to store, transmit
and query these data, efficiently and accurately. A unique characteristic of the
genomic sequence data is that many of them can be highly similar to each other,
which has motivated the idea of compressing sequence data by storing only their
differences to a reference sequence, thereby drastically cutting the storage cost.
However, an unresolved question in this area is whether it is possible to perform
search directly on the compressed data, and if so, how. Here we show that directly
querying compressed genomic sequence data is possible and can be done efficiently.
We describe a set of novel index structures and algorithms for this purpose, and
present several optimization techniques to reduce the space requirement and query
response time. We demonstrate the advantage of our method and compare it
against existing ones through a thorough experimental study on real genomic data.
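A toy version of reference-based compression conveys the storage idea; the paper's contribution is the index structures that allow searching such data without decompression. This sketch assumes equal-length sequences and substitution-only differences:

```python
def compress(reference, sequence):
    """Edits (position, new_char) relative to the reference."""
    return [(i, c) for i, (r, c) in enumerate(zip(reference, sequence)) if r != c]

def decompress(reference, edits):
    seq = list(reference)
    for i, c in edits:
        seq[i] = c
    return ''.join(seq)
```

Because genomes differ from the reference in relatively few positions, the edit list is tiny compared with the sequence itself.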
Research 28: Skyline and Snapshot Query
1:30 - 3PM
Chair: Reynold Cheng (University of Hong Kong)
Concorde
On Answering Why-not Questions in Reverse Skyline Queries
Md. Saiful Islam, Rui Zhou, Chengfei Liu (Swinburne University of Technology)
This paper aims at answering the so called why-not questions in reverse skyline
queries. A reverse skyline query retrieves all data points whose dynamic skylines
contain the query point. We outline the benefit and the semantics of answering
why-not questions in reverse skyline queries. In connection with this, we show how
to modify the why-not point and the query point to include the why-not point in
the reverse skyline of the query point. We then show how a query point can be
positioned safely anywhere within a region (called the safe region) without losing
any of the existing reverse skyline points. We also show how to answer why-not
questions considering the safe region of the query point. Our approach efficiently
combines both query point and data point modification techniques to produce
meaningful answers. Experimental results also demonstrate that our approach can
produce high quality explanations for why-not questions in reverse skyline queries.
Layered Processing of Skyline-Window-Join (SWJ) Queries using Iteration-Fabric
Mithila Nagendra, K. Selçuk Candan (Arizona State University)
The problem of finding interesting tuples in a data set, more commonly known
as the skyline problem, has been extensively studied in scenarios where the
data is static. More recently, skyline research has moved towards data streaming
environments, where tuples arrive/expire in a continuous manner. Several algorithms
have been developed to track skyline changes over sliding windows; however,
existing methods focus on skyline analysis in which all required skyline attributes
belong to a single incoming data stream. This constraint renders current algorithms
unsuitable for applications that require a real-time “join” operation to be carried out
between multiple incoming data streams, arriving from different sources, before the
skyline query can be answered. Based on this motivation, in this paper, we address
the problem of computing skyline-window-join (SWJ) queries over pairs of data
streams, considering sliding windows that take into account only the most recent
tuples. In particular, we propose a Layered Skyline-window-Join (LSJ) operator that
(a) partitions the overall process into processing layers and (b) maintains
skyline-join results in an incremental manner by continuously monitoring the changes in
all layers of the process. We combine the advantages of existing skyline methods
(including those that efficiently maintain skyline results over a single stream, and
those that compute the skyline of pairs of static data sets) to develop a novel
iteration-fabric skyline-window-join processing structure. Using the iteration-fabric,
LSJ eliminates redundant work across consecutive windows by leveraging shared
data across all iteration layers of the windowed skyline-join processing. To the best of
our knowledge, this is the first paper that addresses join-based skyline queries over
sliding windows. Extensive experimental evaluations over real and simulated data
show that LSJ provides large gains over naive extensions of existing schemes which
are not designed to eliminate redundant work across multiple processing layers.
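The primitive underneath any such operator is skyline maintenance under insertions; a minimal single-stream version is sketched below. LSJ's layered, join-aware processing over two streams is the paper's contribution and is not shown; note also that in a real sliding window, dominated tuples must be retained until their dominators expire.

```python
def dominates(a, b):
    """a dominates b: no worse in every dimension, strictly better in one
    (assuming smaller values are better)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def insert_into_skyline(skyline, new_tuple):
    if any(dominates(s, new_tuple) for s in skyline):
        return skyline                            # dominated on arrival
    return [s for s in skyline if not dominates(new_tuple, s)] + [new_tuple]
```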
Efficient Snapshot Retrieval over Historical Graph Data
Udayan Khurana, Amol Deshpande (University of Maryland)
We present a distributed graph database system to manage historical data for large
evolving information networks, with the goal of enabling temporal and evolutionary
queries and analysis. The cornerstone of our system is a novel, user-extensible,
highly tunable, and distributed hierarchical index structure called “DeltaGraph”,
that enables compact recording of the historical network information, and that
supports efficient retrieval of historical graph snapshots for single-site or parallel
processing. Our system exposes a general programmatic API to process and analyze
the retrieved snapshots. Along with the original graph data, DeltaGraph can also
maintain and index “auxiliary” information; this functionality can be used to extend
the structure to efficiently execute queries like “subgraph pattern matching” over
historical data. We develop analytical models for both the storage space needed and
the snapshot retrieval times to aid in choosing the right construction parameters
for a specific scenario. We also present an in-memory graph data structure called
“GraphPool” that can maintain hundreds of historical graph instances in main
memory in a non-redundant manner. We present a comprehensive experimental
evaluation that illustrates the effectiveness of our proposed techniques at managing
historical graph information.
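A toy version of delta-based snapshot retrieval conveys the basic mechanism (DeltaGraph's actual index is hierarchical, distributed, and tunable in what it materializes): keep one materialized edge set plus a time-ordered event log and replay deltas up to the requested time.

```python
def snapshot_at(base_edges, deltas, t):
    """base_edges: edge set at time 0; deltas: time-ordered list of
    (timestamp, 'add' | 'del', edge). Returns the edge set as of time t."""
    edges = set(base_edges)
    for ts, op, edge in deltas:
        if ts > t:
            break                                 # later deltas are irrelevant
        if op == 'add':
            edges.add(edge)
        else:
            edges.discard(edge)
    return edges
```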
Seminar 8: Shallow Information Extraction for the Knowledge Web
1:30 - 3PM
Odeon
Denilson Barbosa (University of Alberta) Haixun Wang (Microsoft Research Asia) Cong Yu
(Google Inc.)
A new breed of Information Extraction tools has become popular and shown to be
very effective in building massive-scale knowledge bases that fuel applications such
as question answering and semantic search. These approaches rely on Web-scale
probabilistic models populated through shallow language processing of the text, pre-existing knowledge, and structured data already on the Web. This tutorial provides
an introduction to these techniques, starting from the foundations of information
extraction, and covering some of its key applications.
Demo Groups 3 & 4
1:30 - 3PM
See Demo Groups 3 & 4 (p. 90) for demonstration details.
Ballroom 3
Research 29: Large Graph Indexing
3:30 - 5PM
Chair: James Cheng (The Chinese University of Hong Kong)
St Germaine
FERRARI: Flexible and Efficient Reachability Range Assignment for Graph Indexing
Stephan Seufert, Avishek Anand (Max Planck Institute for Informatics) Srikanta Bedathur
(IIIT Delhi) Gerhard Weikum (Max Planck Institute for Informatics)
In this paper, we propose a scalable and highly efficient index structure for the
reachability problem over graphs. We build on the well-known node interval labeling
scheme where the set of vertices reachable from a particular node is compactly
encoded as a collection of node identifier ranges. We impose an explicit bound on
the size of the index and flexibly assign approximate
reachability ranges to nodes of the graph such that the number of index probes
to answer a query is minimized. The resulting tunable index structure generates a
better range labeling if the space budget is increased, thus providing a direct control
over the trade-off between index size and query processing performance. By
using a fast recursive querying method in conjunction with our index structure,
we show that in practice, reachability queries can be answered in the order of
microseconds on an off-the-shelf computer, even for massive-scale real-world
graphs. Our claims are supported by an extensive set of experimental results
using a multitude of benchmark and real-world web-scale graph datasets.
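The underlying interval-labeling idea is easiest to see on a tree, where it is exact; FERRARI's contribution is extending it to general graphs with a budgeted mix of exact and approximate ranges. A post-order DFS gives each node an id and the interval of ids in its subtree, and u reaches v iff v's id lies in u's interval:

```python
def label_tree(children, root):
    """Post-order ids; labels[u] = (lowest id in u's subtree, u's own id)."""
    labels, counter = {}, [0]
    def dfs(u):
        lo = counter[0]
        for c in children.get(u, []):
            dfs(c)
        labels[u] = (lo, counter[0])              # u's id is assigned last
        counter[0] += 1
    dfs(root)
    return labels

def reaches(labels, u, v):
    lo, hi = labels[u]
    return lo <= labels[v][1] <= hi               # v's id inside u's interval
```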
gIceberg: Towards Iceberg Analysis in Large Graphs
Nan Li, Ziyu Guan, Lijie Ren (University of California at Santa Barbara) Jian Wu (Zhejiang
University) Jiawei Han (University of Illinois at Urbana-Champaign) Xifeng Yan (University
of California at Santa Barbara)
Traditional multi-dimensional data analysis techniques such as iceberg cube cannot
be directly applied to graphs for finding interesting or anomalous vertices due to
the lack of dimensionality in graphs. In this paper, we introduce the concept of graph
icebergs that refer to vertices for which the concentration (aggregation) of an
attribute in their vicinities is abnormally high. Intuitively, these vertices should be “close”
to the attribute of interest in the graph space. Based on this intuition, we propose a
novel framework, called gIceberg, which performs aggregation using random walks,
rather than traditional SUM and AVG aggregate functions. This proposed framework
scores vertices by their different levels of interestingness and finds important
vertices that meet a user-specified threshold. To improve scalability, two aggregation
strategies, forward and backward aggregation, are proposed with corresponding
optimization techniques and bounds. Experiments on both real-world and synthetic
large graphs demonstrate that gIceberg is effective and scalable.
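Aggregation by random walk can be sketched with the standard Monte-Carlo estimator of personalized-PageRank mass on attribute vertices; gIceberg's forward/backward aggregation strategies and their bounds are the paper's contribution and are not reproduced here.

```python
import random

def ppr_attribute_score(neighbors, has_attr, v, alpha=0.15, walks=2000):
    """Fraction of restart-walks from v that terminate on a vertex carrying
    the attribute. neighbors: {v: [w, ...]}; has_attr: vertex -> bool."""
    hits = 0
    for _ in range(walks):
        cur = v
        while random.random() > alpha:            # stop with probability alpha
            nbrs = neighbors.get(cur)
            if not nbrs:
                break                             # dead end: stay put
            cur = random.choice(nbrs)
        hits += 1 if has_attr(cur) else 0
    return hits / walks                           # high score: iceberg candidate
```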
Top-k Graph Pattern Matching over Large Graphs
Jiefeng Cheng (Chinese Academy of Sciences / Shenzhen Key Laboratory of High
Performance Data Mining) Xianggang Zeng (Chinese Academy of Sciences) Jeffrey Xu Yu
(The Chinese University of Hong Kong)
There exist many graph-based applications including bioinformatics, social science,
link analysis, citation analysis, and collaborative work. All need to deal with a large
data graph. Given a large data graph, in this paper, we study finding top-k answers
for a graph pattern query; in particular, we focus on top-k cyclic graph queries,
where the query graph is cyclic and can be complex. The capability of supporting top-k
graph pattern matching (kGPM) over a data graph provides much more flexibility
for a user to search graphs, and the problem itself is challenging. In this paper,
we propose a new framework of processing kGPM with on-the-fly ranked lists
based on spanning trees of the cyclic graph query. We observe a multidimensional
representation for using multiple ranked lists to answer a given kGPM query. Under
this representation, we propose a cost model to estimate the least number of tree
answers to be consumed in each ranked list in order to answer a given kGPM query
Q. This leads to a query optimization approach for kGPM processing, and a top-k
algorithm to process kGPM with the optimal query plan. We conducted extensive
performance studies using a real dataset, and we confirm the efficiency of our
proposed approach.
Research 30: Web Data
3:30 - 5PM
Bastille 1
Breaking the Top-k Barrier of Hidden Web Databases
Saravanan Thirumuruganathan (University of Texas at Arlington) Nan Zhang (George
Washington University) Gautam Das (University of Texas at Arlington / Qatar Computing
Research Institute)
A large number of web databases are only accessible through proprietary form-like interfaces, which require users to query the system by entering desired values
for a few attributes. A key restriction enforced by such an interface is the top-k
output constraint - i.e., when there are a large number of matching tuples, only a
few (top-k) of them are preferentially selected and returned by the website, often
according to a proprietary ranking function. Since most web database owners set k
to be a small value, the top-k output constraint prevents many interesting third-party
(e.g., mashup) services from being developed over real-world web databases. In this
paper we consider the novel problem of “digging deeper” into such web databases.
Our main contribution is the meta-algorithm GetNext that can retrieve the next
ranked tuple from the hidden web database using only the restrictive interface of
a web database without any prior knowledge of its ranking function. This algorithm
can then be called iteratively to retrieve as many top ranked tuples as necessary.
We develop principled and efficient algorithms that are based on generating and
executing multiple reformulated queries and inferring the next ranked tuple from
their returned results. We provide theoretical analysis of our algorithms, as well as
extensive experimental results over synthetic and real-world databases that illustrate
the effectiveness of our techniques.
Automatic Extraction of Top-k Lists from the Web
Zhixian Zhang, Kenny Q. Zhu (Shanghai Jiao Tong University) Haixun Wang, Hongsong Li
(Microsoft Research Asia)
This paper is concerned with information extraction from top-k web pages, which
are web pages that describe top k instances of a topic, and usually the topic is of
general interest. Examples include “the 10 tallest buildings in the world”, “the 50 hits
of 2010 you don’t want to miss”, etc. Compared to other structured information
on the web (including web tables), information in top-k lists is larger and richer,
of higher quality, and generally more interesting. Top-k lists are therefore highly
valuable: for example, they can help enrich open-domain knowledge bases (to support
applications such as search or fact answering). In this paper, we present an efficient
method that extracts top-k lists from web pages with high performance. Specifically,
we extract more than 1.6 million top-k lists from a web corpus of 1.7 billion pages
with 92% precision and 72% recall.
Finding Interesting Correlations with Conditional Heavy Hitters
Katsiaryna Mirylenka, Themis Palpanas (University of Trento) Graham Cormode, Divesh
Srivastava (AT&T Labs - Research)
The notion of heavy hitters (items that make up a large fraction of the population) has been successfully used in a variety of applications across sensor and RFID
monitoring, network data analysis, event mining, and more. Yet this notion often fails
to capture the semantics we desire when we observe data in the form of correlated
pairs. Here, we are interested in items that are conditionally frequent: when a
particular item is frequent within the context of its parent item. In this work, we
introduce and formalize the notion of Conditional Heavy Hitters to identify such
items. We introduce several streaming algorithms, which allow us to find conditional
heavy hitters efficiently, and provide analytical results. Different algorithms are
successful for different input characteristics. We perform an experimental evaluation
to demonstrate the efficacy of our methods, and to study which algorithms are
most suited for different types of data.
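One simple way to render the definition in streaming form (the paper develops and compares several algorithms; this sketch is illustrative, not theirs) is to track approximate counts of parents and of (parent, child) pairs with Misra-Gries summaries, and report children that are frequent within their parent:

```python
class MisraGries:
    """Classic deterministic heavy-hitter summary with at most k counters."""
    def __init__(self, k):
        self.k, self.counts = k, {}
    def add(self, x):
        if x in self.counts:
            self.counts[x] += 1
        elif len(self.counts) < self.k:
            self.counts[x] = 1
        else:
            for key in list(self.counts):         # decrement-all step
                self.counts[key] -= 1
                if self.counts[key] == 0:
                    del self.counts[key]

def conditional_heavy_hitters(stream, k, phi):
    """stream of (parent, child) pairs; report pairs whose approximate
    frequency within their parent is at least phi."""
    parents, pairs = MisraGries(k), MisraGries(k)
    for parent, child in stream:
        parents.add(parent)
        pairs.add((parent, child))
    return [(p, c) for (p, c), n in pairs.counts.items()
            if parents.counts.get(p, 0) > 0 and n / parents.counts[p] >= phi]
```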
Research 31: Query Optimization
3:30 - 5PM
Chair: Fabian Hueske (TU Berlin)
Bastille 2
Predicting Query Execution Time: Are Optimizer Cost Models Really Unusable?
Wentao Wu (University of Wisconsin-Madison) Yun Chi, Shenghuo Zhu, Junichi Tatemura,
Hakan Hacıgümüş (NEC Laboratories America) Jeffrey Naughton (University of
Wisconsin Madison)
Predicting query execution time is useful in many database management issues
including admission control, query scheduling, progress monitoring, and system sizing.
Recently the research community has been exploring the use of statistical machine
learning approaches to build predictive models for this task. An implicit assumption
behind this work is that the cost models used by query optimizers are insufficient
for query execution time prediction. In this paper we challenge this assumption and
show that while the simple approach of scaling the optimizer’s estimated cost indeed fails,
a properly calibrated optimizer cost model is surprisingly effective. However, even
a well-tuned optimizer cost model will fail in the presence of errors in cardinality
estimates. Accordingly we investigate the novel idea of spending extra resources to
refine estimates for the query plan after it has been chosen by the optimizer but
before execution. In our experiments we find that a well calibrated query optimizer
model along with cardinality estimation refinement provides a low-overhead way to
produce estimates that are always competitive and often much better than the best
reported numbers from the machine learning approaches.
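The calibration idea can be sketched in a few lines: a PostgreSQL-style optimizer cost is a linear combination of unit counts (sequential page reads, random page reads, tuples processed, and so on), so regressing observed execution times on those counts yields calibrated per-unit times. The counts and timings below are placeholders, not measurements from the paper.

```python
import numpy as np

# Per-query counts of each cost unit (placeholder values):
# columns: sequential pages read, random pages read, tuples processed
unit_counts = np.array([[1000.0,  10.0, 50000.0],
                        [ 200.0, 400.0, 10000.0],
                        [5000.0,   5.0, 90000.0]])
observed_ms = np.array([120.0, 310.0, 450.0])     # measured execution times

# Calibrated per-unit times (ms), least squares with no intercept.
unit_times, *_ = np.linalg.lstsq(unit_counts, observed_ms, rcond=None)

def predict_ms(counts):
    """Predicted execution time for a plan with the given unit counts."""
    return float(np.dot(counts, unit_times))
```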
Query Optimization for Differentially Private Data Management Systems
Shangfu Peng (University of Maryland) Yin Yang, Zhenjie Zhang (Advanced Digital
Sciences Center) Marianne Winslett (Advanced Digital Sciences Center / University of
Illinois at Urbana-Champaign) Yong Yu (Shanghai Jiao Tong University)
Differential privacy (DP) enables publishing the results of statistical queries over
sensitive data, with rigorous privacy guarantees, and very conservative assumptions
about the adversary’s background knowledge. This paper focuses on the interactive
DP framework, which processes incoming queries on the fly, each of which
consumes a portion of the user-specified privacy budget. Existing systems process
each query independently, which often leads to considerable privacy budget waste
and consequently fast exhaustion of the total budget. Motivated by this, we propose
Pioneer, a query optimizer for an interactive, DP-compliant DBMS. For each new
query, Pioneer creates an execution plan that combines past query results and new
results from the underlying data. When a query has multiple semantically equivalent
plans, Pioneer automatically selects one with minimal privacy budget consumption.
Extensive experiments confirm that Pioneer achieves significant savings of the
privacy budget, and can answer many more queries than existing systems for a fixed
total budget, with comparable result accuracy.
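One concrete saving such a planner can exploit is a standard fact about the Laplace mechanism (this sketch is illustrative, not Pioneer's actual plan enumeration): past noisy answers to the same query can be combined by inverse-variance weighting at no additional budget, since Lap(1/eps) noise has variance 2/eps^2.

```python
import math, random

def laplace_answer(true_value, eps, sensitivity=1.0):
    """Laplace mechanism: add noise with scale sensitivity/eps."""
    u = random.random() - 0.5
    return true_value - (sensitivity / eps) * math.copysign(
        math.log(1.0 - 2.0 * abs(u)), u)

def combine(answers):
    """answers: [(noisy_value, eps), ...] for the same true quantity.
    Var(Lap(1/eps)) = 2/eps^2, so inverse-variance weights are eps^2; the
    result is as accurate as one fresh query at eps' = sqrt(sum eps_i^2)."""
    total_weight = sum(eps * eps for _, eps in answers)
    return sum(v * eps * eps for v, eps in answers) / total_weight
```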
Top Down Plan Generation: From Theory to Practice
Pit Fender, Guido Moerkotte (University of Mannheim)
Finding the optimal execution order of join operations is a crucial task of today’s
cost-based query optimizers. There are two approaches to identify the best plan:
bottom-up and top-down join enumeration. But only the top-down approach
allows for branch-and-bound pruning, which can improve compile time by several
orders of magnitude while still preserving optimality. For both optimization
strategies, efficient enumeration algorithms have been published. However, there
are two severe limitations for the top-down approach: The published algorithms can
handle only (1) simple (binary) join predicates and (2) inner joins. Since real queries
may contain complex join predicates involving more than two relations, and outer
joins as well as other non-inner joins, efficient top-down join enumeration cannot be
used in practice yet. We develop a novel top-down join enumeration algorithm that
overcomes these two limitations. Furthermore, we show that our new algorithm is
competitive when compared with the state of the art in bottom-up processing even
without playing out its advantage by making use of its branch-and-bound pruning
capabilities.
Research 32: Data Storage
3:30 - 5PM
Chair: Mohamed Sharaf (University of Queensland)
Concorde
TBF: A Memory-Efficient Replacement Policy for Flash-based Caches
Cristian Ungureanu, Biplob Debnath, Steve Rago, Akshat Aranya (NEC Laboratories
America)
The performance and capacity characteristics of flash storage make it attractive to
use as a cache. Recency-based cache replacement policies rely on an in-memory
full index, typically a B-tree or a hashtable, that maps each object to its recency
information. Even though the recency information itself may take very little space,
the full index for a cache holding N keys requires at least log N bits per key. This
metadata overhead is undesirably high when used for very large flash-based caches,
such as key-value stores with billions of objects. To solve this problem, we propose
a new recency-based RAM-frugal cache replacement policy that approximates
the least-recently-used (LRU) policy. It uses two in-memory Bloom sub-filters
(TBF) for maintaining the recency information and leverages an on-flash key-value
store to cache objects. TBF requires only one byte of RAM per cached object,
making it suitable for implementing very large flash-based caches. We evaluate
TBF through simulation on traces from several block stores and key-value stores,
as well as evaluate it using the Yahoo! Cloud Serving Benchmark in a real system
implementation. Evaluation results show that TBF achieves cache hit rate and
operations per second comparable to those of LRU in spite of its much smaller
memory requirements.
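The two-sub-filter idea as the abstract describes it can be sketched as follows (sizes and hash choices below are hypothetical): accesses are recorded in an active Bloom filter; when it fills, the filters swap roles and the new active one starts empty, so a key counts as recent if it is in either filter, and eviction candidates are keys found in neither.

```python
import hashlib

class BloomFilter:
    def __init__(self, bits=1 << 20, hashes=4):
        self.bits, self.hashes = bits, hashes
        self.array = bytearray(bits // 8)
        self.inserted = 0
    def _positions(self, key):
        for i in range(self.hashes):
            h = hashlib.blake2b(f"{i}:{key}".encode(), digest_size=8).digest()
            yield int.from_bytes(h, "big") % self.bits
    def add(self, key):
        for p in self._positions(key):
            self.array[p // 8] |= 1 << (p % 8)
        self.inserted += 1
    def __contains__(self, key):
        return all((self.array[p // 8] >> (p % 8)) & 1
                   for p in self._positions(key))

class TBF:
    """Approximate LRU recency via two rotating Bloom sub-filters."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.active, self.shadow = BloomFilter(), BloomFilter()
    def touch(self, key):                         # record an access
        self.active.add(key)
        if self.active.inserted >= self.capacity:  # active "full": rotate
            self.shadow, self.active = self.active, BloomFilter()
    def recently_used(self, key):                  # eviction skips recent keys
        return key in self.active or key in self.shadow
```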
Fast Peak-to-Peak Behavior with SSD Buffer Pool
Jaeyoung Do (University of Wisconsin-Madison) Donghui Zhang (Paradigm4) Jignesh M.
Patel (University of Wisconsin-Madison) David DeWitt (Microsoft Jim Gray Systems Lab)
A promising use of flash SSDs in a DBMS is to extend the main memory buffer
pool by caching selected pages that have been evicted from the buffer pool. Such
a use has been shown to produce significant gains in the steady state performance
of the DBMS. One strategy for using the SSD buffer pool is to throw away the data
in the SSD when the system is restarted (either when recovering from a crash or
restarting after a shutdown), and consequently a long “ramp-up” period to regain
peak performance is needed. One approach to eliminate this limitation is to use a
memory-mapped file to store the SSD buffer table in order to be able to restore its
contents on restart. However, this design can result in lower sustained performance,
because every update to the SSD buffer table may incur an I/O operation to the
memory-mapped file. In this paper we propose two new alternative designs. One
design reconstructs the SSD buffer table using transactional logs. The other design
asynchronously flushes the SSD buffer table, and upon restart, lazily verifies the
integrity of the data cached in the SSD buffer pool. We have implemented these
three designs in SQL Server 2012. For each design, both the write-through and
write-back SSD caching policies were implemented. Using two OLTP benchmarks
(TPC-C and TPC-E), our experimental results show that our designs speed up the
peak-to-peak interval by up to 3.8X with negligible performance loss; in contrast,
the previous approach achieves a similar speedup but up to 54% performance loss.
SELECT Triggers for Data Auditing
Daniel Fabbri (University of Michigan) Ravi Ramamurthy, Raghav Kaushik (Microsoft
Research)
Auditing is a key part of the security infrastructure in a database system. While
commercial database systems provide mechanisms such as triggers that can be used
to track and log any changes made to “sensitive” data using UPDATE queries, they
are not useful for tracking accesses to sensitive data using complex SQL queries,
which is important for many applications given recent laws such as HIPAA. In this
paper, we propose the notion of SELECT triggers that extends triggers to work
for SELECT queries in order to facilitate data auditing. We discuss the challenges in
integrating SELECT triggers into a database system, including specification and
semantics, as well as efficient implementation techniques. We have prototyped our framework
in a commercial database system and present an experimental evaluation of our
framework using the TPC-H benchmark.
Seminar 9: Secure and Privacy-Preserving Database Services in the
Cloud
3:30 - 5PM
Odeon
Divyakant Agrawal, Amr El Abbadi, Shiyuan Wang (University of California, Santa Barbara)
Cloud computing has become a very successful paradigm for data computing and
storage. Increasing concerns about data security and privacy in the cloud, however,
have arisen. Ensuring security and privacy for data management and query
processing in the cloud is critical for better and broader uses of the cloud. This
tutorial covers recent research on cloud security and privacy, while focusing on the
works that protect data confidentiality and query access privacy for sensitive data
being stored and queried in the cloud. We provide a comprehensive study of state-of-the-art schemes and techniques for protecting data confidentiality and access
privacy, and explain their tradeoffs in security, privacy, functionality and performance.
Poster Session
3:30 - 6PM
Ballroom 1 & 2
TRANSPORT
How to travel to/from airport and venue
Train
The Airtrain runs to/from Brisbane Domestic and International Airports, with a
travel time of just 22 minutes to Central Station. A one-way single adult ticket
costs AUD$15.00. For the Airtrain timetable please visit translink.com.au
Weekends: Airport to Central Station: first train 6am – last train 10pm; Central Station to Airport: first train 5am – last train 9pm
Mon-Fri: Airport to Central Station: first train 5.40am – last train 10pm; Central Station to Airport: first train 5am – last train 9pm
Taxi
Fares vary due to distance, traffic conditions and time, however, you can anticipate
that a fare to / from Brisbane’s CBD and Brisbane Airport will total approximately
AUD$40.00.
Public Transport in Brisbane – Buses, Trains and Ferries
For timetables, journey planner and other details, go to: translink.com.au
go card is TransLink's electronic ticket. It allows you to travel seamlessly on all bus, train
and ferry services. Buy your go card at train stations, most newsagents or online via
https://gocard.translink.com.au/webtix/
SOCIAL PROGRAM
IEEE TCDE Members Reception
Monday 8 April
7 - 9 PM at the Summit Restaurant,
1012 Sir Samuel Griffith Drive, Brisbane Lookout, Mt Coot-tha
6:30pm bus pickup from ICDE conference hotel (Sofitel), and departure from
restaurant back to Sofitel at 9.15pm.
New and current TCDE members are invited to join us for the 2013 TCDE
Members Reception.
Welcome Reception
Tuesday 9 April
5:30 - 7 PM in the Ann Street Lobby, Sofitel Hotel
Banquet
Wednesday 10 April
6:30 - 10 PM at Brisbane City Hall
Main Auditorium, King George Square, Brisbane CBD
The Brisbane City Hall is located 600m east of the Sofitel Hotel, about an
8-minute walk. Directions are available from the Registration Desk.
Posters & Farewell Drinks
Thursday 11 April
Posters: 3.30 – 6pm
with drinks from 5 - 6 PM in Ballroom 1 & 2, Sofitel Hotel
REGISTRATION & INFORMATION DESK
The registration & information desk for the conference is located in the Ann Street
Lobby of the Sofitel Hotel.
The information desk will be open at the following times:
Sunday 7 April: 4 PM – 8 PM
Monday 8 April: 7 AM – 9 PM
Tuesday 9 April: 7 AM – 7 PM
Wednesday 10 April: 8 AM – 6 PM
Thursday 11 April: 8 AM – 6 PM
Event Coordinator: Kathleen Williamson
Phone: 0401 477 509
Email: [email protected]
Volunteers
Volunteers will be available to help with any questions during the conference. They
may be identified by their black ICDE-13 shirts.
ICDE-13 Student Travel Award Winners
To claim the AUD600 award, please email [email protected], or visit the
Registration Desk during the conference.
Internet Access
Free Wi-Fi Internet access is provided on the conference floor for delegates. For
access details please visit the Registration Desk.
Handy Brisbane Apps
Including AirTrain, bikes, taxis, public transport, maps, news, weather and food:
www.brisbanemarketing.com.au/Resources/Convention-Support-Toolkit/pages/
Delegate-Experience/Handy-Brisbane-Tourist-Apps
ICDE 2014
ICDE 2014 will be held from 7 – 11 April 2014 at the Intercontinental Marriott
Downtown, 540 North Michigan Avenue, Chicago, IL, USA. Please contact Goce
Trajcevski ([email protected]), Northwestern University.
VOLUNTEERS
ICDE-13 would like to extend warm appreciation to our conference volunteers,
who assisted before, during and after the conference to help make sure that
everyone enjoys a great conference experience. These volunteers welcome
participants, give directions, help in the sessions and on the registration desk,
and generally make sure the conference is running smoothly. At the conference,
volunteers may be identified by their black ICDE-13 shirts.
Chao Gu, The University of Queensland
Guanfeng Liu, Macquarie University
Hamed Hassanzadeh, The University of Queensland
Han Ada Su, The University of Queensland
Haozhou Wang, The University of Queensland
Hongyun Cai, The University of Queensland
Jiajie Yue, The University of Queensland
Jiping Tracy Wang, The University of Queensland
Kun Zhao, The University of Queensland
Liangchen Liu, The University of Queensland
Litao Yu, The University of Queensland
Marina Drosou, University of Ioannina
Mukhammad Andri Setiawan, The University of Queensland
Sayan Unankard, The University of Queensland
Vinita Nahar, The University of Queensland
Xuefei Li, The University of Queensland
Yunfei Shi, UQ Business School
ICDE 2013 COMMITTEES
Organizing Committee
General Chairs
Rao Kotagiri (The University of Melbourne, Australia)
Beng Chin Ooi (National University of Singapore, Singapore)
Program Chairs
Christian S. Jensen (Aarhus University, Denmark)
Chris Jermaine (Rice University, USA)
Xiaofang Zhou (The University of Queensland, Australia)
Workshop Chairs
Chee Yong Chan (National University of Singapore, Singapore)
Kjetil Nørvåg (Norwegian University of Science and Technology, Norway)
Proceedings Chairs
Jiaheng Lu (Renmin University of China, China)
Egemen Tanin (The University of Melbourne, Australia)
Industry Chairs
Sang Cha (Seoul National University, Korea)
Haixun Wang (Microsoft Research Asia, China)
Ph.D. Symposium Chairs
Gottfried Vossen (University of Münster, Germany)
Min Wang (HP Labs China, China)
Seminar Chair
Alexandros Labrinidis (University of Pittsburgh, USA)
Panel Chairs
Dimitrios Georgakopoulos (CSIRO, Australia)
Jun Yang (Duke University, USA)
Demo Chairs
Yoshiharu Ishikawa (Nagoya University, Japan)
Rui Zhang (The University of Melbourne, Australia)
Yanchun Zhang (Victoria University, Australia)
Poster Chair
Wook-Shin Han (Kyungpook National University, Korea)
Local Organization Chairs
Shazia Sadiq (The University of Queensland, Australia)
Heng Tao Shen (The University of Queensland, Australia)
Finance Chair
Marta Indulska (The University of Queensland, Australia)
Web and Publicity Chair
Mohamed Sharaf (The University of Queensland, Australia)
Program Committee
Program Chairs
Christian S. Jensen (Aarhus University, Denmark)
Chris Jermaine (Rice University, USA)
Xiaofang Zhou (The University of Queensland, Australia)
Track Chairs
Wolfgang Lehner (Dresden University of Technology, Germany)
Data warehousing, analytics, MapReduce, and big data
Xin (Luna) Dong (AT&T Labs - Research, USA)
Data Integration, metadata management, interoperability
Jian Pei (Simon Fraser University, Canada)
Data Mining and knowledge discovery: algorithms
Srinivasan Parthasarathy (Ohio State University, USA)
Data Mining and knowledge discovery: applications
Panos Chrysanthis (University of Pittsburgh, USA)
Cloud infrastructure, mobile, distributed, and peer-to-peer data management
Paul Larson (Microsoft Research, USA)
Indexing and Storage
Pierangela Samarati (University of Milan, Italy)
Privacy and Security
Amol Deshpande (University of Maryland, USA)
Query processing and query optimization
Magda Balazinska (University of Washington, USA)
Scientific data and data visualization
Jeffrey Xu Yu (The Chinese University of Hong Kong, China)
Semistructured data, RDF, XML
Xifeng Yan (University of California at Santa Barbara, USA)
Social networks, web, and personal information management
Simonas Saltenis (Aalborg University, Denmark)
Spatial, temporal, and multimedia data
Nesime Tatbul (ETH Zurich, Switzerland)
Streams, sensor networks, and complex events processing
Alan Fekete (The University of Sydney, Australia)
Systems, performance, and transaction management
Vagelis Hristidis (University of California at Riverside, USA)
Text, graphs, and search
Xuemin Lin (The University of New South Wales, Australia)
Uncertain and probabilistic data
Research Program Committee Members
Karl Aberer, EPFL, Switzerland
Ashraf Aboulnaga, University of Waterloo,
Canada
Yanif Ahmad, Johns Hopkins University,
USA
Gustavo Alonso, ETH Zurich, Switzerland
Walid Aref, Purdue University, USA
Ismail Ari, Ozyegin University, Turkey
Ira Assent, Aarhus University, Denmark
Sitaram Asur, HP Research, USA
Shivnath Babu, Duke University, USA
Torben Bach Pedersen, Aalborg University,
Denmark
James Bailey, The University of Melbourne,
Australia
Phil Bernstein, Microsoft Research, USA
Sourav Saha Bhowmick, Nanyang
Technological University, Singapore
Peter Boncz, CWI, The Netherlands
K. Selcuk Candan, Arizona State
University, USA
Kaushik Chakrabarti, Microsoft Research,
USA
Badrish Chandramouli, Microsoft
Research, USA
Kevin Chang, University of Illinois at
Urbana-Champaign, USA
Lijun Chang, Chinese University of Hong
Kong, China
Sanjay Chawla, The University of Sydney,
Australia
Lei Chen, Hong Kong University of
Science and Technology, China
Shimin Chen, HP Labs China, China
Yi Chen, Arizona State University, USA
Hong Cheng, City University of Hong
Kong, China
James Cheng, Nanyang Technological
University, Singapore
Reynold Cheng, University of Hong Kong,
China
Tao Cheng, Microsoft Research, USA
Paolo Ciaccia, University of Bologna, Italy
Graham Cormode, AT&T Labs Research,
USA
Bin Cui, Peking University, China
Judith Cushing, The Evergreen State
College, USA
Gautam Das, UT Arlington & QCRI, USA
Sudipto Das, Microsoft Research, USA
Khuzaima Daudjee, University of
Waterloo, Canada
Sabrina De Capitani di Vimercati,
Universita' degli Studi di Milano, Italy
Alex Delis, University of Athens, Greece
Josep Domingo-Ferrer, Universitat Rovira
i Virgili, Spain
Sameh Elnikety, Microsoft Research, USA
Ling Feng, Tsinghua University, China
Peter Fischer, University of Freiburg,
Germany
George Fletcher, Eindhoven University of
Technology, The Netherlands
Johann Christoph Freytag, Humboldt-Universität zu Berlin, Germany
Keith Frikken, Miami University, USA
Tingjian Ge, University of Massachusetts
at Lowell, USA
Bugra Gedik, Bilkent University, Turkey
Gabriel Ghinita, University of
Massachusetts at Boston, USA
Amol Ghoting, IBM Research, USA
Aristides Gionis, Yahoo! Research, USA
Lukasz Golab, University of Waterloo,
Canada
Ralf Hartmut Güting, FernUniversität Hagen, Germany
Hakan Hacigumus, NEC Labs, USA
Jiawei Han, University of Illinois at Urbana-Champaign, USA
Jan Hidders, Delft University of
Technology, The Netherlands
Bill Howe, University of Washington, USA
Helen Huang, The University of
Queensland, Australia
Ihab Ilyas, Qatar Computing Research
Institute, Qatar
Raghav Kaushik, Microsoft Research, USA
Bettina Kemme, McGill University, Canada
Martin Kersten, CWI Amsterdam, The
Netherlands
Nick Koudas, University of Toronto,
Canada
Georgia Koutrika, HP Labs, USA
Tim Kraska, University of California
Berkeley, USA
Peer Kröger, LMU Munich, Germany
Harumi Kuno, HP Labs, USA
Ashwin Lall, Denison University, USA
Adam Lee, University of Pittsburgh, USA
Leman Akoglu, Carnegie Mellon University, USA
Chen Li, University of California at Irvine,
USA
Chengkai Li, University of Texas at
Arlington, USA
Feifei Li, University of Utah, USA
Guoliang Li, Tsinghua University, China
Tao Li, Florida International University,
USA
Chengfei Liu, Swinburne University of
Technology, Australia
David Lomet, Microsoft Research, USA
Hua Lu, Aalborg University, Denmark
Qiong Luo, Hong Kong University of
Science and Technology, China
Shuai Ma, Beihang University, China
Nikos Mamoulis, University of Hong Kong,
China
Yannis Manolopoulos, Aristotle University
of Thessaloniki, Greece
Claudia Medeiros, University of Campinas,
Brazil
Sharad Mehrotra, University of California
at Irvine, USA
Alexandra Meliou, University of
Washington, USA
Mohamed Mokbel, University of
Minnesota, USA
Bongki Moon, University of Arizona, USA
Barzan Mozafari, MIT, USA
Arnab Nandi, Ohio State University, USA
Mario Nascimento, University of Alberta,
Canada
Thomas Neumann, Technische Universität München, Germany
Raymond Ng, University of British
Columbia, Canada
Alexandros Ntoulas, UCLA, USA
Mitsunori Ogihara, University of Miami,
USA
Dan Olteanu, Oxford University, UK
Carlos Ordonez, University of Houston,
USA
Ippokratis Pandis, IBM Research, USA
Spiros Papadimitriou, Google, USA
Olga Papaemmanouil, Brandeis University,
USA
Stefano Paraboschi, Universita' degli Studi
di Bergamo, Italy
Marta Patiño-Martínez, Universidad
Politécnica de Madrid, Spain
Peter Pietzuch, Imperial College London,
UK
Evaggelia Pitoura, University of Ioannina,
Greece
Rachel Pottinger, University of British
Columbia, Canada
Lu Qin, Chinese University of Hong Kong,
China
Venkatesh Raghavan, Greenplum/EMC,
USA
Jorge-Arnulfo Quiané-Ruiz, University of
Saarland, Germany
Christopher Re, University of Wisconsin,
USA
Matthias Renz, Ludwig-Maximilians
University Munich, Germany
Florin Rusu, University of California,
Merced, USA
Kai-Uwe Sattler, Ilmenau University of
Technology, Germany
Venu Satuluri, Twitter, USA
Thomas Seidl, RWTH Aachen University,
Germany
Sudipta Sengupta, Microsoft Research,
USA
Mohamed Sharaf, The University of
Queensland, Australia
Jialie Shen, Singapore Management
University, Singapore
Yasin Silva, Arizona State University, USA
Manas Somaiya, eBay Inc, USA
Julia Stoyanovich, University of
Pennsylvania, USA
Kian-Lee Tan, National University of
Singapore, Singapore
Nan Tang, Qatar Computing Research
Institute, Qatar
Yufei Tao, Chinese University of Hong
Kong, China
Arash Termehchy, University of Illinois at
Urbana-Champaign, USA
Evimaria Terzi, Boston University, USA
Jens Teubner, ETH Zurich, Switzerland
Hanghang Tong, IBM Research, USA
Vincent Tseng, National Cheng Kung
University, Taiwan
Kostas Tzoumas, Technical University of
Berlin, Germany
Marcos Vaz Salles, University of
Copenhagen, Denmark
Akrivi Vlachou, Athens University of
Economics and Business, Greece
Jianyong Wang, Tsinghua University, China
Wei Wang, University of North Carolina,
USA
Yuqing Melanie Wu, Indiana University,
USA
Hui Xiong, Rutgers University, USA
Fei Xu, Microsoft Search, USA
Bin Yang, Aarhus University, Denmark
Yin Yang, Advanced Digital Sciences
Center, USA
Mi-Yen Yeh, Academia Sinica, Taiwan
Man Lung Yiu, Hong Kong Polytechnic
University, China
Hwanjo Yu, Pohang University of Science
and Technology, Korea
Demetris Zeinalipour-Yazti, University of
Cyprus, Cyprus
Rui Zhang, The University of Melbourne,
Australia
Wenjie Zhang, The University of New
South Wales, Australia
Ying Zhang, The University of New
South Wales, Australia
Peixiang Zhao, University of Illinois at
Urbana-Champaign, USA
Zhi-Hua Zhou, Nanjing University, China
Feida Zhu, Singapore Management
University, Singapore
Industry Program Chairs
Sang Cha, Seoul National University, Korea
Haixun Wang, Microsoft Research Asia, China
Industry Program Committee Members
Athman Bouguettaya, RMIT, Australia
Brian Cooper, Google, USA
Carsten Binnig, DHBW Mannheim, Germany
Changkyu Kim, Intel, USA
Christof Bornhoevd, SAP Labs Palo Alto, USA
Fabian Suchanek, Max Planck Institute for Informatics, Germany
Russell Sears, Microsoft Research, USA
Sameh Elnikety, Microsoft Research, USA
Vincent Tseng, National Cheng Kung University, Taiwan
Xing Xie, Microsoft Research Asia, China
Yan Huang, University of North Texas, USA
Yanghua Xiao, Fudan University, China
Demo Program Chairs
Yoshiharu Ishikawa, Nagoya University, Japan
Rui Zhang, The University of Melbourne, Australia
Yanchun Zhang, Victoria University, Australia
Demo Program Committee Members
Sourav S. Bhowmick, Nanyang Technological University, Singapore
Christian Böhm, University of Munich, Germany
Malu Castellanos, HP Labs, USA
Wojciech Cellary, Poznan University of Economics, Poland
Jidong Chen, EMC Research China Lab, China
Reynold Cheng, University of Hong Kong, China
Gao Cong, Nanyang Technological University, Singapore
Elena Ferrari, University of Insubria, Italy
Jing He, Victoria University, Australia
Zhen He, La Trobe University, Australia
Mizuho Iwaihara, Waseda University, Japan
Sun Kim, Seoul National University, Korea
Christian Konig, Microsoft Research, USA
Dan Lin, Missouri University of Science and Technology, USA
Eric Lo, Hong Kong Polytechnic University, China
Weiyi Meng, State University of New York at Binghamton, USA
Xiaofeng Meng, Renmin University of China, China
Jun Miyazaki, Nara Institute of Science and Technology, Japan
Kyriakos Mouratidis, Singapore Management University, Singapore
Emmanuel Müller, Karlsruhe Institute of Technology, Germany
Timos Sellis, National Technical University of Athens, Greece
David Taniar, Monash University, Australia
Hua Wang, The University of Southern Queensland, Australia
Wei Wang, The University of New South Wales, Australia
Lexing Xie, Australian National University, Australia
Xiaohui Yu, York University, Canada
Zhenjie Zhang, Advanced Digital Sciences Center, USA
visitbrisbane.com.au
EVENT HIGHLIGHTS
Brisbane: play on

APRIL
Until 14 Apr: The 7th Asia Pacific Triennial of Contemporary Art (APT7), QAGOMA
From 6 Apr: Brisbane Lions Season 2013, The Gabba
Until Jul: Queensland Reds Season, Suncorp Stadium
Until Sep: Brisbane Broncos Season 2013, Suncorp Stadium

MAY
8 – 19 May: Brisbane Racing Carnival, Doomben & Eagle Farm Racecourses
11 May – 8 Jun: Anywhere Theatre Festival, in homes, parks, shops…anywhere
30 May – 9 Jun: Bolshoi Ballet, Queensland Performing Arts Centre

JUNE
8 & 22 Jun: British & Irish Lions, Suncorp Stadium
16 Jun: City2South, Brisbane City & South Bank
26 Jun: State of Origin, Suncorp Stadium

JULY
From 6 Jul: War Horse, Queensland Performing Arts Centre

AUGUST
4 Aug: Brisbane Marathon Festival, Brisbane City

MORE TO EXPLORE
Sophisticated and sporty, haute and hot, Brisbane packs a lot into a short stay.
From riverside dining and farmers markets, to free laughs and a world-class
line-up of events, Brisbane has you covered.

GREETERS
Find out what makes Brisbane tick on a daily Brisbane Greeters walking tour led
by passionate and in-the-know locals. Leaving daily at 10am from the Visitor
Information Centre, Queen Street Mall.

LIVE MUSIC
The live music scene in Brisbane is well and truly aLIVE! Check out The Tivoli,
Black Bear Lodge and Ric’s Bar in Fortitude Valley or The Hi-Fi in West End for
live performances.

FARMERS MARKETS
Jan Power’s Farmers Market is a colourful, bustling, open-air market selling
fresh farm produce, flowers, breads, meat, fish, poultry, plants and organics.
Every Wednesday in Redacliff Place, The City.

FUN FOR FREE
For free tunes and laughs, the Brisbane Powerhouse is your destination. Every
Saturday and Sunday be entertained by comedians and musicians at Saturday
Sessions and Livewired.

BITES BY THE WATER
Fresh breezes and sun-kissed water make very pleasant companions at some of the
city’s best riverside restaurants. With interiors by Brisbane’s internationally
regarded Anna Spiro, Mr & Mrs G Riverbar on Eagle Street Pier offers stunning
views on the inside and out. It’s the latest place to go for cocktails, tapas and
panoramic views of the Brisbane River and Story Bridge. Across the river at South
Bank, the newly opened River Quay precinct includes hot new contenders
Stokehouse, Popolo, The Jetty and Cove Bar.

Image credits: Bolshoi Ballet: Le Corsaire © Damir Yusupov. APT7: MadeIn Company /
Spread 201009103 (detail) 2010 / Image courtesy: The artists.
Information correct at time of printing.
facebook.com/visitbrisbane

THANK YOU TO OUR PATRONS AND SUPPORTERS!
(Diamond, Platinum, Gold, Silver and Bronze sponsor logos appear in the printed program.)