Models and Sensor Networks

Probabilistic Databases
Amol Deshpande, University of Maryland

Overview
- V.S. Subrahmanian: ProbView, PXML, Temporal Probabilistic Databases, Probabilistic Aggregates
- Lise Getoor: Statistical Relational Learning, Probabilistic Relational Models, Entity Resolution
- Amol: MauveDB: Statistical Modeling in Databases; Correlated tuples in probabilistic databases

Overview of Today’s Presentation
- Model-based Views/MauveDB [Amol]
  - Goal: making it easy to continuously apply statistical models to streaming data
  - Current focus on designing declarative interfaces, and on efficient maintenance algorithms
  - Less on the “probabilistic databases” issues
- Statistical Relational Learning [Lise]
- Representing arbitrarily correlated data and processing queries over it [Prithviraj]

Motivation
- Unprecedented, and rapidly increasing, instrumentation of our every-day world
  - Wireless sensor networks, distributed measurement networks (e.g. GPS), RFID, industrial monitoring
- Huge data volumes generated continuously that must be processed in real time
- Typically imprecise, unreliable and incomplete data
  - Measurement noise, low success rates, failures etc.

Data Processing Step 1
- Process data using a statistical/probabilistic model
  - Regression and interpolation models
    - To eliminate spatial or temporal biases, handle missing data, prediction
  - Filtering techniques (e.g. Kalman Filters), Bayesian Networks
    - To eliminate measurement noise, to infer hidden variables etc.
- Examples: temperature monitoring (regression/interpolation models); GPS data (Kalman Filters etc.)

A Motivating Example
- Inferring “transportation mode”/“activities” [Henry Kautz et al]
- Using easily obtainable sensor data, e.g.
  GPS, RFID proximity data
- Can do much if we can infer these automatically
- Have access to noisy “GPS” data (figure labels: home, office)
- Infer the transportation mode: walking, running, in a car, in a bus
- Preferred end result: clean path annotated with transportation mode

Dynamic Bayesian Network
- Use a “generative model” for describing how the observations were generated
- Variables at each time step (Time = t, Time = t+1):
  - Mt, Mt+1: transportation mode (Walking, Running, Car, Bus)
  - Xt, Xt+1: true velocity and location
  - Ot, Ot+1: observed location
- Need conditional probability distributions, e.g. a distribution on (velocity, location) given the transportation mode
  - Prior knowledge or learned from data
- Given a sequence of observations (Ot), find the most likely Mt’s that explain it
  - Or could provide a probability distribution on the possible Mt’s

Statistical Modeling of Sensor Data
- No support in database systems --> database ends up being used as a backing store
  - With much replication of functionality
  - Very inefficient, not declarative…
- How can we push statistical modeling inside a database system?
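
The inference task stated on the Dynamic Bayesian Network slides (given the observations, find the most likely sequence of modes) can be illustrated with a small sketch. This is not the model from the talk: it assumes a simplified hidden Markov model in which the hidden variable is only the mode Mt and the observation is a scalar speed derived from consecutive GPS fixes, skipping the intermediate variable Xt. All parameter values (PRIOR, STAY, SPEED) and the function names are hypothetical and chosen only for illustration.

```python
import math

MODES = ["walking", "running", "car", "bus"]

# Hypothetical parameters, not taken from the talk.
PRIOR = {m: 1.0 / len(MODES) for m in MODES}   # uniform prior over the first mode
STAY = 0.9                                     # P(mode unchanged between time steps)
SPEED = {                                      # (mean, std dev) of speed in m/s per mode
    "walking": (1.4, 0.5),
    "running": (3.5, 1.0),
    "car": (15.0, 5.0),
    "bus": (9.0, 3.0),
}


def log_transition(prev, cur):
    """Log-probability of moving from mode prev to mode cur."""
    p = STAY if prev == cur else (1.0 - STAY) / (len(MODES) - 1)
    return math.log(p)


def log_emission(speed, mode):
    """Log-density of an observed speed under a Gaussian for the given mode."""
    mean, std = SPEED[mode]
    z = (speed - mean) / std
    return -0.5 * z * z - math.log(std * math.sqrt(2.0 * math.pi))


def viterbi(speeds):
    """Return the most likely mode sequence for a list of observed speeds."""
    if not speeds:
        return []
    # score[m] = best log-probability of any mode sequence ending in m
    score = {m: math.log(PRIOR[m]) + log_emission(speeds[0], m) for m in MODES}
    backpointers = []
    for speed in speeds[1:]:
        new_score, pointers = {}, {}
        for cur in MODES:
            best_prev = max(MODES, key=lambda p: score[p] + log_transition(p, cur))
            pointers[cur] = best_prev
            new_score[cur] = (score[best_prev] + log_transition(best_prev, cur)
                              + log_emission(speed, cur))
        score = new_score
        backpointers.append(pointers)
    # Walk the backpointers from the best final mode to recover the path.
    path = [max(MODES, key=lambda m: score[m])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path
```

With these made-up parameters, `viterbi([1.2, 1.5, 1.3, 14.0, 16.0, 15.5])` labels the first three steps as walking and the last three as car. Producing a probability distribution over the modes instead, as the slide also mentions, would replace the max over previous modes with a sum (the forward algorithm).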
Abstraction: Model-based Views
- An abstraction analogous to traditional database views
- Present the output of the application of the model as a database view
  - That the user can query as with normal database views

Example DBN View
Original noisy GPS data:

User  Time    Location
John  5pm     (x1, y1)
John  5:05pm  (x2, y2)

User view of the data (smoothed locations, inferred variables):

User  Time    Location    Mode     prob
John  5pm     (x’1, y’1)  Walking  0.9
John  5pm     (x’1, y’1)  Car      0.1
John  5:05pm  (x’2, y’2)  Walking  0
John  5:05pm  (x’2, y’2)  Car      1

- The user queries the view, e.g. select count(*) group by mode sliding window 5 minutes
- Application of the model/inference is pushed inside the database
- Opens up many optimization opportunities, e.g. can do inference lazily when queried etc.

Correlations
- Strong and complex correlations across the tuples of the view
  - Mutual exclusivity
  - Temporal correlations

MauveDB: Status
- Written in the Apache Derby Java open source database system
- Support for regression- and interpolation-based views
  - Neither produces probabilistic data
  - SIGMOD 2006 (w/ Sam Madden)
- Currently building support for views based on Dynamic Bayesian networks [Bhargav]
  - Kalman Filters, HMMs etc.
  - Initial focus on the user interfaces and efficient inference
  - Will generate probabilistic data; may not be able to do anything too sophisticated with it

Research Challenges/Future Work
- Generalizing to arbitrary models?
  - Develop APIs for adding arbitrary models
  - Try to minimize the work of the model developer
- Probabilistic databases
  - Uncertain data with complex correlation patterns
  - Query processing, query optimization
- View maintenance in presence of high-rate measurement streams

Thanks!
Mauve == Model-based User Views
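
As a rough illustration of querying the example DBN view, the sketch below evaluates a count per mode over the four view tuples shown in the talk. The talk does not say what semantics such a count has over probabilistic tuples; the sketch assumes an expected-value reading, where each tuple contributes its probability to its mode. By linearity of expectation this holds even when tuples are correlated. The function names are hypothetical, the location column is omitted, and the sliding window is ignored.

```python
from collections import defaultdict

# The four tuples of the example DBN view: (user, time, mode, prob).
VIEW = [
    ("John", "5pm", "Walking", 0.9),
    ("John", "5pm", "Car", 0.1),
    ("John", "5:05pm", "Walking", 0.0),
    ("John", "5:05pm", "Car", 1.0),
]


def expected_count_by_mode(rows):
    """Expected number of (user, time) readings per mode (assumed semantics)."""
    totals = defaultdict(float)
    for _user, _time, mode, prob in rows:
        totals[mode] += prob
    return dict(totals)


def alternatives_sum_to_one(rows, tol=1e-9):
    """Check mutual exclusivity: per (user, time), mode probabilities sum to 1."""
    sums = defaultdict(float)
    for user, time, _mode, prob in rows:
        sums[(user, time)] += prob
    return all(abs(total - 1.0) <= tol for total in sums.values())


if __name__ == "__main__":
    print(expected_count_by_mode(VIEW))
    print(alternatives_sum_to_one(VIEW))
```

Anything beyond expectations, such as the probability that the walking count is exactly 1, would depend on the mutual-exclusivity and temporal correlations listed on the Correlations slide, which this sketch does not model.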
