Download The Case for Predictive Database Systems: Opportunities

The Case for Predictive Database Systems: Opportunities and Challenges Mert Akdere, Ugur Cetintemel, Matteo Riondato, Eli Upfal, Stan Zdonik Brown University .:CIDR 2011:. Data-driven Predictive Analytics  Essentials: Data-driven Model-based Prediction for Regression, Classification, Clustering, and Association Historical & Current data Statistical Models (Bayesian Nets, Decision Trees) Future, missing data values, events, trends  Widespread apps: monitoring, planning and decision support  User-facing apps: financial analysis, customer decision support, inventory management, …  System-facing apps: optimization, tuning, self management… 1 Support for Predictive Applications: Off-Database Design Iterative Learning Process Database System Read Data Train Sampling Test  Database used as backend storage  Off-database prediction with Matlab, R, other machine learning libraries or custom code 2 Select Feature(s) Build Model Type1 Build Model Type2 Apply Model Build Model Type3 Accuracy feedback Output Model Off-Database Design: Problems Iterative Learning Process Database System Read Data Train Sampling  Too much data movement!  Duplication of database functionality required for efficiency  Usability issues: • Custom, hard-to-manage prediction logic 3 Select Feature(s) Test Build Model Type1 Build Model Type2 Apply Model Build Model Type3 Accuracy feedback Output Model Predictive Database Systems  In-Database Prediction Systemfacing apps Prediction Unit Declarative Prediction Queries Predictive Database System  Declarative Prediction Queries  Models as First-class Citizens  Model Management 4 User-facing Applications Predictive Database Architecture Data Management Database Catalog Statistics Data Indices / Views  Database Components (Prediction) Query Optimizer Executor Model Management Model Catalog Model Instances  Prediction Components  Data and Model management in similar ways  Data independence and model independence 5 Prediction Statistics Relational DBMSs and Predictive DBMSs 6 RDBMS PDBMS Physical design and specialized data structures Facilitate model building and predictive query execution Pre-computation and materialization Pre-building models for adhoc and interactive queries Profiling and modeling Cost/accuracy characteristics of models for optimization Query optimization Alternate ways of model building and prediction Longview  Predictive database project @Brown  Prototype implementation on PostgreSQL  Built on database extensibility mechanisms (e.g., views, triggers) for portability  Offers standard and well-tested predictive functionality  Longview does not compete with custom/optimized prediction methods  Prediction as a “commodity” service, e.g., Google prediction API 7 Use-Case: Customer Analysis  Given status for a subset of the customers, build a model for predicting the status of all customers.  customer status = ‘preferred’ or ‘regular’ Database: Customer(cid, age, city) Orders(oid, cid, total) Training Data: CustomerStatus(cid, status) Build Prediction Model(s): Features age, city, total 8 Model Target SVM, status Linear Regression,… Customer Analysis w/ Declarative Prediction Queries Predictor schema (p_schema) age, city, total  status Status predictor definition and training data specification Prediction with status predictor for new data 9 Create P_schema StatusSchema ( age int, city text, total int, TARGET status text) Create Predictor StatusPredictor On StatusSchema With Data select age, city, sum(total), status from customer c, orders o, customerstatus cs where c.cid = o.cid and c.cid = cs.cid group by c.cid, name, city, status Select c.cid, StatusPredictor(age, city, total) from customer c, (select cid, sum(total) as total from Orders group by cid) as o where c.cid = o.cid Customer(cid, age, city), Orders(oid, cid, total), CustomerStatus(cid, status) Model Building and Maintenance  Prediction models integrated as black-box functions with Prediction Model API  Automatically build, materialize and maintain models for predictors  Model management modes:  Accuracy or time thresholds  Continuous in the background 10 Model Space  The space of models searched by Longview in model building  Model Space Dimensions:  Model selection  Feature selection  Data selection  Sampling  Partitioning 11 Model Search 12 Task Description Opportunities Model Selection Choosing the model type: SVMs, Bayesian Networks, … • Profiling and Modeling for computation/memory and I/O requirements of prediction models • Parallel model building & testing Feature Selection Choosing the model • Specialized data structures, access attributes methods and data organizations for efficient access to feature subsets, e.g., columnar storage, multidimensional indices Data Selection Choosing the model building data w/ Sampling, Partitioning • Efficient disk organizations for (sequential) sampling, and efficient access and storage methods for data partitions Model Composition  Model Ensembles  Combine/Select results of many models based on decision rules weighted vote model1 For Computational Efficiency, Improved Accuracy and Extensibility  Model Sequencing  Two or models are combined where the output of one model is used as input in another model. For Extensibility 13 model1 model2 model2 model3 model3 Scaling Up w/ Model Composition  Brute-force approach:  Exact (or approximate) distributed implementations of prediction methods, e.g., Mahout  Using ensembles:  Sampling  Ensembles of models on memory-size data samples  Partitioning  Separate models for each data partition 14 Case Study: Query Performance Prediction (QPP)  Predict query execution times with pre-runtime features  Useful for workload scheduling, resource allocation  Database systems are complex with many interacting components  Hard to achieve an accurate and portable analytical model 15 QPP w/ Model Composition  Prediction on TPC-H queries  10GB DB, workload of 500 queries  Plan-level prediction  10% avg. error on known templates  232% avg. error on an unforeseen query template  Operator-level prediction  56% avg. error on unforeseen templates  Hybrid prediction combines operator-level and plan-level  38% avg. error on templates 16 Related Work  Declarative Forecasting Queries. Duan and Babu. Processing forecasting queries. VLDB 2007  MauveDB. Querying model-based views. Deshpande and Madden. MauveDB: supporting model-based user views in database systems. SIGMOD 2006  Data Mining Services in DBMSs. Data mining with stored procedures. Microsoft SQL Server Data Mining IBM InfoShpere Oracle Data Mining 17 Conclusion  Predictive DBMSs to enable efficient declarative predictive analytics  Both user and system facing apps benefit  Model management as the key technology  Similar to data management  Predictive models as first class citizens  Automated model building and use in predictive queries  Cost awareness  Longview aims to build a full-fledged PDBMS 18 Thanks!!! Questions ? 19

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download The Case for Predictive Database Systems: Opportunities