Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
The Case for Predictive Database Systems: Opportunities and Challenges Mert Akdere, Ugur Cetintemel, Matteo Riondato, Eli Upfal, Stan Zdonik Brown University .:CIDR 2011:. Data-driven Predictive Analytics Essentials: Data-driven Model-based Prediction for Regression, Classification, Clustering, and Association Historical & Current data Statistical Models (Bayesian Nets, Decision Trees) Future, missing data values, events, trends Widespread apps: monitoring, planning and decision support User-facing apps: financial analysis, customer decision support, inventory management, … System-facing apps: optimization, tuning, self management… 1 Support for Predictive Applications: Off-Database Design Iterative Learning Process Database System Read Data Train Sampling Test Database used as backend storage Off-database prediction with Matlab, R, other machine learning libraries or custom code 2 Select Feature(s) Build Model Type1 Build Model Type2 Apply Model Build Model Type3 Accuracy feedback Output Model Off-Database Design: Problems Iterative Learning Process Database System Read Data Train Sampling Too much data movement! Duplication of database functionality required for efficiency Usability issues: • Custom, hard-to-manage prediction logic 3 Select Feature(s) Test Build Model Type1 Build Model Type2 Apply Model Build Model Type3 Accuracy feedback Output Model Predictive Database Systems In-Database Prediction Systemfacing apps Prediction Unit Declarative Prediction Queries Predictive Database System Declarative Prediction Queries Models as First-class Citizens Model Management 4 User-facing Applications Predictive Database Architecture Data Management Database Catalog Statistics Data Indices / Views Database Components (Prediction) Query Optimizer Executor Model Management Model Catalog Model Instances Prediction Components Data and Model management in similar ways Data independence and model independence 5 Prediction Statistics Relational DBMSs and Predictive DBMSs 6 RDBMS PDBMS Physical design and specialized data structures Facilitate model building and predictive query execution Pre-computation and materialization Pre-building models for adhoc and interactive queries Profiling and modeling Cost/accuracy characteristics of models for optimization Query optimization Alternate ways of model building and prediction Longview Predictive database project @Brown Prototype implementation on PostgreSQL Built on database extensibility mechanisms (e.g., views, triggers) for portability Offers standard and well-tested predictive functionality Longview does not compete with custom/optimized prediction methods Prediction as a “commodity” service, e.g., Google prediction API 7 Use-Case: Customer Analysis Given status for a subset of the customers, build a model for predicting the status of all customers. customer status = ‘preferred’ or ‘regular’ Database: Customer(cid, age, city) Orders(oid, cid, total) Training Data: CustomerStatus(cid, status) Build Prediction Model(s): Features age, city, total 8 Model Target SVM, status Linear Regression,… Customer Analysis w/ Declarative Prediction Queries Predictor schema (p_schema) age, city, total status Status predictor definition and training data specification Prediction with status predictor for new data 9 Create P_schema StatusSchema ( age int, city text, total int, TARGET status text) Create Predictor StatusPredictor On StatusSchema With Data select age, city, sum(total), status from customer c, orders o, customerstatus cs where c.cid = o.cid and c.cid = cs.cid group by c.cid, name, city, status Select c.cid, StatusPredictor(age, city, total) from customer c, (select cid, sum(total) as total from Orders group by cid) as o where c.cid = o.cid Customer(cid, age, city), Orders(oid, cid, total), CustomerStatus(cid, status) Model Building and Maintenance Prediction models integrated as black-box functions with Prediction Model API Automatically build, materialize and maintain models for predictors Model management modes: Accuracy or time thresholds Continuous in the background 10 Model Space The space of models searched by Longview in model building Model Space Dimensions: Model selection Feature selection Data selection Sampling Partitioning 11 Model Search 12 Task Description Opportunities Model Selection Choosing the model type: SVMs, Bayesian Networks, … • Profiling and Modeling for computation/memory and I/O requirements of prediction models • Parallel model building & testing Feature Selection Choosing the model • Specialized data structures, access attributes methods and data organizations for efficient access to feature subsets, e.g., columnar storage, multidimensional indices Data Selection Choosing the model building data w/ Sampling, Partitioning • Efficient disk organizations for (sequential) sampling, and efficient access and storage methods for data partitions Model Composition Model Ensembles Combine/Select results of many models based on decision rules weighted vote model1 For Computational Efficiency, Improved Accuracy and Extensibility Model Sequencing Two or models are combined where the output of one model is used as input in another model. For Extensibility 13 model1 model2 model2 model3 model3 Scaling Up w/ Model Composition Brute-force approach: Exact (or approximate) distributed implementations of prediction methods, e.g., Mahout Using ensembles: Sampling Ensembles of models on memory-size data samples Partitioning Separate models for each data partition 14 Case Study: Query Performance Prediction (QPP) Predict query execution times with pre-runtime features Useful for workload scheduling, resource allocation Database systems are complex with many interacting components Hard to achieve an accurate and portable analytical model 15 QPP w/ Model Composition Prediction on TPC-H queries 10GB DB, workload of 500 queries Plan-level prediction 10% avg. error on known templates 232% avg. error on an unforeseen query template Operator-level prediction 56% avg. error on unforeseen templates Hybrid prediction combines operator-level and plan-level 38% avg. error on templates 16 Related Work Declarative Forecasting Queries. Duan and Babu. Processing forecasting queries. VLDB 2007 MauveDB. Querying model-based views. Deshpande and Madden. MauveDB: supporting model-based user views in database systems. SIGMOD 2006 Data Mining Services in DBMSs. Data mining with stored procedures. Microsoft SQL Server Data Mining IBM InfoShpere Oracle Data Mining 17 Conclusion Predictive DBMSs to enable efficient declarative predictive analytics Both user and system facing apps benefit Model management as the key technology Similar to data management Predictive models as first class citizens Automated model building and use in predictive queries Cost awareness Longview aims to build a full-fledged PDBMS 18 Thanks!!! Questions ? 19