Master Thesis Presentation [IT4BI]
Coupling Databases and Advanced Analytical Tools (R)

Supervisor: Prof Alberto Abelló (PhD), FIB-UPC (BarcelonaTech), [email protected]
Student: Sedem Seakomo, FIB-UPC (BarcelonaTech), saviour.sedem.kofi.seakomo@est.fib.upc.edu

Outline
- Introduction
- The Problem
- State of the art (Existing systems review)
- Methodology
- Empirical Work
- The Results
- Conclusions

Introduction
- Background & Motivation
- Research Questions
- Scope (Delimitations & Limitations)
- Importance & Contribution
- Related Work

Introduction
- SQL/relational DBMSs are powerful systems!
  - Managing, querying, and aggregating data
- But what about complex analytics? Not really!
  - Inferences, predictions, subtle relationships in data
- In spite of this, organizations still house large amounts of data in various SQL/RDBMSs
- So what do we do for complex analytics?

Introduction
Objective:
- Examine the level of development of R+DBMS integration
- Assess the performance, scalability and completeness of R+DBMS integration
Motivation:
- New industry (analytics industry): development on the analytics front; gleaning information and insights from data is now an industry in itself
- Data mining (complex analytics): increasing relevance of data mining (to be driven by complex analytics) for revealing valuable insights from data

Research Questions
- What is the current level of development (completeness) of the integration of R with DBMSs?
- How does the performance of coupling databases with an advanced analytical tool (R) compare to that of the stand-alone analytical tool (R)?
- How does the scalability of coupling databases with an advanced analytical tool (R) compare to that of the stand-alone analytical tool (R)?
- What are the inherent implications of R integration architectures that impact performance?
- Are there any lessons to be learnt on the way forward?

Scope (delimitations & limitations)
- Focused on benchmarking the performance, scalability and completeness of selected DBMS+R integrations
- Benchmarks cover mainly the matrix operations that form the core of advanced analytics
- Benchmarking of intra-command parallelism was not covered
- Focused on coupling R with RDBMSs (Oracle, Postgres, DB2 and SQL Server); non-relational (NoSQL) databases were not covered
- Focused on coupling R directly at the data layer (not at the analytics layer and/or presentation layer)

Introduction
Contributions:
- Better performance is achievable by coupling databases with advanced analytical tools (R)
- Such an approach is recommended for complex analytics involving significant amounts of data
- Architectures in which more analytic functions have equivalent native SQL counterparts executable in-database produce the best performance results
- Caveat: data used in the analytic process must be efficiently retrieved and passed to the analytic functions; otherwise performance worsens

Introduction
Related work:
- Analytics and databases:
  - Database Analytics Acceleration using FPGAs [10]: for evaluating expensive analytics queries while saving CPU resources
  - The MADlib Analytics Library or MAD skills, the SQL [11]: introduces an open-source library of in-database analytic methods, with SQL-based algorithms for machine learning, data mining and statistics running inside the database engine
  - Towards a Unified Architecture for in-RDBMS Analytics [12]: presents a unified architecture for in-RDBMS analytics with emphasis on faster implementation of new statistical techniques in the RDBMS
- Performance benchmark studies w.r.t. R:
  - By Philippe Grosjean [3], Stefan Steinhaus [13], Donald Knuth [14]
  - Centered on comparing the performance of versions of R implementations, of R with and without some packages, and of R against other analytical tools
- But our work: a performance study of R+DBMS vs. stand-alone R

The Problem
- Advanced analytical tools
- Database Management Systems
- Bringing the “two worlds” together
- Thesis statement (Hypothesis) declaration

Advanced Analytical Tools
- Inclined towards linear algebra
- Up-side: statistical software provides rich and very advanced analytical functionality for data analysis and modelling
- Down-side: can handle only limited amounts of data. Example: some packages (base R and IBM SPSS) operate entirely in main memory

Database Management Systems
- Founded on relational algebra (RDBMS)
- Up-side: DBMSs can store and process large amounts of data
- Down-side: but they provide insufficient analytical functionality
- SQL simulations of linear algebra operations:
  - will often result in abysmal I/O and CPU performance
  - are knotty for linear algebra operations with iterations
  - are hard to fathom and make code maintenance expensive

Bringing the “two worlds” together
- We have a case at hand! So, how do we bridge the gap?
- Database Management Systems (relational algebra)
- Advanced Statistical Packages (linear algebra)

Bringing the “two worlds” together
- Solution: synergy!
- Employ extended RDBMS features to power the embedded/integrated/coupled execution of R.
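
To make the down-side of SQL simulations of linear algebra concrete, here is a minimal sketch that is not taken from the thesis. It expresses a matrix product as a relational join plus aggregation and checks it against a direct computation. Python's built-in sqlite3 module stands in for the RDBMS; the table names a and b, the (row, column, value) layout, and the helper names matmul_sql and matmul_direct are all assumptions made for illustration.

```python
import sqlite3


def matmul_sql(a, b):
    """Multiply two dense matrices using only relational operations."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical layout: one tuple per matrix cell (row index, column index, value).
    conn.execute("CREATE TABLE a (i INTEGER, k INTEGER, val REAL)")
    conn.execute("CREATE TABLE b (k INTEGER, j INTEGER, val REAL)")
    conn.executemany(
        "INSERT INTO a VALUES (?, ?, ?)",
        [(i, k, v) for i, row in enumerate(a) for k, v in enumerate(row)],
    )
    conn.executemany(
        "INSERT INTO b VALUES (?, ?, ?)",
        [(k, j, v) for k, row in enumerate(b) for j, v in enumerate(row)],
    )
    # The product C[i][j] = sum over k of A[i][k] * B[k][j] becomes join + group by.
    rows = conn.execute(
        "SELECT a.i, b.j, SUM(a.val * b.val) "
        "FROM a JOIN b ON a.k = b.k "
        "GROUP BY a.i, b.j"
    ).fetchall()
    conn.close()
    result = [[0.0] * len(b[0]) for _ in range(len(a))]
    for i, j, val in rows:
        result[i][j] = val
    return result


def matmul_direct(a, b):
    """Reference implementation: plain row-by-column products."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
sql_result = matmul_sql(A, B)
direct_result = matmul_direct(A, B)
print(sql_result)
print(direct_result)
```

Even for this 2 x 2 case the relational form needs one tuple per matrix cell plus a join and a grouping step, which gives a feel for why such simulations can become costly on large matrices.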

Bringing the “two worlds” together
Has the following advantages:
- Averts the performance problems associated with abusing SQL (relational algebra ops) for advanced analytics (linear algebra ops)
- Synergizes the robust data management capabilities of DBMSs with the rich statistical functionality of analytical tools
- Brings the benefits (performance + security) of taking algorithms (processing logic) to the data rather than data to the algorithms

Thesis Statement (Hypothesis)
Coupling databases and advanced analytical tools (R) leads to better analytic performance than stand-alone advanced analytical tools (R)

State of the art (Existing systems review)
- Advanced analytical tool R
- Different DBMS architecture of R implementation
- Choice of DBMS for empirical study

Integration with R
At three layers within the analytic stack:
- Data layer (e.g. Oracle R Enterprise, Sybase RAP, SAP HANA, IBM Netezza)
- Analytics layer (e.g. SAS, IBM SPSS, RStudio, Matlab, Zementis)
- Presentation layer (e.g. Tableau, Jaspersoft BI Software, TIBCO Spotfire's BI Dashboard)

Integration with R at Data Layer
Alternative ways of integrating R with the db:
- Outside-in: R connects to the db using JDBC/ODBC and retrieves (pulls) the data to be analyzed from the db
- Inside-out: data is transferred (pushed) to R from within the database, and the aggregated and/or analyzed result sets are sent back from R to the database
- Embedded: the R environment (components) and/or execution is made an integral part of the core DBMS

Diff DBMS Architecture w.r.t R
Integrations/Architectural Arrangements:

DBMS            | Embedded                                         | Outside-in/Inside-out
Oracle          | YES: ORE Server                                  | YES: ROracle, JDBC
PostgreSQL      | YES: PLR                                         | YES: RPostgres, RODBC
Sybase RAP      | YES: RAP Store-UDF (C, C++)                      | YES: RJDBC
SQL Server      | NO: But CLR, Ext Proc                            | YES: RODBC
DB2             | NO: But CLR, Ext Proc UDF (C, C++, Java, COBOL)  | YES: RJDBC, RODBC
Cloudera Impala | NO                                               | YES: ODBC, JDBC
SAP HANA        | NO                                               | YES: RODBC, RJDBC, RHANA

Methodology
- Research Approach
- Research Design
- Data Used

Methodology
Research approach:
- Quantitative research (experimental) approach
- Need to collect numeric performance data
- Carry out various kinds of numeric-based analyses
Research design:
- Adopted and adapted R Benchmark 2.5 [3] and the Revolution RevoR Enterprise Benchmark [2]
- Tests designed for stand-alone R and for R + Oracle, PostgreSQL, DB2, SQL Server
- Tests: 3 categories of performance tests: Matrix Calculation, Matrix Functions and Program Control

Data Used
- Input data generated in R; the data set consists of a two-dimensional array of floating-point numbers
- 1,000 observations (cols) by 16,000 variables (rows)
- Layout of matrix X: columns obs1, obs2, ..., obs1000; rows var1, var2, var3, ..., var16000
- Used a stochastic process, Brownian motion:

  \[ X_i = \sum_{n=1}^{i} (Y_n) \cdot \left( \frac{i}{k} \right) \]

  Where:
  - X_i (the series) stands in for the Brownian motion
  - Y_n is a sequence of k normally distributed elements

Empirical Work
- Benchmark tests
- Experimental design
- Measurements & controls

Experimental Setup: R+SQL Server
Traditional (RODBC) integration:
- Installed SQL Server 2012 (64-bit)
- Installed open-source R 2.13.2 (64-bit client)
- Installed the RODBC package from RGui
Common Language Runtime (CLR) integration:
- CLR stored procedures are .NET objects that run in db memory
- Created the usual R script files
- Developed a C# CLR procedure with embedded R; compiled it to get a DLL
- Enabled the CLR integration feature of SQL Server
- Created an assembly from the DLL, and stored procedures (SPs) that reference the assembly
- Ran the stored procedures with the R script files as input

The Results
- MC, MF, PC, Overall benchmarks
- Implications of the results and findings
- Which integration architecture works well?
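
The results in this deck are reported as run-times, summarised both as averages and as minimums over repeated runs. A minimal sketch of a timing harness in that spirit follows. It is not the thesis's benchmark code: the function names time_runs, summarise and cross_product, the choice of 7 runs, and the small stand-in workload are assumptions for illustration only.

```python
import statistics
import time


def time_runs(task, runs=7):
    """Run `task` repeatedly and return the wall-clock seconds of each run."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        task()
        timings.append(time.perf_counter() - start)
    return timings


def summarise(timings):
    """Report the average and the minimum run-time over all runs."""
    return {
        "runs": len(timings),
        "mean": statistics.mean(timings),
        "min": min(timings),
    }


def cross_product(n=60):
    """Hypothetical stand-in workload: cross-product of an n x n matrix."""
    m = [[(i * j) % 7 + 1.0 for j in range(n)] for i in range(n)]
    cols = list(zip(*m))
    return [[sum(x * y for x, y in zip(ci, cj)) for cj in cols] for ci in cols]


summary = summarise(time_runs(cross_product))
print(summary)
```

Reporting the minimum next to the mean is a simple way to see whether background noise on the machine distorted any of the runs.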

Empirical Results
Average overall benchmark results:
[Figure: "OVERALL Performance". Bars per System/+R (Stand-Alone R, PostgreSQL, Oracle, DB2, SQL Server); x-axis: Run-time (normalised), 0 to 180.]

Empirical Results
Matrix Calculation (MC) benchmark results:
[Figure: "Matrix Calculation Performance (MC)". Bars per System/+R (Stand-Alone R, PostgreSQL, Oracle, DB2, SQL Server); x-axis: Run-time (normalised), 0 to 60.]

Empirical Results
Matrix Function (MF) benchmark results:
[Figure: "Matrix Function Performance (MF)". Bars per System/+R (Stand-Alone R, PostgreSQL, Oracle, DB2, SQL Server); x-axis: Run-time (normalised), 0 to 90.]

Empirical Results
Program Control (PC) benchmark results:

Benchmark | Stand-Alone R | PostgreSQL | Oracle | DB2  | SQL Server
PC01      | 2.71          | 2.78       | 2.77   | 2.70 | 2.77
PC02      | 0.30          | 0.40       | 0.31   | 0.28 | 0.29
PC03      | 0.63          | 0.60       | 0.43   | 0.65 | 0.66
PC04      | 0.51          | 0.50       | 0.51   | 0.53 | 0.51
PC05      | 0.38          | 0.38       | 0.36   | 0.38 | 0.38

KEY:
- PC01: Fibonacci numbers; ctrl flow
- PC02: Hilbert matrix; ctrl flow
- PC03: gcd2; recursive
- PC04: Toeplitz matrix; looping
- PC05: Escoufier’s method on matrix

Empirical Results
Paired t-test on PC results (PostgreSQL, Oracle):

Statistic                    | Variable 1 (PostgreSQL) | Variable 2 (Oracle)
Mean                         | 0.932                   | 0.876
Variance                     | 1.07492                 | 1.12668
Observations                 | 5                       | 5
Pearson Correlation          | 0.997786692             |
Hypothesized Mean Difference | 0                       |
df                           | 4                       |
t Stat                       | 1.691541861             |
P(T<=t) one-tail             | 0.082996265             |
t Critical one-tail          | 2.131846782             |
P(T<=t) two-tail             | 0.16599253              |
t Critical two-tail          | 2.776445105             |

The mean performance difference (M = 0.06, SD = 0.074, N = 5) was not significantly different from zero, t(4) = 1.69, two-tail p = 0.166, so the test gives no evidence of a considerable difference in the performance of the two DBMSs.
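
The paired t-test figures can be reproduced from the PostgreSQL and Oracle columns of the PC table. The sketch below uses only the Python standard library; the function name paired_t is an assumption, and instead of computing a p-value it compares the statistic with the two-tail critical value quoted in the table.

```python
import math
import statistics

# PC01..PC05 run-times copied from the Program Control table.
postgresql = [2.78, 0.40, 0.60, 0.50, 0.38]
oracle = [2.77, 0.31, 0.43, 0.51, 0.36]


def paired_t(x, y):
    """Return mean difference, SD of differences, t statistic and df."""
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    t_stat = mean_diff / (sd_diff / math.sqrt(n))
    return mean_diff, sd_diff, t_stat, n - 1


mean_diff, sd_diff, t_stat, df = paired_t(postgresql, oracle)

# Two-tail critical value for df = 4 at the 5% level, as quoted in the table.
T_CRITICAL_TWO_TAIL = 2.776445105
significant = abs(t_stat) > T_CRITICAL_TWO_TAIL

print(f"M = {mean_diff:.3f}, SD = {sd_diff:.3f}, t({df}) = {t_stat:.2f}")
print("significant at 5% (two-tail):", significant)
```

Because the statistic stays below the critical value, the null hypothesis of a zero mean difference is not rejected, which matches the reading given on the slide.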

Empirical Results
Scalability benchmark results:
- Oracle shows a slight scalability edge over stand-alone R for small datasets
  [Figure: series dTimes1-4m-r and dTimes1-4mc-ore over benchmarks MC1 to MC5 and MF1 to MF8; y-axis 0 to 30.]
- Stand-alone R is overwhelmed by large datasets; that is when the edge of R+DBMS shows
  [Figure: series dTimes4-16mc-r and dTimes4-16mc-ore over benchmarks MC1 to MC5 and MF1 to MF8; y-axis 0 to 200.]

Empirical Results
Reliability: average vs. minimum results are about the same
- Performance patterns observed remain exactly the same
- No significant variations in actual recorded values
[Figure: "Avg OVERALL Performance" and "Min OVERALL Performance". Bars per System/+R (Stand-Alone R, PostgreSQL, Oracle, DB2, SQL Server); x-axis: Run-time (normalised), 0 to 200.]

Why does R+PostgreSQL perform badly?
- Postgres performed well in tests with less data
- Timing the retrieval of database-resident data as a matrix:

Retrieving DB data | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Run 6 | Run 7 | Total  | Average
Oracle             | 0.15  | 0.14  | 0.14  | 0.14  | 0.14  | 0.15  | 0.14  | 1.00   | 0.14
PostgreSQL         | 20.12 | 19.05 | 19.12 | 19.12 | 19.03 | 19.11 | 19.12 | 134.67 | 19.24

- Direct row fetch (SELECT * FROM stockHist): Oracle (4.51 sec) is 2.66 times faster than PostgreSQL (12.02 sec)
- Implication: the poor performance of PostgreSQL-coupled R is not exclusively a consequence of the implementation but also of the database itself (data retrieval/fetching)

Why does R+Oracle work well?
Architecture of Oracle R Enterprise.
Adapted from [9]
- In-db statistics engine
- Storing of R scripts in-db
- Capability of spawning multiple R engine instances
- Efficient data retrieval and passing (rqTableEval, rqRowEval)

Conclusions
- Lessons learnt
- Future Studies
- Final Words

Implications to Research Questions
Growing level of development of R+DBMS:
- Most capabilities of stand-alone R are obtainable in R+DBMS
Better performance with R+DBMS:
- Provided data is efficiently retrieved and passed
- R is still competitive in less data-intensive analytics
Architectural implications:
- Positioning of the analytic engine w.r.t. the database
- Existence of native SQL equivalents of analytic functions
- Extent of exploitation of db parallelism and db scalability
Lessons, going forward:
- The technique for retrieving and passing data makes a huge impact, regardless of how fast the substantive analytics runs

Future Studies
- Benchmark in-memory, column-oriented and document-oriented dbs
- Study effective and efficient data access by analytic functions
- Compare performance gains: R+RDBMS vs. R+NoSQL dbs
- Maximise the gain from parallelism and scalability of R+DBMSs
- Benchmark on different OSs and varying data amounts
- Benchmark on datasets with varied attribute properties

Conclusions (final words)
In recommending R+DBMS, architectures must:
- Facilitate efficient retrieval and passing of data from the database objects (tables, views, procedures, etc.) to the analytic functions;
- Lessen or eliminate data movement and reduce other run-time overheads;
- Maintain data security (the C.I.A. of data: Confidentiality, Integrity and Availability);
- Reduce development overheads for introducing new and maintaining existing analytics techniques in the DBMSs

References
1. Robert Kabacoff. R in Action: Data analysis and graphics with R. Manning Publications Co., 2011.
2. Revolution Analytics. Revolution RevoR Enterprise Benchmark Details: Benchmark Scripts. URL: http://www.revolutionanalytics.com/revolution-revorenterprise-benchmark-details.
3. Philippe Grosjean. R Benchmark 2.5. 2008. URL: http://r.research.att.com/benchmarks/R-benchmark-25.R.
4. Edwin Grappin. Generate stock option prices - How to simulate a Brownian motion. Blog. 2012. URL: http://probaperception.blogspot.com.es/2012/10/generatestock-option-prices-how-to.html.
5. Mark Hornick and Tim Vlamis. Oracle R Enterprise Hands-on Lab. [Online; accessed 12-March-2014]. ORACLE, 2014. URL: http://www.vlamis.com/storage/papers/Hornick%20-%20ORE%20Hands-On%20Lab.pdf.
6. David Smith. R integrated throughout the enterprise analytics stack. Blog. 2012. URL: http://blog.revolutionanalytics.com/2012/02/r-in-the-enterprise.html/.
7. Elaine Chen. Using R and Tableau. Tech. rep. 2013. Also available as http://www.tableausoftware.com/sites/default/files/media/using-r-and-tableausoftware_0.pdf.
8. CodePlex (Microsoft). R.NET. URL: http://rdotnet.codeplex.com/.
9. Mark Hornick and Tim Vlamis. Oracle R Enterprise Hands-on Lab. [Online; accessed 12-March-2014]. ORACLE, 2014. URL: http://www.vlamis.com/storage/papers/Hornick%20-%20ORE%20Hands-On%20Lab.pdf.
10. Bharat Sukhwani, Hong Min, Mathew Thoennes, Parijat Dube, Balakrishna Iyer, Bernard Brezzo, Donna Dillenberger, and Sameh Asaad. “Database analytics acceleration using FPGAs”. In: Proceedings of the 21st international conference on Parallel architectures and compilation techniques. ACM. 2012, pp. 411–420.
11. Joseph M Hellerstein, Christopher Ré, Florian Schoppmann, Daisy Zhe Wang, Eugene Fratkin, Aleksander Gorajek, Kee Siong Ng, Caleb Welton, Xixuan Feng, Kun Li, and Arun Kumar. “The MADlib analytics library: or MAD skills, the SQL”. In: Proceedings of the VLDB Endowment 5.12 (2012), pp. 1700–1711.
12. Xixuan Feng, Arun Kumar, Benjamin Recht, and Christopher Ré. “Towards a unified architecture for in-RDBMS analytics”. In: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data. ACM. 2012, pp. 325–336.
13. Stefan Steinhaus. Comparison of mathematical programs for data analysis. 1999.
14. Donald Knuth. Comparison of mathematical programs for data analysis. 2008. URL: http://www.scientificweb.com/ncrunch/ncrunch5.pdf.

Thank you