Master Thesis Presentation [IT4BI]
Coupling Databases and Advanced Analytical Tools (R)
Supervisor:
Prof Alberto Abelló (PhD)
FIB-UPC (BarcelonaTech)
Student:
Sedem Seakomo
FIB-UPC (BarcelonaTech)
saviour.sedem.kofi.seakomo@est.fib.upc.edu
1
Outline
Introduction
The Problem
State of the art (Existing systems review)
Methodology
Empirical Work
The Results
Conclusions
2
Introduction
Introduction
Background & Motivation
Research Questions
Scope (Delimitations & Limitations)
Importance & Contribution
Related Work
The Problem
State of the art (Existing systems review)
Methodology
Empirical Work
The Results
Conclusions
3
Introduction
SQL/relational DBMS are powerful systems!
Managing, querying, and aggregating data
But what about complex analytics? Not really!
Inferences, predictions, subtle relationships in data
In spite of this, organizations still house large
amounts of data in various SQL/relational DBMSs
So what do we do for complex analytics?
4
Introduction
Objective:
Examine the level of development of R+DBMS
integration
Assess the performance, scalability and completeness of
R+DBMS integration
Motivation:
New Industry (Analytics Industry)
Developments on the analytics front: gleaning information and
insights from data is now an industry in itself
Data Mining (Complex Analytics)
Increasing relevance of data mining (to be driven by complex
analytics) for revealing valuable insights from data
5
Research Questions
What is the current level of development (completeness) of
the integration of R with DBMSs?
How does the performance of coupling databases with an advanced
analytical tool (R) compare to that of the stand-alone analytical tool (R)?
How does the scalability of coupling databases with an advanced
analytical tool (R) compare to that of the stand-alone analytical tool (R)?
What are the inherent implications of architectures of R integration
that impact performance?
Are there any lessons to be learnt on the way forward?
6
Scope (delimitations & limitations)
Focused on benchmarking the performance, scalability and
completeness of selected DBMS+R
Benchmarks cover mainly matrix operations, which form
the core of advanced analytics
Benchmarking of intra-command parallelism was not covered
Focused on coupling of R and RDBMS (Oracle, Postgres, DB2
and SQL Server); non-RDBMS or NoSQL databases not covered
Focused on directly coupling R at the data layer (not at the
analytic layer and/or presentation layer)
7
Introduction
Contributions:
Better performance is achievable by coupling
databases with advanced analytical tools (R)
Such approach is recommended for complex
analytics involving significant amount of data
Architectures in which more analytic functions have
equivalent native SQL counterparts executable in-database produce the best performance results
Caveat: data used in the analytic process must be
efficiently retrieved and passed to the analytic
functions, otherwise performance worsens
8
Introduction
Related work:
Analytics and databases:
Database Analytics Acceleration using FPGAs [10]
For evaluating expensive analytics queries while saving CPU resources
The MADlib Analytics Library or MAD skills, the SQL [11]
Introduces an open-source library of SQL-based in-database algorithms for
machine learning, data mining and statistics that run inside the database engine
Towards a Unified Architecture for in-RDBMS Analytics [12]
Presents unified architecture for in-RDBMS analytics with emphasis on faster
implementation of new statistical techniques in RDBMS
Performance benchmark studies w.r.t R:
By Philippe Grosjean[3], Stefan Steinhaus[13], Donald Knuth[14]
Centered on comparing performance of versions of R implementations, R
implementation with and without some packages and R as analytical tool
compared with other analytical tools
But, our work:
The performance study of R+DBMS vs. stand-alone R
9
The Problem
Introduction
The Problem
Advanced analytical tools
Database Management Systems
Bringing the “two worlds” together
Thesis statement (Hypothesis) declaration
State of the art (Existing systems review)
Methodology
Empirical Work
The Results
Conclusions
10
Advanced Analytical Tools
Inclined towards linear algebra
Up-side:
Statistical software provides rich and very advanced
analytical functionality for data analysis and modelling
Down-side:
Can handle only limited amounts of data.
Example: Some packages (base R and IBM SPSS) operate
entirely in main memory
11
Database Management Systems
Founded on relational algebra (RDBMS)
Up-side:
DBMS can store and process large amounts of data
Down-side:
Provide insufficient analytical functionality
SQL simulations of linear algebra operations
often result in abysmal I/O and CPU performance
are knotty for linear algebra operations with iterations
are hard to fathom and make code maintenance expensive
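To make the point about SQL simulations of linear algebra concrete, the following minimal sketch multiplies two small matrices relationally. It is only an illustration under stated assumptions: SQLite (via Python's standard `sqlite3` module) stands in for an RDBMS, and the table layout of (row, column, value) triples and the helper name `sql_matmul` are hypothetical, not taken from the thesis.

```python
import sqlite3

def sql_matmul(a, b):
    """Multiply two matrices (lists of rows) with a relational join and GROUP BY."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical layout: each matrix is stored as (row, col, value) triples.
    conn.execute("CREATE TABLE a (i INTEGER, k INTEGER, v REAL)")
    conn.execute("CREATE TABLE b (k INTEGER, j INTEGER, v REAL)")
    conn.executemany(
        "INSERT INTO a VALUES (?, ?, ?)",
        [(i, k, v) for i, row in enumerate(a) for k, v in enumerate(row)])
    conn.executemany(
        "INSERT INTO b VALUES (?, ?, ?)",
        [(k, j, v) for k, row in enumerate(b) for j, v in enumerate(row)])
    # C[i][j] = sum over k of A[i][k] * B[k][j], expressed as join + aggregate.
    rows = conn.execute(
        "SELECT a.i, b.j, SUM(a.v * b.v) FROM a JOIN b ON a.k = b.k "
        "GROUP BY a.i, b.j ORDER BY a.i, b.j").fetchall()
    conn.close()
    result = [[0.0] * len(b[0]) for _ in a]
    for i, j, v in rows:
        result[i][j] = v
    return result

product = sql_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(product)
```

Even this single operation needs a reshaped storage layout, a join and an aggregation; iterative algorithms would repeat such queries many times, which is the maintenance and performance concern the slide raises.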
12
Bringing the “two worlds” together
We have a case at hand!
So, how do we bridge the gap?
Database Management
Systems (relational algebra)
Advanced Statistical
Packages (linear algebra)
13
Bringing the “two worlds” together
Solution: synergy!
Employ extended RDBMS features to power the
embedded/integrated/coupled execution of R.
14
Bringing the “two worlds” together
Has the following advantages:
Avert performance problems associated with the
abusive use of SQL (relational algebra ops) for
advanced analytics (linear algebra ops)
Synergize robust data management capabilities of
DBMS and rich statistical functionalities of analytical
tools
Benefits (Performance+Security) of taking algorithms
(Processing Logic) to data rather than data to algorithm
15
Thesis Statement (Hypothesis)
Coupling databases and advanced analytical tools (R)
leads to better and enhanced analytic performance
than stand-alone advanced analytical tools (R)
16
State of the art
Introduction
The Problem
State of the art (Existing systems review)
Advanced analytical tool R
Different DBMS architecture of R implementation
Choice of DBMS for empirical study
Methodology
Empirical Work
The Results
Conclusions
17
Integration with R
At three layers within the analytic stack
Data Layer
(e.g. Oracle R Enterprise, Sybase RAP, SAP HANA, IBM Netezza)
Analytics Layer
(e.g. SAS, IBM SPSS, RStudio, Matlab, Zementis)
Presentation Layer
(e.g. Tableau, Jaspersoft BI Software, TIBCO Spotfire's BI Dashboard)
18
Integration with R at Data Layer
Alternative ways of integrating R with db
Outside-in:
R connects to the DB using JDBC/ODBC and retrieves (pulls)
the data to be analyzed from the DB
Inside-out:
Data is transferred (pushed) to R from within the database, and
the aggregated and/or analyzed result sets are sent back
from R to the database
Embedded:
R environment (components) and/or execution is made an
integral part of the core DBMS
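The difference between pulling data out and running the analysis next to the data can be sketched as follows. This is a hedged illustration only: SQLite and Python stand in for the DBMS and R, the table `obs` and the aggregate `py_mean` are hypothetical names, and a user-defined aggregate is merely an approximation of the embedded arrangement (the inside-out variant is not shown).

```python
import sqlite3
import statistics

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (x REAL)")
conn.executemany("INSERT INTO obs VALUES (?)",
                 [(float(v),) for v in range(1, 101)])

# Outside-in: the client pulls every row, then analyses the data locally.
pulled = [row[0] for row in conn.execute("SELECT x FROM obs")]
outside_in_mean = statistics.fmean(pulled)

# Embedded (approximation): the analytic routine is registered with the
# engine, which calls it during the scan; only the result leaves the query.
class MeanAgg:
    def __init__(self):
        self.total = 0.0
        self.count = 0

    def step(self, value):
        self.total += value
        self.count += 1

    def finalize(self):
        return self.total / self.count if self.count else None

conn.create_aggregate("py_mean", 1, MeanAgg)
embedded_mean = conn.execute("SELECT py_mean(x) FROM obs").fetchone()[0]
conn.close()

print(outside_in_mean, embedded_mean)
```

Both paths give the same answer; what differs is how much data crosses the boundary between the database and the analytic tool.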
19
Diff DBMS Architecture w.r.t R
Integrations/Architectural Arrangements
DBMS | Embedded | Outside-in/Inside-out
Oracle | YES: ORE Server | YES: ROracle, JDBC
PostgreSQL | YES: PL/R | YES: RPostgres, RODBC
Sybase RAP | YES: RAP Store - UDF (C, C++) | YES: RJDBC
SQL Server | NO: But CLR, Ext Proc | YES: RODBC
DB2 | NO: But CLR, Ext Proc UDF (C, C++, Java, COBOL) | YES: RJDBC, RODBC
Cloudera Impala | NO | YES: ODBC, JDBC
SAP HANA | NO | YES: RODBC, RJDBC, RHANA
20
Methodology
Introduction
The Problem
State of the art (Existing systems review)
Methodology
Research Approach
Research Design
Data Used
Empirical Work
The Results
Conclusions
21
Methodology
Research Approach
Quantitative research (experimental) approach
Need to collect numeric performance data
Carry out various kinds of numeric-based analyses
Research Design
Adopted and adapted R Benchmark 2.5 [3] and
Revolution RevoR Enterprise Benchmark [2]
Tests designed for stand-alone R and for R coupled with Oracle,
PostgreSQL, DB2 and SQL Server
Tests: 3 categories of performance tests:
Matrix Calculation (MC), Matrix Functions (MF) and Program Control (PC)
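A benchmark of this kind boils down to timing a workload over repeated runs and summarising the timings. The sketch below is a minimal, assumed harness in Python rather than the R scripts the thesis adapted; `time_runs` and `cross_product` are hypothetical names, and the small pure-Python cross-product is only a stand-in for a matrix-calculation test.

```python
import time

def time_runs(func, runs=3):
    """Run func several times; return (average, minimum) elapsed seconds."""
    elapsed = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        elapsed.append(time.perf_counter() - start)
    return sum(elapsed) / len(elapsed), min(elapsed)

def cross_product(n=40):
    """Stand-in matrix-calculation workload: the n x n cross-product A'A."""
    a = [[((i * n + j) % 7) / 7.0 for j in range(n)] for i in range(n)]
    return [[sum(a[k][i] * a[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

avg_s, min_s = time_runs(cross_product)
print(f"average {avg_s:.4f}s, minimum {min_s:.4f}s")
```

Reporting both the average and the minimum makes it possible to check whether the observed ranking of systems is sensitive to run-to-run noise.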
22
Data Used
Input data generated in R; the data set consists of a
two-dimensional array of floating-point numbers
1,000 observations (cols) by 16,000 variables (rows)
Used a stochastic process, Brownian Motion
$$X_i = \sum_{n=1}^{i} (Y_n) \cdot \left(\frac{i}{k}\right)$$
Matrix X layout: columns obs1, obs2, ..., obs1000; rows var1, var2, var3, ..., var16000
Where:
X_i (the series) stands in for the Brownian motion
Y_n is a sequence of k normally distributed elements
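A data set of this shape can be sketched as cumulative sums of normally distributed draws. The code below is an assumption-laden illustration in Python, not the R generator used in the thesis: the function names are hypothetical, each row is assumed to hold one series across the observation columns, and the 1/sqrt(k) scaling is the common random-walk scaling, which may differ from the scaling on the slide.

```python
import math
import random

def brownian_series(k, rng):
    """Return X_1..X_k, where X_i is the scaled sum of the first i normal draws."""
    scale = 1.0 / math.sqrt(k)  # assumed scaling, see the note above
    total = 0.0
    series = []
    for _ in range(k):
        total += rng.gauss(0.0, 1.0)  # Y_n: standard normal element
        series.append(total * scale)
    return series

def brownian_matrix(n_vars, n_obs, seed=42):
    """Build an n_vars x n_obs matrix with one simulated series per row."""
    rng = random.Random(seed)
    return [brownian_series(n_obs, rng) for _ in range(n_vars)]

# Reduced row count for a quick run; the slide's data set has 16,000 rows.
matrix_x = brownian_matrix(n_vars=16, n_obs=1000)
print(len(matrix_x), len(matrix_x[0]))
```

Fixing the seed keeps the generated input identical across the systems being compared, so timing differences are not caused by different data.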
23
Empirical Work
Introduction
The Problem
State of the art (Existing systems review)
Methodology
Empirical Work
Benchmark tests
Experimental design
Measurements & controls
The Results
Conclusions
24
Experimental Setup: R + SQL Server
Traditional (RODBC) Integration
Installed SQL Server 2012 (64-bit)
Installed Open Source R 2.13.2 (64-bit client)
Installed the RODBC package from RGui
Common Language Runtime (CLR) Integration
CLR Stored Procedures are .NET objects which run in db memory
Created the usual R script files
Developed a C# CLR procedure with embedded R; compiled it to get a DLL
Enabled the CLR integration feature of SQL Server
Created an assembly from the DLL, and stored procedures (SPs) which reference the assembly
Ran the stored procedures with the R script files as input
25
The Results
Introduction
The Problem
State of the art (Existing systems review)
Methodology
Empirical Work
The Results
MC, MF, PC, Overall benchmarks
Implications of the results and findings
Which integration architecture works well?
Conclusions
26
Empirical Results
Average overall benchmark results:
[Figure: OVERALL Performance; run-time (normalised, axis 0.00 to 180.00) by system: Stand-Alone R, PostgreSQL, Oracle, DB2, SQL Server]
27
Empirical Results
Matrix Calculation (MC) benchmark results
[Figure: Matrix Calculation Performance (MC); run-time (normalised, axis 0.00 to 60.00) by system: Stand-Alone R, PostgreSQL, Oracle, DB2, SQL Server]
28
Empirical Results
Matrix Function (MF) benchmark results
[Figure: Matrix Function Performance (MF); run-time (normalised, axis 0.00 to 90.00) by system: Stand-Alone R, PostgreSQL, Oracle, DB2, SQL Server]
29
Empirical Results
Program Control (PC) benchmark results
Benchmark | Stand-Alone R | PostgreSQL | Oracle | DB2 | SQL Server
PC01 | 2.71 | 2.78 | 2.77 | 2.70 | 2.77
PC02 | 0.30 | 0.40 | 0.31 | 0.28 | 0.29
PC03 | 0.63 | 0.60 | 0.43 | 0.65 | 0.66
PC04 | 0.51 | 0.50 | 0.51 | 0.53 | 0.51
PC05 | 0.38 | 0.38 | 0.36 | 0.38 | 0.38
KEY:
PC01: Fibonacci numbers; ctrl flow
PC02: Hilbert matrix; ctrl flow
PC03: gcd2; recursive
PC04: Toeplitz matrix; looping
PC05: Escoufier's method on matrix
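To give a feel for what the program-control tests exercise, here is a rough Python sketch of three of the kernels named in the key. These are illustrative re-implementations under assumptions, not the benchmark's R code: the function names and the Toeplitz entry formula |i - j| + 1 are chosen for the example.

```python
from fractions import Fraction

def gcd2(x, y):
    """Recursive greatest common divisor (Euclid's algorithm)."""
    return x if y == 0 else gcd2(y, x % y)

def hilbert(n):
    """n x n Hilbert matrix: entry (i, j) is 1 / (i + j - 1), 1-based indices."""
    return [[Fraction(1, i + j - 1) for j in range(1, n + 1)]
            for i in range(1, n + 1)]

def toeplitz(n):
    """n x n Toeplitz matrix filled with explicit nested loops."""
    m = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            m[i][j] = abs(i - j) + 1  # assumed entry formula
    return m

print(gcd2(48, 18))
print(hilbert(2))
print(toeplitz(3))
```

These kernels are dominated by interpreter-level recursion and looping rather than by data access, which is consistent with the near-identical timings across systems in the table.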
30
Empirical Results
Paired t-test on PC results (PostgreSQL, Oracle)
Statistic | Variable 1 (PostgreSQL) | Variable 2 (Oracle)
Mean | 0.932 | 0.876
Variance | 1.07492 | 1.12668
Observations | 5 | 5
Pearson Correlation | 0.997786692 |
Hypothesized Mean Difference | 0 |
df | 4 |
t Stat | 1.691541861 |
P(T<=t) one-tail | 0.082996265 |
t Critical one-tail | 2.131846782 |
P(T<=t) two-tail | 0.16599253 |
t Critical two-tail | 2.776445105 |
The mean performance difference (M = 0.06, SD = 0.074, N = 5) was not
significantly different from zero, t(4) = 1.69, two-tailed p = 0.166, so the test
detects no significant difference in the PC performance of the two DBMSs.
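The reported statistics can be re-derived from the PC01 to PC05 timings for PostgreSQL and Oracle. The sketch below uses only Python's standard library; the critical value is the two-tailed one quoted on the slide for 4 degrees of freedom, and the variable names are illustrative.

```python
import math
import statistics

# PC01..PC05 run-times from the Program Control results
postgresql = [2.78, 0.40, 0.60, 0.50, 0.38]
oracle = [2.77, 0.31, 0.43, 0.51, 0.36]

# A paired t-test works on the per-benchmark differences.
diffs = [p - o for p, o in zip(postgresql, oracle)]
n = len(diffs)
mean_diff = statistics.mean(diffs)
sd_diff = statistics.stdev(diffs)           # sample standard deviation
t_stat = mean_diff / (sd_diff / math.sqrt(n))

T_CRITICAL_TWO_TAIL = 2.776445105           # df = 4, as quoted on the slide
significant = abs(t_stat) > T_CRITICAL_TWO_TAIL

print(f"M={mean_diff:.3f} SD={sd_diff:.3f} t({n - 1})={t_stat:.2f} "
      f"significant={significant}")
```

Because the t statistic stays below the critical value, the null hypothesis of equal mean run-times is not rejected for these five tests.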
31
Empirical Results
Scalability benchmark results
[Figure: two charts over benchmark tests MC1 to MC5 and MF1 to MF8.
Chart 1 (axis 0.00 to 30.00), series dTimes1-4m-r and dTimes1-4mc-ore: Oracle shows a slight scalability edge over stand-alone R for small datasets.
Chart 2 (axis 0.00 to 200.00), series dTimes4-16mc-r and dTimes4-16mc-ore: stand-alone R is overwhelmed by large datasets; that is when the edge of R+DBMS is manifested.]
32
Empirical Results Reliability
Average vs. Minimum Results: about same
[Figure: two panels, Avg OVERALL Performance and Min OVERALL Performance; run-time (normalised, axis 0.00 to 200.00) by system: Stand-Alone R, PostgreSQL, Oracle, DB2, SQL Server]
Performance patterns observed remain exactly the same
No significant variations in actual recorded values
33
Why does R+PostgreSQL perform poorly?
Postgres performed well in tests with less data
Timing retrieval of database resident data as matrix
Retrieving DB Data | Run 1 | Run 2 | Run 3 | Run 4 | Run 5 | Run 6 | Run 7 | Total | Average
Oracle | 0.15 | 0.14 | 0.14 | 0.14 | 0.14 | 0.15 | 0.14 | 1.00 | 0.14
PostgreSQL | 20.12 | 19.05 | 19.12 | 19.12 | 19.03 | 19.11 | 19.12 | 134.67 | 19.24
Direct rows fetch (SELECT * FROM stockHist)
Oracle (4.51 sec) is 2.66 times faster than PostgreSQL (12.02 sec)
Implication:
The poor performance of PostgreSQL-coupled R is not
exclusively a consequence of the implementation but
also of the database itself (data retrieval/fetching)
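The totals, averages and ratios behind this comparison can be checked with a few lines of arithmetic. The sketch below re-computes them from the per-run timings and the two direct-fetch times; the helper name `summarize` is hypothetical.

```python
oracle_runs = [0.15, 0.14, 0.14, 0.14, 0.14, 0.15, 0.14]
postgres_runs = [20.12, 19.05, 19.12, 19.12, 19.03, 19.11, 19.12]

def summarize(runs):
    """Return (total, average) of the run timings, rounded to two decimals."""
    total = sum(runs)
    return round(total, 2), round(total / len(runs), 2)

oracle_total, oracle_avg = summarize(oracle_runs)
postgres_total, postgres_avg = summarize(postgres_runs)

# Ratio for retrieving database-resident data as a matrix (totals over 7 runs)
matrix_ratio = postgres_total / oracle_total

# Ratio for the direct row fetch (SELECT * FROM stockHist)
fetch_ratio = 12.02 / 4.51

print(oracle_total, oracle_avg, postgres_total, postgres_avg)
print(f"matrix retrieval ratio: {matrix_ratio:.1f}, fetch ratio: {fetch_ratio:.3f}")
```

The gap in plain row fetching (a factor between 2.66 and 2.67 from the rounded times) is far smaller than the gap in matrix retrieval (a factor of roughly 135), which suggests that both the database and the coupling implementation contribute to the slowdown.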
34
Why does R+Oracle work well?
Architecture of Oracle R Enterprise. Adapted from [9]
In-db statistic engine
Storing of R Script in-db
Capability of spawning multiple R engine instances
Efficient data retrieval and passing (rqTableEval, rqRowEval)
35
Conclusions
Introduction
The Problem
State of the art (Existing systems review)
Methodology
Empirical Work
The Results
Conclusions
Lessons learnt
Future Studies
Final Words
36
Implications to Research Questions
Growing level of development of R+DBMS
Most capabilities of stand-alone R obtainable in R+DBMS
Better performance with R+DBMS
Provided data is efficiently retrieved and passed
R still competitive in less data-intensive analytics
Architectural Implications
Positioning of analytic engine w.r.t database
Existence of native SQL equivalent of analytic functions
Extent of exploitation of db parallelism and db scalability
Lessons, going forward
The technique for retrieving and passing data has a huge
impact, regardless of how fast the substantive analytics run
37
Future Studies
Benchmark in-memory, col-oriented, doc-oriented dbs
Study effective & efficient data access by analytic functions
Compare performance gains: R+RDBMS vs. R+NoSQL dbs
Max gain from parallelism and scalability of R+DBMSs
Benchmark on different OS and varying data amounts
Benchmark on datasets with varied attribute properties
38
Conclusions (final words)
In recommending R+DBMS, architectures must
Facilitate efficient retrieval and passing of data from
the database objects (tables, views, procedures, etc)
to the analytic functions;
Lessen or eliminate data movement and reduce other
run-time overheads;
Maintain data security (the C.I.A of data);
C.I.A = Confidentiality, Integrity and Availability
Reduce development overheads for introducing new &
maintaining existing analytics techniques in the DBMSs
39
References
1. Robert Kabacoff. R in Action: Data analysis and graphics with R. Manning Publications Co., 2011.
2. Revolution Analytics. Revolution RevoR Enterprise Benchmark Details: Benchmark Scripts. URL: http://www.revolutionanalytics.com/revolution-revorenterprise-benchmark-details.
3. Philippe Grosjean. R Benchmark 2.5. 2008. URL: http://r.research.att.com/benchmarks/R-benchmark-25.R.
4. Edwin Grappin. Generate stock option prices - How to simulate a Brownian motion. http://probaperception.blogspot.com.es/2012/10/generatestock-option-prices-how-to.html. Blog. 2012.
5. Mark Hornick and Tim Vlamis. Oracle R Enterprise Hands-on Lab. [Online; accessed 12-March-2014]. ORACLE, 2014. URL: http://www.vlamis.com/storage/papers/Hornick%20-%20ORE%20Hands-On%20Lab.pdf.
6. David Smith. R integrated throughout the enterprise analytics stack. http://blog.revolutionanalytics.com/2012/02/r-in-the-enterprise.html/. Blog. 2012.
7. Elaine Chen. Using R and Tableau. Tech. rep. Also available as http://www.tableausoftware.com/sites/default/files/media/using-r-and-tableausoftware_0.pdf. 2013.
8. CodePlex (Microsoft). R.NET. URL: http://rdotnet.codeplex.com/.
9. Mark Hornick and Tim Vlamis. Oracle R Enterprise Hands-on Lab. [Online; accessed 12-March-2014]. ORACLE, 2014. URL: http://www.vlamis.com/storage/papers/Hornick%20-%20ORE%20Hands-On%20Lab.pdf.
10. Bharat Sukhwani, Hong Min, Mathew Thoennes, Parijat Dube, Balakrishna Iyer, Bernard Brezzo, Donna Dillenberger, and Sameh Asaad. “Database analytics acceleration using FPGAs”. In: Proceedings of the 21st international conference on Parallel architectures and compilation techniques. ACM. 2012, pp. 411–420.
11. Joseph M Hellerstein, Christopher Ré, Florian Schoppmann, Daisy Zhe Wang, Eugene Fratkin, Aleksander Gorajek, Kee Siong Ng, Caleb Welton, Xixuan Feng, Kun Li, and Arun Kumar. “The MADlib analytics library: or MAD skills, the SQL”. In: Proceedings of the VLDB Endowment 5.12 (2012), pp. 1700–1711.
12. Xixuan Feng, Arun Kumar, Benjamin Recht, and Christopher Ré. “Towards a unified architecture for in-RDBMS analytics”. In: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data. ACM. 2012, pp. 325–336.
13. Stefan Steinhaus. Comparison of mathematical programs for data analysis. 1999.
14. Donald Knuth. Comparison of mathematical programs for data analysis. 2008. URL: http://www.scientificweb.com/ncrunch/ncrunch5.pdf.
40
Thank you
41