SAS In-Database Procedures on eBay’s Teradata System Reduces
Processing Time by a Factor of 4
Arun Akkinapalli, eBay Inc, San Jose, California
Gurudev Karanth, eBay Inc, San Jose, California
Mithun Yarlagadda, eBay Inc, San Jose, California
John Prenner, eBay Inc, San Jose, California
Abstract
In-database processing refers to the integration of data analytics into data warehousing functionality. Many statistical
computing solutions and large databases use this technology because it provides significant performance
improvements over traditional methods. Working efficiently with huge datasets, consisting of millions of rows and
hundreds of columns and adding up to gigabytes of storage, is a challenge that many users and organizations face
today. Beyond the sheer volume of data, additional constraints include end-to-end processing time, the
implications of transferring processing to the database, storage space, system resources, and data transfer.
Utilizing SAS in-database processing on the Teradata Enterprise Data Warehouse platform for the production of
experimentation results has reduced end-to-end processing time by a factor of 4 at eBay Inc. This improvement in
overall throughput has given eBay Inc. ~30% additional processing capacity and thereby enabled the evaluation of
more experiments this year. This paper details the implementation, which uses Teradata’s MPP (Massively Parallel
Processing) architecture with SAS procedures, and illustrates the differences between standard processing and
in-database processing.
Introduction
SAS In-Database processing integrates SAS solutions, SAS analytic processes, and third-party database
management systems. Using SAS In-Database processing, certain SAS procedures, scoring models, and formatted
SQL queries can be executed inside the database.
SAS In-Database processing is a flexible, efficient way to leverage increasing amounts of data by integrating select
SAS technology into databases or data warehouses. It utilizes the massively parallel processing (MPP) architecture
of the database or data warehouse for scalability and better performance. Moving relevant data management,
analytics and reporting tasks to where the data resides is beneficial in terms of speed, reducing unnecessary data
movement and promoting better data governance. For decision makers, this means faster access to analytical results
and more agile and accurate decisions.
SAS plays a vital role in the production of experimentation results at eBay. This involves processing a large amount
of data (~250 million records with ~150 variables) within a finite time frame. The volume of data to be processed has
opened up several challenges around data preparation and transformation, data transfer between Teradata and SAS,
dataset storage and optimum system resource utilization, leading to increased dataset processing costs and
scalability bottlenecks. By leveraging SAS In-database processing, we have been able to move data processing to
where the data resides with benefits in terms of speed, reduced dataset volume, eliminating data movement and
promoting better data governance. A noticeable increase in capacity and reduction in processing time in the
production of experimentation results reaffirms the utility of this methodology.
This paper goes over the implementation details and is organized as follows:
• The first section goes over the standard process and illustrates its performance metrics and
challenges with a use case.
• The second and third sections provide an overview of in-database processing, go over usage details, and
illustrate the performance metrics with the same use case.
• The fourth and last section compares the system resource utilization of the standard process and the
in-database process, and covers the advantages of in-database processing at eBay and future areas of optimization.
The SAS code included in this paper is functional on SAS 9.3 in a UNIX environment. Please note that the output
may differ for different settings and environments.
Standard Process
OUTLINE:
Experimentation results are produced from summarized Teradata data structures that are processed on a daily basis.
The experimentation result production process involves creating a SAS dataset from an underlying summarized
Teradata table. The process consists of four major steps, all triggered by a SAS wrapper script. Initially, the
parameters necessary for analysis are entered into a text file (parameter source file) as SAS macro variables. Next, a
sequence of SQL statements is executed on the database, yielding a summarized Teradata table. This table acts as the
source data for all the subsequent SAS processes. In the next step, the SAS program uses the fast export process with
multiple sessions to transfer the Teradata table onto the SAS server, where a SAS dataset is created. The analysis is then
performed on this SAS dataset to provide test analysis at different dimensions based on multiple aggregations. The
output of the analysis of each dimension is appended to the final SAS output dataset. The final step transfers the
results back from the SAS dataset to a Teradata table using the fast load process. This processed dataset is visualized on
different dashboards through a BI tool for end-user consumption.
The process is outlined in figure 1.
Figure 1: Standard Process
EXAMPLE:
A typical test analysis usually consists of up to 250 million records with 150 variables. The table below gives
approximate run time of different steps of the process for a dataset of 91 million records with 150 variables.
STEPS | DESCRIPTION | TIME TAKEN (minutes)
SQL’s on Teradata | Step to create a Teradata table (input to SAS process) | 32
Data Transfer | Transfer of Teradata table to SAS dataset using fast export process | 52
Statistical computation | Processing of data at different dimensions | 233
Transfer (SAS – Teradata) | Transfer of output SAS dataset to Teradata table | 6
Total | | 323
Table 1: Run times at different steps of standard process
The above table shows that the SQL statements take around 30 minutes to create a summarized Teradata table from the
underlying base tables. Data transfer from Teradata to SAS using the fast export process with multiple sessions took
about an hour of processing time. Also note that as the volume of data increases, the time required increases proportionally. The
test analysis is provided for different dimensions of the SAS dataset: the SAS dataset is aggregated multiple times using
PROC SQL, and analysis is performed on the summary dataset created after each aggregation. This statistical
computation takes around 4 hours. The final transfer from the SAS dataset to Teradata needs around 6 minutes. On the
whole, the standard process consumed around 5 hours.
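As a rough check on where the time goes, the step durations from Table 1 can be expressed as shares of the total. The sketch below is only an illustration: the dataset name standard_steps and the step labels are made up for this example, and the minutes are copied from Table 1.

```sas
/* Step durations from Table 1 (minutes); dataset name and labels are illustrative */
data standard_steps;
   length step $40;
   infile datalines dsd;
   input step $ minutes;
   datalines;
SQL on Teradata,32
Teradata to SAS transfer (fast export),52
Statistical computation,233
SAS to Teradata transfer (fast load),6
;
run;

/* Share of the total run time per step; PROC SQL remerges SUM(minutes) onto every row */
proc sql;
   select step,
          minutes,
          minutes / sum(minutes) as share format=percent8.1
   from standard_steps;
quit;
```

With these figures, statistical computation accounts for roughly 72% of the 323 minutes and the Teradata-to-SAS transfer for about 16%, so these two steps dominate the standard run.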
There are several challenges with this process. First, it uses the fast export process with multiple sessions, which
can consume a lot of bandwidth on the server, and the data transfer may fail if the volume of data is very large. The system
resources needed to process such volumes are substantial, so only one or two runs can be processed at
a time. This limits the number of tests that can be analyzed in a given time frame. The process also does not take advantage of
the highly optimized and tuned database for aggregations, which in turn increases run times.
These and other challenges led us to use in-database processing to produce experimentation results, as described in
the sections below. Sample code for this process is provided at the end of this paper.
What is In-Database Processing?
SAS in-database processing on a Teradata system is a flexible and efficient way to leverage huge amounts of data by
exploiting the massively parallel processing capabilities of Teradata to run different SAS procedures on the data
warehouse.
The Teradata system architecture is a ‘shared nothing’ environment, which means each unit is parallel and independent of
the other units. Many individual servers are interconnected and act as a single large system. There are two
types of virtual processes in Teradata: Access Module Processors (AMPs), which do all the data crunching, and the Parsing
Engine, which interfaces with all the applications and users to throttle the load. SAS in-database processing takes
advantage of this massively parallel architecture: SAS work is transferred to each of these units, which enables
the data to be processed in parallel without any data transfer between Teradata and SAS.
BASE SAS PROCEDURES AVAILABLE FOR IN-DATABASE PROCESSING
Every organization with BASE SAS 9.2M2, Teradata 12 or higher, and SAS/ACCESS 9.2 can take advantage of the
BASE SAS procedures listed below without any additional components (described in the next part):
• FREQ (aggregation and summarization)
• MEANS (aggregation and summarization)
• SUMMARY (aggregation and summarization)
• RANK (ranking and sorting)
• TABULATE (aggregation and summarization)
• REPORT (aggregation and summarization)
• SORT (ranking and sorting)
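As a minimal sketch of how one of these Base procedures is pointed at a Teradata table, the example below runs PROC FREQ against a table referenced through a SAS/ACCESS libref. The libref td, the table exp_summary, the output dataset trmt_counts and the masked connection values are assumptions made for illustration; the system options are the same ones used in the sample code at the end of this paper.

```sas
/* Ask SAS to generate SQL so that eligible procedures run inside the DBMS */
options sqlgeneration=dbms msglevel=i;
/* Echo the SQL passed to the DBMS into the SAS log */
options sastrace=',,,d' sastraceloc=saslog nostsuffix;

/* Hypothetical connection values; replace with your own */
libname td teradata user=&username password="&pwd" database=xxxxxx tdpid="xxxxxx";

/* One-way frequency of treatment IDs; only the counts come back to SAS */
proc freq data=td.exp_summary;
   tables trmt_id / noprint out=work.trmt_counts;
run;
```

The SAS log is the place to confirm what happened: with SASTRACE enabled, the SQL sent to Teradata is written there, and MSGLEVEL=I adds informational notes about how the step was processed.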
COMPONENTS OF IN-DATABASE PROCESSING
There are three different components of in-database processing:
Scoring acceleration for Teradata
Scoring acceleration enables translation and execution of different scoring models created in SAS Enterprise Miner,
directly in the database environment. This reduces the data transfer, improves the overall performance of modeling
and automates the whole process.
The minimum system requirements for scoring accelerator of Teradata system are BASE SAS 9.2M2, SAS ACCESS
9.2, SAS Enterprise Miner, Scoring Acceleration and Teradata 12 or higher.
Analytics acceleration for Teradata
SAS analytics acceleration enables execution of key SAS/STAT and data summarization functions in the Teradata
system. This reduces the processing time to build, execute and deploy predictive models.
The SAS/STAT and SAS/ETS procedures that are enabled are:
• CORR (correlation)
• CANCORR (canonical correlation)
• FACTOR (factor)
• PRINCOMP (principal components)
• REG (regression analysis, including stepwise regression)
• SCORE (scoring of linear models)
• VARCLUS (group variables into clusters)
• TIMESERIES (analyzes time-stamped transactional data and aggregates the data into a time series format
for trending and seasonal analysis)
The minimum system requirements for the Analytics Accelerator for Teradata are BASE SAS 9.2M2, SAS/STAT
9.2M2, SAS/ACCESS 9.2, SAS ANALYTICS ACCELERATOR 1.2 and Teradata 12 or higher.
In database analytics with a Teradata Enterprise Data Warehouse
SAS and Teradata Corporation have created the SAS and Teradata Analytic Advantage program to provide in-database
analytics that enable organizations to minimize risk, maximize value and increase efficiency.
The packages are as below:
• Express - SAS Analytics Accelerator for Teradata, SAS Analytics Pro, SAS/ACCESS Interface to Teradata
and SAS Enterprise Guide
• Advanced - SAS Analytics Pro, SAS/ACCESS Interface to Teradata, SAS Enterprise Miner and SAS
Scoring Accelerator for Teradata
• Enterprise - SAS Analytics Accelerator for Teradata, SAS Analytics Pro, SAS/ACCESS Interface to
Teradata, SAS Enterprise Miner, SAS Model Manager and SAS Scoring Accelerator for Teradata
ADVANTAGES OF IN DATABASE PROCESSING

• Input and output costs are reduced, as there is no transfer of data between SAS and Teradata.
• Users have more bandwidth on the SAS server, as in-database processing avoids the use of
single or multiple sessions of the fast export process for data transfer.
• In most cases, summarizing large amounts of data on the DBMS side is faster, as most databases are highly
optimized, scalable and tuned.
• Since the workload is shared between the database and the SAS system, large amounts of data can be processed at
the same time.
In-Database Process
In this process, aggregations take place in the database and additional processing is done in SAS. The process
consists of four major steps, all triggered by a SAS wrapper script. Initially, the parameters necessary for analysis
are extracted from a Teradata table as SAS macro variables. Next, a sequence of SQL statements is executed on the database,
yielding a summarized Teradata table. This table acts as the source data for all further processes. In the next
step, the SAS program uses an in-database procedure (PROC MEANS) to aggregate the Teradata table at different
dimensions and create summary SAS datasets at different levels. These datasets are used for all further processing. The
next steps are similar to the standard process: the output of the analysis of each dimension is appended to the final
SAS output dataset. The final step returns the results from the SAS dataset to a Teradata table using the fast load
process. This processed dataset is visualized on different dashboards through a BI tool for end-user consumption.
The process is outlined in figure 2.
Figure 2: In-database Process
EXAMPLE:
A typical test analysis usually consists of up to 250 million records with 150 variables. The table below gives
approximate run time of different steps of the process for a dataset of 91 million records with 150 variables.
STEPS | DESCRIPTION | TIME TAKEN (minutes)
SQL’s on Teradata | Step to create a Teradata table (input to SAS process) | 32
Data Transfer | Transfer of Teradata table to SAS dataset using fast export process | 0
Statistical computation | Processing of data at different dimensions | 45
Transfer (SAS – Teradata) | Transfer of output SAS dataset to Teradata table | 6
Total | | 83
Table 2: Run times at different steps of In-database process
The above table shows that the SQL statements take around 30 minutes to create a Teradata table from the raw tables. There
is no data transfer from Teradata to SAS. Statistical computation takes around 45 minutes, because the aggregation of data is
pushed to the database side and only a summary SAS dataset is created, which acts as the source for further processing. The final
transfer from the SAS dataset to Teradata needs around 6 minutes. On the whole, the in-database process consumed
around 1.5 hours.
Sample code for this process is provided at the end of this paper, and the process is also outlined in figure 2.
Comparison of the Standard process with the In-Database process
The initial step of running the SQL statements on the database is the same in both processes and takes around 32 minutes. The data
transfer step is completely eliminated in the in-database process, because aggregations are
performed in the database rather than on the SAS server. The run time of the statistical computation is drastically
lower for the in-database process than for the standard process. This is achieved by using PROC MEANS (a SAS
in-database procedure), which creates SQL using the SQL generator. These SQL statements are executed on the database, aggregating
the Teradata table multiple times and taking advantage of the highly optimized Teradata system. In the
standard process, by contrast, the data is transferred from the Teradata table to a SAS dataset and aggregations are performed on this
SAS dataset. The final step remains the same in both processes: the SAS dataset is loaded back into a
Teradata table. The total time taken by the in-database process is about a quarter of that of the standard process
(83 versus 323 minutes), as illustrated by the bar graph below.
[Figure: bar chart titled “Use Case: InDatabase vs Standard Processing”; y-axis: minutes (0 to 350); x-axis: volume in millions (91); series: In-database Method, Standard Method]
Figure 3: Comparison of the total time of standard and in-database processes
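The contrast between the two processing modes can be reproduced on a single table by switching the SQLGENERATION system option, as in the hedged sketch below. The libref td, the table exp_summary, the macro name summarize and the output dataset names are hypothetical; the variables week, trmt_id and metric_1 mirror the sample code at the end of this paper. Timings will depend on the environment.

```sas
/* Hypothetical connection values; replace with your own */
libname td teradata user=&username password="&pwd" database=xxxxxx tdpid="xxxxxx";

/* Echo the SQL sent to Teradata into the SAS log and report detailed step timings */
options sastrace=',,,d' sastraceloc=saslog nostsuffix fullstimer msglevel=i;

%macro summarize(mode);
   /* NONE: rows are read into SAS; DBMS: eligible work is sent to Teradata as SQL */
   options sqlgeneration=&mode;

   proc means data=td.exp_summary nway noprint;
      class week trmt_id;
      var metric_1;
      output out=work.summary_&mode mean= std= sum= / autoname;
   run;
%mend summarize;

%summarize(none)
%summarize(dbms)
```

Comparing the FULLSTIMER statistics and the traced SQL of the two PROC MEANS steps in the log gives a small-scale view of the difference the bar chart summarizes.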
Table 3 and the scatter chart (figure 4) below illustrate the total run time of some additional use cases with varied volumes,
which show a consistent reduction of around 4X for the in-database process.
Volume | In-database process (minutes) | Standard process (minutes) | Difference (minutes)
61 million | 70 | 251 | 181
79 million | 78 | 260 | 182
91 million | 83 | 323 | 240
112 million | 104 | 430 | 326
141 million | 110 | 522 | 412
Table 3: Total run time for standard and In-database processes at varied volumes
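The "around 4X" figure can be checked directly from the run times in Table 3. The following sketch is illustrative only; the dataset and variable names (runtimes, indb_min, std_min, speedup, saved_min) are made up for this example, and the input values are the total run times from Table 3.

```sas
/* Total run times from Table 3: volume (millions), in-database and standard (minutes) */
data runtimes;
   input volume_millions indb_min std_min;
   speedup   = std_min / indb_min;   /* how many times faster the in-database run is */
   saved_min = std_min - indb_min;   /* minutes saved per run */
   datalines;
61 70 251
79 78 260
91 83 323
112 104 430
141 110 522
;
run;

proc print data=runtimes noobs;
   format speedup 5.2;
run;
```

On these five use cases the ratio ranges from about 3.3 to about 4.7 and tends to be higher at the larger volumes.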
Figure 4: Comparison of run time of standard and in-database processes at varied volumes
As the above chart indicates, the run time of the standard process climbs steeply with volume, while the in-database
process scales far more gradually and close to linearly, with reduced overall processing cost.
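One simple way to look at how each process scales is the additional run time per additional million records between consecutive use cases. The sketch below assumes the run times from Table 3; the dataset and variable names are hypothetical, and with only five data points the result should be read as indicative rather than as a fitted model.

```sas
/* Total run times from Table 3: volume (millions), in-database and standard (minutes) */
data growth;
   input volume_millions indb_min std_min;
   /* DIF returns the change from the previous observation (missing on the first row) */
   d_volume             = dif(volume_millions);
   indb_min_per_million = dif(indb_min) / d_volume;
   std_min_per_million  = dif(std_min)  / d_volume;
   datalines;
61 70 251
79 78 260
91 83 323
112 104 430
141 110 522
;
run;

proc print data=growth noobs;
   format indb_min_per_million std_min_per_million 6.2;
run;
```

The first row has missing increments by construction, so SAS writes a note about missing values to the log; the remaining rows give the marginal minutes per million records for each process.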
How did In-database process help eBay Inc?
eBay Inc. benefited from the in-database process through both a reduction in end-to-end processing times and an
increase in processing throughput. There has been an overall reduction in system resource utilization, allowing for a
greater degree of parallelism. There has been a reduction in disk space utilization, since data no longer has to be stored on the
SAS server. As the processing time has been reduced, additional insights were provided for all the tests without a major
increase in run time. The process also consumed less data transfer bandwidth on both servers, so other
processes were able to take advantage of the servers with no additional upgrades. It is also less prone to issues
such as timed-out sessions and terminations.
Limitations of In-database processing
Only a limited number of SAS procedures support in-database processing, which restricts its usage to specific areas.
In-database processing on small volumes of data may not be efficient and may in turn increase the overall
resources needed for the test analysis, so standard SAS processing is recommended for such cases.
Further, it may not be a viable option if resources on the database side are already at their peak consumption, as it
may cause additional overhead to the database.
Conclusion
By combining the power of SAS with massively parallel processing enabled database technology, faster and cheaper
analytics can be achieved. The key value proposition, besides cost savings in human and infrastructure resources, is
that more data can be processed in shorter time periods with enhanced data security. With quicker turnarounds,
more iterations of refinement with experiments are possible.
Leveraging in-database functionality added value to our current processes by:
• Reducing processing time by a factor of 4
• Enabling analytic SAS functions by moving processing to the data
• Increasing throughput by reducing system resource utilization and allowing for a greater degree of
parallelism
• Increasing performance and reducing data movement
• Lowering the total cost of ownership
• Enhancing the effectiveness of analysts, who can stay focused on higher-value tasks
• Scaling SAS processes inside the Teradata system
• Optimizing resource utilization across the analytic/warehousing environment
• Improving data quality and consistency
• Increasing process robustness by reducing implementation restrictions
REFERENCES

• Scott Mebust and Robert S. Ray Mebust. 2010. “SAS Presents In-Database Procedures in Practice.”
Proceedings of the SAS Global Forum 2010 Conference. Paper 300-2010. Cary, NC: SAS Institute Inc.
Available at http://support.sas.com/resources/papers/proceedings10/300-2010.pdf
• SAS Institute (2012). SAS 9.3 Documentation, SAS® 9.3 In-database Products. Available at
http://support.sas.com/documentation/cdl/en/indbug/64690/PDF/default/indbug.pdf
• SAS Institute (2012). SAS® High performance analytics. Available at
http://www.sas.com/software/high-performance-analytics/in-database-processing/index.html
• David Shamlin and David Duling. 2009. “In-Database Procedures with Teradata: How they work and What
they buy you.” Paper 337-2009. Cary, NC: SAS Institute Inc. Available at
http://support.sas.com/resources/papers/sgf09/337-2009.pdf
ACKNOWLEDGMENTS
The authors are thankful to eBay for allowing them to use the valuable information. Special thanks to John
Scheibmeir and Daryl Tilley for their help all through the endeavor.
RECOMMENDED READING

http://support.sas.com/resources/papers/proceedings10/300-2010.pdf

http://support.sas.com/resources/papers/sgf09/337-2009.pdf

http://support.sas.com/documentation/cdl/en/indbug/64690/PDF/default/indbug.pdf
CONTACT INFORMATION
Your comments and questions are valued and encouraged. Contact the authors at:
Arunkumar Akkinapalli
eBay Inc
2525 North 1st Street
San Jose, CA, 95131
[email protected]
Karanth Gurudev
eBay Inc
2145 Hamilton Avenue
San Jose CA, 95125
[email protected]
Mithun Yarlagadda
eBay Inc
2145 Hamilton Avenue
San Jose CA, 95125
[email protected]
John Prenner
eBay Inc
2145 Hamilton Avenue
San Jose CA, 95125
[email protected]
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS
Institute Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are trademarks of their respective companies.
SAMPLE CODE:
STANDARD PROCESS
options macrogen mprint serror nosymbolgen;
options mergenoby = warn;
options nocenter;
options compress = Y;
options fullstimer;
options cleanup;
options replace;
options mautosource;
%let tname = adhoc;
/* fast export process with multiple sessions */
libname library '/sas/xxx/xxxxxxxx';
option dbsliceparm=(threaded_apps,8);
PROC SQL;
CONNECT TO TERADATA AS TD (USER=&username PASSWORD= &PWD DATABASE=XXXXXXX logdb =
XXXXXX fastexport=yes TDPID="xxxxxx" mode = teradata);
create table library.sas_dataset as select * from connection to td
(SELECT * from xxxxxx.table_name a);
QUIT;
/* aggregating the sas dataset to create a summary SAS dataset */
PROC SQL noprint threads;
create table wm2 as
select week,
trmt_id
,coalesce(sum(metric_1),0) as metric_1
,uid
from library.sas_dataset
group by
week,
trmt_id,
uid having max(trd_uid) in (1,0);
QUIT;
PROC MEANS DATA=wm2
FW=12
NWAY
VARDEF=DF
MEAN
STD
SUM
MEDIAN NOPRINT;
VAR metric_1;
CLASS week;
CLASS trmt_id;
OUTPUT OUT= a
MEAN()=
STD()=
sum()=
Median()=
/ AUTONAME AUTOLABEL WAYS;
RUN;
/* appending the output of test analysis on summary dataset ‘a’ */
DATA library.final&tname.;
SET library.final&tname. library.summary_&tname.;
RUN;
/* inserting the SAS dataset into Teradata table using fast load process */
libname trlib teradata USER=&username PASSWORD= &pwd database = xxxxxxx tdpid =
"XXXXX";
proc sql;
create table trlib.int_&tname.(fastload=yes) as select * from library.final&tname.;
quit;
/* inserting the staged rows into the final Teradata table */
proc sql;
CONNECT TO TERADATA AS TD (USER=&username PASSWORD= &PWD DATABASE=xxxxx TDPID="xxxxxx"
mode = teradata);
execute
(insert into TEST_ANALYSIS_FINAL
SELECT * from int_&tname) by td;
Quit;
IN-DATABASE PROCESS:
options macrogen mprint serror symbolgen;
options mergenoby = warn;
options nocenter;
options compress = Y;
options fullstimer;
options cleanup;
options replace;
options mautosource;
/* Options that trigger in-database process */
options SQLGENERATION=DBMS MSGLEVEL=I sastrace=',,,d'
sastraceloc=saslog nostsuffix;
%let tname = adhoc;
/* aggregating the Teradata table to create summary sas dataset */
libname indb teradata USER = &username PASSWORD = "&pwd" database = &pet_space. tdpid
= "XXXXXX";
proc sql;
CONNECT TO TERADATA AS TD (USER=&username PASSWORD= "&pwd" DATABASE=xxxxx
TDPID="vivaldi" mode = teradata);
/* the pass-through statement is closed with ") by td" */
execute
(INSERT INTO xxxxx.EXL_WHL_MKT_&tname._W
select
week,
trmt_id,
user_id,
sum(metric_1) as metric_1
from xxxxxx.table_name group by 1,2,3 having max(trd_uid) in (1,0)) by td;
quit;
PROC MEANS DATA = indb.EXL_WHL_MKT_&tname._W
FW=12
NWAY
VARDEF=DF
N
MEAN
STD
SUM
NOPRINT;
VAR metric_1;
CLASS week;
CLASS trmt_id;
OUTPUT OUT= a
N()=
MEAN()=
STD()=
sum()=
/ AUTONAME AUTOLABEL WAYS;
RUN;
/* appending the output of test analysis on summary dataset ‘a’ */
DATA library.final&tname.;
SET library.final&tname. library.summary_&tname.;
RUN;
/* inserting the SAS dataset into Teradata table using fast load process */
libname trlib teradata USER=&username PASSWORD= &pwd database = xxxxxxx tdpid =
"XXXXX";
proc sql;
create table trlib.int_&tname(fastload=yes) as select * from library.final&tname.;
quit;
/* inserting the staged rows into the final Teradata table */
proc sql;
CONNECT TO TERADATA AS TD (USER=&username PASSWORD= &PWD DATABASE=xxxxx TDPID="xxxxxx"
mode = teradata);
execute
(insert into TEST_ANALYSIS_FINAL
SELECT * from int_&tname) by td;
Quit;