SAS In-Database Procedures on eBay’s Teradata System Reduces Processing Time by a Factor of 4

Arun Akkinapalli, eBay Inc, San Jose, California
Gurudev Karanth, eBay Inc, San Jose, California
Mithun Yarlagadda, eBay Inc, San Jose, California
John Prenner, eBay Inc, San Jose, California

Abstract

In-database processing refers to the integration of data analytics into data warehousing functionality. Many statistical computing solutions and large databases use this technology because it provides significant performance improvements over traditional methods. Working efficiently with huge datasets, consisting of millions of rows and hundreds of columns and adding up to gigabytes of storage, is a challenge that many users and organizations face today. Beyond the sheer volume of data, additional constraints include end-to-end processing time, the implications of transferring processing to the database, storage space, system resources and data transfer.

Using SAS in-database processing on the Teradata Enterprise Data Warehouse platform for the production of experimentation results has reduced end-to-end processing time by a factor of 4 at eBay Inc. This improvement in overall throughput has given eBay Inc. about 30% additional processing capacity, and thereby enabled the evaluation of more experiments this year. This paper details the implementation, which combines Teradata’s MPP (Massively Parallel Processing) architecture with SAS procedures, and illustrates the differences between standard processing and in-database processing.

Introduction

SAS In-Database processing integrates SAS solutions, SAS analytic processes, and third-party database management systems. Using SAS In-Database processing, certain SAS procedures, scoring models, and formatted SQL queries can be executed inside the database. SAS In-Database processing is a flexible, efficient way to leverage increasing amounts of data by integrating select SAS technology into databases or data warehouses. It utilizes the massively parallel processing (MPP) architecture of the database or data warehouse for scalability and better performance. Moving relevant data management, analytics and reporting tasks to where the data resides is beneficial in terms of speed, reducing unnecessary data movement and promoting better data governance. For decision makers, this means faster access to analytical results and more agile and accurate decisions.

SAS plays a vital role in the production of experimentation results at eBay. This involves processing a large amount of data (~250 million records with ~150 variables) within a finite time frame. The volume of data to be processed has raised several challenges around data preparation and transformation, data transfer between Teradata and SAS, dataset storage and optimum system resource utilization, leading to increased dataset processing costs and scalability bottlenecks.

By leveraging SAS in-database processing, we have been able to move data processing to where the data resides, with benefits in terms of speed, reduced dataset volume, elimination of data movement and better data governance. A noticeable increase in capacity and a reduction in processing time in the production of experimentation results reaffirm the utility of this methodology.

This paper goes over the implementation details and is organized as follows. The first section covers the standard process and illustrates its performance metrics and challenges with a use case. The second and third sections provide an overview of in-database processing, go over usage details, and illustrate the performance metrics with the same use case. The fourth and last section compares the system resource utilization of the standard process and the in-database process, and covers the advantages of in-database processing at eBay and future areas of optimization.

The SAS code included in this paper is functional on SAS 9.3 in a UNIX environment.
Please note that the output may differ for different settings and environments.

Standard Process

OUTLINE:
Experimentation results are produced from summarized Teradata data structures that are processed on a daily basis. The experimentation result production process involves the creation of a SAS dataset from an underlying summarized Teradata table. This process consists of four major steps, all of which are triggered by a SAS wrapper script.

Initially, the parameters necessary for analysis are entered into a text file (the parameter source file) as SAS macro variables. Next, a sequence of SQL statements is executed on the database, which yields a summarized Teradata table. This table acts as the source data for all the subsequent SAS processes. In the next step, the SAS program uses the fast export process with multiple sessions to transfer the Teradata table onto the SAS server, where a SAS dataset is created. The analysis is then performed on this SAS dataset to provide test analysis at different dimensions based on multiple aggregations. The output after analysis of each dimension is appended to the final SAS output dataset. The final step transfers the results back from the SAS dataset to a Teradata table using the fast load process. This processed dataset is visualized on different dashboards by using a BI tool for end user consumption. The process is outlined in Figure 1.

Figure 1: Standard Process

EXAMPLE:
A typical test analysis usually consists of up to 250 million records with 150 variables. The table below gives the approximate run time of the different steps of the process for a dataset of 91 million records with 150 variables.

STEPS                       DESCRIPTION                                                            TIME TAKEN (minutes)
SQL’s on Teradata           Step to create a Teradata table (input to SAS process)                 32
Data Transfer               Transfer of Teradata table to SAS dataset using fast export process    52
Statistical computation     Processing of data at different dimensions                             233
Transfer (SAS – Teradata)   Transfer of output SAS dataset to Teradata table                       6
Total                                                                                              323

Table 1: Run times at different steps of standard process

The above table shows that the SQL statements take around 30 minutes to create a summarized Teradata table from the underlying base tables. Data transfer from Teradata to SAS using the fast export process with multiple sessions took about an hour of processing time. Also note that as the volume of data increases, the transfer time increases proportionally. The test analysis is provided for different dimensions of the SAS dataset. The SAS dataset is aggregated multiple times using PROC SQL, and analysis is performed on the summary dataset created after each aggregation. This part of the statistical computation takes around 4 hours. The final transfer from the SAS dataset to Teradata needs around 6 minutes. On the whole, the standard process consumed a little over 5 hours.

There are several challenges with this process. Firstly, it uses the fast export process with multiple sessions, which may take a lot of bandwidth on the server, and the data transfer may fail if the volume of data is very large. Processing such a volume of data also needs substantial system resources, so only one or two runs can be processed at a time. This limits the number of tests that can be analyzed in a given time frame. Finally, the process does not take advantage of the highly optimized and tuned database for aggregations, which in turn increases the run times. These and other challenges led to the use of in-database processing to produce experimentation results, as described in the sections below.
Sample code for this process is provided at the end of this paper.

What is In-Database Processing?

SAS in-database processing on a Teradata system is a flexible and efficient way to leverage huge amounts of data by exploiting the massively parallel processing capabilities of Teradata to run different SAS procedures on the data warehouse. The Teradata system architecture is a ‘shared nothing’ environment, which means each unit is parallel and independent of the other units. Many individual servers are interconnected and act as a single large system. There are two types of virtual processes in Teradata: Access Module Processors, which do all the data crunching, and the Parsing Engine, which interfaces with all the applications and users to throttle the load. SAS in-database processing takes advantage of this massively parallel architecture: SAS work is transferred into each of these units, which enables the data to be processed in parallel without any data transfer between Teradata and SAS.

SAS In-Database processing is a flexible, efficient way to leverage increasing amounts of data by integrating select SAS technology into databases or data warehouses. It utilizes the massively parallel processing (MPP) architecture of the database or data warehouse for scalability and better performance. Moving relevant data management, analytics and reporting tasks to where the data resides is beneficial in terms of speed, reducing unnecessary data movement and promoting better data governance. For decision makers, this means faster access to analytical results and more agile and accurate decisions.
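As an illustration of the idea described above, the following is a minimal sketch of how a Base SAS procedure can be asked to run inside Teradata. The system options are the ones used in the sample code at the end of this paper; the libref `tdlib`, the database name `sandbox`, the table `exp_summary`, the TDPID value and the macro variables `&username` and `&pwd` are hypothetical placeholders, not values from eBay's environment.

```sas
/* Ask SAS to generate SQL for eligible procedures and trace it in the log. */
/* These option values mirror the in-database sample code in this paper.    */
options sqlgeneration=dbms msglevel=i
        sastrace=',,,d' sastraceloc=saslog nostsuffix;

/* Hypothetical connection: tdpid, database and credentials are placeholders. */
libname tdlib teradata user=&username password="&pwd"
        database=sandbox tdpid="xxxxxx";

/* PROC FREQ is one of the Base SAS procedures listed as in-database capable. */
/* exp_summary is a hypothetical Teradata table with a trmt_id column.        */
proc freq data=tdlib.exp_summary;
   tables trmt_id;
run;
```

With tracing enabled, the SAS log is where one would expect to see the SQL that was passed to Teradata, which is a practical way to confirm whether a given step actually ran in the database or fell back to processing on the SAS server.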
BASE SAS PROCEDURES AVAILABLE FOR IN-DATABASE PROCESSING
Every organization with Base SAS 9.2M2, SAS/ACCESS 9.2 and Teradata 12 or higher can take advantage of the Base SAS procedures listed below without any of the additional components described in the next part:
- FREQ (aggregation and summarization)
- MEANS (aggregation and summarization)
- SUMMARY (aggregation and summarization)
- RANK (ranking and sorting)
- TABULATE (aggregation and summarization)
- REPORT (aggregation and summarization)
- SORT (ranking and sorting)

COMPONENTS OF IN-DATABASE PROCESSING
There are three different components of in-database processing.

Scoring acceleration for Teradata
Scoring acceleration enables the translation and execution of scoring models created in SAS Enterprise Miner directly in the database environment. This reduces data transfer, improves the overall performance of modeling and automates the whole process.
The minimum system requirements for the scoring accelerator on a Teradata system are Base SAS 9.2M2, SAS/ACCESS 9.2, SAS Enterprise Miner, SAS Scoring Accelerator and Teradata 12 or higher.

Analytics acceleration for Teradata
SAS analytics acceleration enables the execution of key SAS/STAT and data summarization functions in the Teradata system. This reduces the processing time needed to build, execute and deploy predictive models. The SAS/STAT and SAS/ETS procedures that are enabled are:
- CORR (correlation)
- CANCORR (canonical correlation)
- FACTOR (factor analysis)
- PRINCOMP (principal components)
- REG (regression analysis, including stepwise regression)
- SCORE (scoring of linear models)
- VARCLUS (grouping variables into clusters)
- TIMESERIES (analyzes time-stamped transactional data and aggregates the data into a time series format for trending and seasonal analysis)
The minimum system requirements for the analytics accelerator on a Teradata system are Base SAS 9.2M2, SAS/STAT 9.2M2, SAS/ACCESS 9.2, SAS Analytics Accelerator 1.2 and Teradata 12 or higher.

In-database analytics with a Teradata Enterprise Data Warehouse
SAS and Teradata Corporation have created the SAS and Teradata Analytic Advantage Program to provide in-database analytics that enable organizations to minimize risk, maximize value and increase efficiency. The packages are as follows:
- Express: SAS Analytics Accelerator for Teradata, SAS Analytics Pro, SAS/ACCESS Interface to Teradata and SAS Enterprise Guide
- Advanced: SAS Analytics Pro, SAS/ACCESS Interface to Teradata, SAS Enterprise Miner and SAS Scoring Accelerator for Teradata
- Enterprise: SAS Analytics Accelerator for Teradata, SAS Analytics Pro, SAS/ACCESS Interface to Teradata, SAS Enterprise Miner, SAS Model Manager and SAS Scoring Accelerator for Teradata

ADVANTAGES OF IN DATABASE PROCESSING
- Input and output costs are reduced, as there is no transfer of data between SAS and Teradata.
- Users have more bandwidth on the SAS server, as in-database processing avoids using single or multiple sessions of the fast export process for data transfer.
- In most cases, summarizing large amounts of data on the DBMS side is faster, as most database systems are highly optimized, scalable and tuned.
- Since the workload is shared between the database and the SAS system, large amounts of data can be processed at the same time.

In-Database Process

In this process, aggregations take place in the database and additional processing is done in SAS. This process consists of four major steps, all of which are triggered by a SAS wrapper script. Initially, the parameters necessary for analysis are extracted from a Teradata table as SAS macro variables.
Next, a sequence of SQL statements is executed on the database, which yields a summarized Teradata table. This table acts as the source data for all the further processes. In the next step, the SAS program uses an in-database procedure (PROC MEANS) to aggregate the Teradata table at different dimensions and create summary SAS datasets at different levels. These summary datasets are used for all further processing. The next steps are similar to the standard process, where the output after analysis of each dimension is appended to the final SAS output dataset. The final step returns the results from the SAS dataset to a Teradata table using the fast load process. This processed dataset is visualized on different dashboards by using a BI tool for end user consumption. The process is outlined in Figure 2.

Figure 2: In-database Process

EXAMPLE:
A typical test analysis usually consists of up to 250 million records with 150 variables. The table below gives the approximate run time of the different steps of the process for a dataset of 91 million records with 150 variables.

STEPS                       DESCRIPTION                                                            TIME TAKEN (minutes)
SQL’s on Teradata           Step to create a Teradata table (input to SAS process)                 32
Data Transfer               Transfer of Teradata table to SAS dataset using fast export process    0
Statistical computation     Processing of data at different dimensions                             45
Transfer (SAS – Teradata)   Transfer of output SAS dataset to Teradata table                       6
Total                                                                                              83

Table 2: Run times at different steps of In-database process

The above table illustrates that the SQL statements take around 30 minutes to create a Teradata table from the raw tables. There is no data transfer from Teradata to SAS. Statistical computation takes around 45 minutes, as the aggregation of data is pushed to the database side and a summary SAS dataset is created, which acts as the source for further processing. The final transfer from the SAS dataset to Teradata needs around 6 minutes.
On the whole, the in-database process consumed around 1.5 hours. Sample code for this process is provided at the end of this paper, and the process is also outlined in Figure 2.

Comparison of the Standard process with the In-Database process

The initial step of running SQL statements on the database is the same in both processes and takes around 32 minutes. The data transfer step is completely eliminated in the in-database process, because aggregations are performed in the database rather than on the SAS server. The run time for statistical computation is drastically lower for the in-database process than for the standard process. This is achieved by using PROC MEANS (a SAS in-database procedure), which creates SQL using the SQL generator. These SQL statements are executed on the database, aggregating the Teradata table multiple times by taking advantage of the highly optimized Teradata system. In the standard process, on the other hand, the data is transferred from the Teradata table to a SAS dataset and aggregations are performed on this SAS dataset. The final step remains the same in both processes: the SAS dataset is loaded back into a Teradata table. The total time taken by the in-database process is about one quarter of that taken by the standard process, as illustrated by the bar graph below.

[Bar chart "Use Case: InDatabase vs Standard Processing": run time in minutes (0 to 350) for the In-database Method and the Standard Method at a volume of 91 million records]

Figure 3: Comparison of the total time of standard and in-database processes

Table 3 and the scatter chart (Figure 4) below illustrate the total run time of some additional use cases with varied volumes, which show a consistent reduction of around 4X for the in-database process.
Volume         In-database process (minutes)   Standard process (minutes)   Difference (minutes)
61 million     70                              251                          181
79 million     78                              260                          182
91 million     83                              323                          240
112 million    104                             430                          326
141 million    110                             522                          412

Table 3: Total run time for standard and In-database processes at varied volumes

Figure 4: Comparison of run time of standard and in-database processes at varied volumes

As the above chart and Table 3 indicate, the run time of the standard process grows much more steeply with data volume than that of the in-database process: going from 61 million to 141 million records adds 271 minutes to the standard process but only 40 minutes to the in-database process, with reduced overall processing cost.

How did In-database process help eBay Inc?

eBay Inc. benefited from the in-database process through both a reduction in end-to-end processing times and an increase in processing throughput. There has been an overall reduction in system resource utilization, allowing for a greater degree of parallelism. Disk space utilization has also fallen, because the data no longer has to be stored on the SAS server. As the processing time has gone down, additional insights were provided for all the tests without a major increase in run time. The process also consumed less data transfer bandwidth on both servers, so other processes were able to use the servers with no additional upgrades. It is also less prone to issues such as timed-out sessions and terminations.

Limitations of In-database processing

Only a limited set of SAS procedures supports in-database processing, which restricts its use to specific areas. In-database processing on small volumes of data may not be efficient and may in turn increase the overall resources needed for the test analysis, so standard SAS processing is recommended for such cases. Further, it may not be a viable option if resources on the database side are already at their peak consumption, as it may cause additional overhead on the database.
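Returning to the run times in Table 3, the "factor of 4" figure can be recomputed directly. The following is a small sketch that reads the five volume and run-time pairs from Table 3 and derives the time saved and the speed-up ratio for each; the dataset and variable names (`speedup`, `indb_min`, `std_min` and so on) are illustrative choices, not names used in the eBay process.

```sas
/* Run times from Table 3: volume in millions of records, minutes per process. */
data speedup;
   input volume_millions indb_min std_min;
   diff_min = std_min - indb_min;   /* minutes saved by the in-database process  */
   ratio    = std_min / indb_min;   /* how many times faster in-database ran     */
   format ratio 5.2;
   datalines;
61 70 251
79 78 260
91 83 323
112 104 430
141 110 522
;
run;

/* List the derived columns next to the inputs. */
proc print data=speedup noobs;
run;
```

On these figures the ratio ranges from roughly 3.3 (79 million records) to roughly 4.7 (141 million records), which is consistent with describing the reduction as around 4X.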
Conclusion

By combining the power of SAS with database technology enabled for massively parallel processing, faster and cheaper analytics can be achieved. The key value proposition, besides cost savings in human and infrastructure resources, is that more data can be processed in shorter time periods with enhanced data security. With quicker turnarounds, more iterations of refinement with experiments are possible.

Leveraging in-database functionality added value to our current processes by:
- Reducing processing time by a factor of 4
- Enabling analytic SAS functions by moving processing to the data
- Increasing throughput by reducing system resource utilization and allowing for a greater degree of parallelism
- Increasing performance and reducing data movement
- Lowering the total cost of ownership
- Enhancing the effectiveness of analysts, who can stay focused on higher-value tasks
- Scaling SAS processes inside the Teradata system
- Optimizing resource utilization across the analytic/warehousing environment
- Improving data quality and consistency
- Increasing process robustness by reducing implementation restrictions

REFERENCES
Scott Mebust and Robert S. Ray Mebust. 2010. “SAS Presents In-Database Procedures in Practice.” Proceedings of the SAS Global Forum 2010 Conference. Paper 300-2010. Cary, NC: SAS Institute Inc. Available at http://support.sas.com/resources/papers/proceedings10/300-2010.pdf
SAS Institute. 2012. SAS 9.3 Documentation, SAS® 9.3 In-database Products. Available at http://support.sas.com/documentation/cdl/en/indbug/64690/PDF/default/indbug.pdf
SAS Institute. 2012. SAS® High performance analytics. Available at http://www.sas.com/software/high-performance-analytics/in-database-processing/index.html
David Shamlin and David Duling. 2009. “In-Database Procedures with Teradata: How they work and What they buy you.” Paper 337-2009. Cary, NC: SAS Institute Inc. Available at http://support.sas.com/resources/papers/sgf09/337-2009.pdf

ACKNOWLEDGMENTS
The authors are thankful to eBay for allowing them to use the valuable information. Special thanks to John Scheibmeir and Daryl Tilley for their help all through the endeavor.

RECOMMENDED READING
http://support.sas.com/resources/papers/proceedings10/300-2010.pdf
http://support.sas.com/resources/papers/sgf09/337-2009.pdf
http://support.sas.com/documentation/cdl/en/indbug/64690/PDF/default/indbug.pdf

CONTACT INFORMATION
Your comments and questions are valued and encouraged. Contact the authors at:

Arunkumar Akkinapalli
eBay Inc
2525 North 1st Street
San Jose, CA, 95131
[email protected]

Karanth Gurudev
eBay Inc
2145 Hamilton Avenue
San Jose CA, 95125
[email protected]

Mithun Yarlagadda
eBay Inc
2145 Hamilton Avenue
San Jose CA, 95125
[email protected]

John Prenner
eBay Inc
2145 Hamilton Avenue
San Jose CA, 95125
[email protected]

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.
SAMPLE CODE: STANDARD PROCESS

options macrogen mprint serror nosymbolgen;
options mergenoby = warn;
options nocenter;
options compress = Y;
options fullstimer;
options cleanup;
options replace;
options mautosource;

%let tname = adhoc;

/* fast export process with multiple sessions */
libname library '/sas/xxx/xxxxxxxx';
options dbsliceparm=(threaded_apps,8);

PROC SQL;
   CONNECT TO TERADATA AS TD (USER=&username PASSWORD=&PWD DATABASE=XXXXXXX
      logdb=XXXXXX fastexport=yes TDPID="xxxxxx" mode=teradata);
   create table library.sas_dataset as
      select * from connection to td
         (SELECT * from xxxxxx.table_name a);
QUIT;

/* aggregating the sas dataset to create a summary SAS dataset */
PROC SQL noprint threads;
   create table wm2 as
      select week,
             trmt_id,
             coalesce(sum(metric_1),0) as metric_1,
             uid
      from library.sas_dataset
      group by week, trmt_id, uid
      having max(trd_uid) in (1,0);
QUIT;

PROC MEANS DATA=wm2 FW=12 NWAY VARDEF=DF MEAN STD SUM MEDIAN NOPRINT;
   VAR metric_1;
   CLASS week;
   CLASS trmt_id;
   OUTPUT OUT=a MEAN()= STD()= SUM()= MEDIAN()= / AUTONAME AUTOLABEL WAYS;
RUN;

/* appending the output of test analysis on summary dataset 'a' */
/* (the steps that derive library.summary_&tname. from 'a' are not shown in this listing) */
DATA library.final&tname.;
   SET library.final&tname. library.summary_&tname.;
RUN;

/* inserting the SAS dataset into Teradata table using fast load process */
libname trlib teradata USER=&username PASSWORD=&pwd database=xxxxxxx tdpid="XXXXX";

proc sql;
   create table trlib.int_&tname.(fastload=yes) as
      select * from library.final&tname.;
quit;

proc sql;
   CONNECT TO TERADATA AS TD (USER=&username PASSWORD=&PWD DATABASE=xxxxx
      TDPID="xxxxxx" mode=teradata);
   execute (insert into TEST_ANALYSIS_FINAL SELECT * from int_&tname) by td;
quit;

IN-DATABASE PROCESS:

options macrogen mprint serror symbolgen;
options mergenoby = warn;
options nocenter;
options compress = Y;
options fullstimer;
options cleanup;
options replace;
options mautosource;

/* Options that trigger in-database process */
options SQLGENERATION=DBMS MSGLEVEL=I sastrace=',,,d' sastraceloc=saslog nostsuffix;

%let tname = adhoc;

/* same SAS library as in the standard process, used for the final output dataset */
libname library '/sas/xxx/xxxxxxxx';

/* aggregating the Teradata table to create summary sas dataset */
libname indb teradata USER=&username PASSWORD="&pwd" database=&pet_space. tdpid="XXXXXX";

proc sql;
   CONNECT TO TERADATA AS TD (USER=&username PASSWORD="&pwd" DATABASE=xxxxx
      TDPID="vivaldi" mode=teradata);
   execute (INSERT INTO xxxxx.EXL_WHL_MKT_&tname._W
            select week, trmt_id, user_id, sum(metric_1) as metric_1
            from xxxxxx.table_name
            group by 1,2,3
            having max(trd_uid) in (1,0)) by td;
quit;

PROC MEANS DATA=indb.EXL_WHL_MKT_&tname._W FW=12 NWAY VARDEF=DF N MEAN STD SUM NOPRINT;
   VAR metric_1;
   CLASS week;
   CLASS trmt_id;
   OUTPUT OUT=a N()= MEAN()= STD()= SUM()= / AUTONAME AUTOLABEL WAYS;
RUN;

/* appending the output of test analysis on summary dataset 'a' */
/* (the steps that derive library.summary_&tname. from 'a' are not shown in this listing) */
DATA library.final&tname.;
   SET library.final&tname. library.summary_&tname.;
RUN;

/* inserting the SAS dataset into Teradata table using fast load process */
libname trlib teradata USER=&username PASSWORD=&pwd database=xxxxxxx tdpid="XXXXX";

proc sql;
   create table trlib.int_&tname.(fastload=yes) as
      select * from library.final&tname.;
quit;

proc sql;
   CONNECT TO TERADATA AS TD (USER=&username PASSWORD=&PWD DATABASE=xxxxx
      TDPID="xxxxxx" mode=teradata);
   execute (insert into TEST_ANALYSIS_FINAL SELECT * from int_&tname) by td;
quit;
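The outline of the in-database process states that the analysis parameters are extracted from a Teradata table as SAS macro variables, a step that the listing above replaces with a fixed `%let`. The following is a minimal sketch of how such an extraction might look using the `INTO` clause of PROC SQL. The libref `parmlib`, the table `test_parameters`, its columns `test_name`, `start_week` and `test_id`, and the connection values are all hypothetical and are not taken from eBay's process.

```sas
/* Hypothetical connection: tdpid, database and credentials are placeholders. */
libname parmlib teradata user=&username password="&pwd"
        database=sandbox tdpid="xxxxxx";

/* Read one row of a hypothetical parameter table into macro variables.  */
/* TRIMMED removes leading and trailing blanks from the stored values.   */
proc sql noprint;
   select test_name, start_week
      into :tname trimmed, :start_week trimmed
      from parmlib.test_parameters
      where test_id = 101;
quit;

/* Write the resolved values to the log for a quick check. */
%put NOTE: tname=&tname start_week=&start_week;
```

Keeping the parameters in a database table rather than a text file means the wrapper script and the SQL steps read from the same source, which is one plausible reason for the change between the two outlines.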