
Overview of In-database Processing

Systems Seminar Consultants | www.sys-seminar.com | 608-278-9964

An exciting enhancement to SAS was included in SAS version 9.2. In this release the SAS/ACCESS product was enhanced to enable in-database processing. The new in-database technology allows select procedures to process inside the database, so the programmer can use the power of SAS while leveraging the scalability and processing power of the database platform.

Prior to SAS version 9.2, procedures used conventional processing, where most of the work is performed by the machine running SAS. When analyzing data from the database, the selected data is transferred from the database table to local memory and storage, and all processing is completed by the procedure in SAS. Because the data must first be transferred from the database, large tables consume a significant amount of resources for the transfer. As a result, performance and space issues can be encountered.

In-database technology was introduced in version 9.2 for Teradata, DB2 under UNIX and PC, and Oracle. Netezza was included with the release of version 9.3. Seven base procedures have in-database capabilities:

    FREQ
    SORT
    MEANS
    SUMMARY
    RANK
    TABULATE
    REPORT

Additional procedures are available for Teradata. These additional procedures require SAS Analytics Accelerator to run in the database.

In-database processing must be turned on via the SQLGENERATION option:

    option sqlgeneration=dbms;

This option may also be set on the LIBNAME statement for specific database references:

    libname teralib teradata user=dbuser password=dbpwd
            database=tera_tables sqlgeneration=dbms;

When in-database processing is turned on and an eligible procedure is executed, SAS generates an SQL query that is the equivalent of the procedure's calculations. The query is passed to the database via pass-through. The database executes the query and returns the result set to SAS.
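
To see the effect of the option on a single step, the same procedure can be run once with SQL generation switched off and once with it switched on. The following is a minimal sketch rather than a tested benchmark. It assumes a libref named teralib that was assigned without SQLGENERATION= on the LIBNAME statement, so that the system option is the only setting in play; the table name employee and the variable storeno are placeholders.

```sas
/* Conventional processing: rows are read into SAS and PROC FREQ counts them there. */
options sqlgeneration=none;

proc freq data=teralib.employee;   /* placeholder table name */
   tables storeno;                 /* placeholder variable name */
run;

/* In-database processing: SAS generates SQL and the database does the aggregation. */
options sqlgeneration=dbms;

proc freq data=teralib.employee;
   tables storeno;
run;
```

Comparing the elapsed time that the log reports for each step gives a rough sense of what the database is saving on a given table. Results will depend on table size, network, and database load.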

Depending on the procedure and the statements and options specified, SQL may not be able to perform the entire analysis, and additional analysis may be required by SAS. After the query results are returned to SAS, the procedure completes the remaining analysis and produces the final SAS data set or report.

Since the database executes the query and returns only the results, the amount of data transferred is reduced. The reduction in data movement improves the overall performance of the procedure, and the smaller volume of data crossing the network frees up bandwidth for other uses. Moving the workload to the database hardware rather than the PC running SAS also makes better use of computing resources.

How do I know the procedure ran in the database?

When looking at the log, the notes after the procedure provide information on the number of records fetched from the database. If the table holds millions of records and this count is only a few dozen, then the processing was most likely done within the database. Often, though, we are looking for a more definitive message to tell us where the processing took place. Changing the message level to I allows additional notes about in-database processing to be printed.

    option msglevel=i;

SASTRACE may also be turned on to display in the log the query passed to the database.

    option sastrace=',,,d' sastraceloc=saslog nostsuffix;

Once these options are turned on, the log messages for a PROC FREQ executed with in-database processing will appear as follows:

    10   proc freq data=teralib.employee;
    11      table storeno;
    12   run;

    NOTE: SQL generation will be used to construct frequency and crosstabulation tables.

    TERADATA: Executed: on connection 4
    select COUNT(*) as "ZSQL1", case when COUNT(*) > COUNT(TXT_1."storeno") then ' ' else MIN(TXT_1."storeno") end as "ZSQL2" from "sas"."employee" TXT_1 group by TXT_1."storeno"

    TERADATA: trget - rows to fetch: 9

PROC MEANS Conventional Processing vs. In-database Processing

Within our program we have a simple PROC MEANS:

    proc means data=teralib.callcenter min max;
       var sales;
       class division;
    run;

Execution of the procedure produces the following report:

    The MEANS Procedure

    Analysis Variable : SALES

                  N
    DIVISION    Obs       Minimum       Maximum
    -------------------------------------------
    C            23         98.11       5691.78
    H            52        127.52      15689.34
    S            48         12.73       2418.96
    -------------------------------------------

If this procedure is executed using conventional processing, all 123 records are transferred from the database table CALLCENTER using the following query:

    SELECT "sales", "division"
    FROM sas."callcenter"

The summarization and reporting of the data is completed within SAS.

On the other hand, if this procedure is executed using in-database processing, the summarization occurs within the database. SAS generates the following query for the database to execute:

    SELECT COUNT(*) as "ZSQL1",
           MIN(TXT_1."division") as "ZSQL2",
           COUNT(*) as "ZSQL3",
           COUNT(TXT_1."sales") as "ZSQL4",
           MIN(TXT_1."sales") as "ZSQL5",
           MAX(TXT_1."sales") as "ZSQL6"
    FROM sas."callcenter" TXT_1
    GROUP BY TXT_1."division"

In this instance, the result set transferred back to SAS is only three records, one per division. Since the summarization is complete, SAS only needs to format the final report.

Limitations

In-database processing does have a few limitations. Data set options such as RENAME=, OBS=, and FIRSTOBS= will prevent in-database processing. Each procedure may also have limitations of its own. Refer to the SAS documentation for a specific procedure to determine which statements and options are supported in-database. Even when in-database processing is turned on, if an unsupported statement or option is encountered the procedure will execute using conventional processing.
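
The fallback to conventional processing can be checked directly. The sketch below assumes the teralib libref and the CALLCENTER table from the PROC MEANS example, and uses OBS=, one of the data set options listed above as blocking in-database processing. The value 100 is arbitrary, and the exact log notes vary by SAS release, so none are reproduced here.

```sas
/* Turn on in-database processing and the extra informational notes. */
options sqlgeneration=dbms msglevel=i;

/* OBS= is a data set option, so per the limitation above this step is
   expected to fall back to conventional processing. */
proc means data=teralib.callcenter(obs=100) min max;
   var sales;
   class division;
run;

/* Without the data set option the same step is eligible to run in the database. */
proc means data=teralib.callcenter min max;
   var sales;
   class division;
run;
```

Reading the notes that follow each step in the log shows which of the two actually ran in the database.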

Efficient Use of In-database Processing

While in-database processing adds efficiency to our SAS programs by transferring the workload to the database, there are still steps we can take to ensure we are writing the most efficient program possible. Following common efficiency practices, such as subsetting the data with WHERE and limiting the variables selected with DROP and KEEP, will improve the overall performance.

Awareness of what is happening in the database and what prevents in-database operations is important. It may be necessary to consider procedure-level adjustments to the program to fully utilize the technology.

Finally, it is important to be aware of times when in-database processing should not be used. When the procedure is working with a small volume of data, or the aggregation does little to reduce the data volume, there is little to gain from in-database processing.

Conclusion

In-database processing is a new technology that can add efficiency to SAS programs by reducing the amount of data transferred between a database and the SAS environment. Use of this technology allows the programmer to use the power of SAS while leveraging the scalability and processing power of the database platform.
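
As an illustration of the efficiency practices discussed above, the PROC MEANS example can be narrowed with a WHERE statement so that fewer rows are summarized. This is a sketch under assumptions: it reuses the teralib libref and the CALLCENTER table, takes the division codes C and H from the sample report, and assumes the WHERE condition is one that can be folded into the generated query. The log notes produced with MSGLEVEL=I should be used to confirm that.

```sas
/* Turn on in-database processing and the extra informational notes. */
options sqlgeneration=dbms msglevel=i;

proc means data=teralib.callcenter min max;
   where division in ('C', 'H');   /* subset the rows before summarizing */
   var sales;                      /* analyze only the column that is needed */
   class division;
run;
```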
