Overview of In-database Processing
Systems Seminar Consultants | www.sys-seminar.com| 608-278-9964
SAS version 9.2 included an exciting enhancement: the SAS/ACCESS product gained support for
in-database processing. The new in-database technology allows select procedures to run inside
the database, so the programmer can use the power of SAS while leveraging the scalability and
processing power of the database platform.
Prior to SAS version 9.2, procedures used conventional processing, in which most of the work is
performed by the machine running SAS. When analyzing data from a database, the selected data is
transferred from the database table to local memory and storage, and all processing is completed
by the procedure in SAS. Because the data must first be transferred from the database, large
tables consume a significant amount of resources for the transfer. As a result, performance and
space issues can be encountered.
In-database technology was introduced in version 9.2 for Teradata, DB2 under UNIX and PC, and Oracle.
Netezza was included with the release of version 9.3. Seven base procedures have in-database
capabilities:
FREQ
SORT
MEANS
SUMMARY
RANK
TABULATE
REPORT
Additional procedures are available for Teradata. These additional procedures require SAS Analytics
Accelerator to run in the database.
In-database processing must be turned on via the SQLGENERATION option:
option sqlgeneration=dbms;
This option may also be set on the LIBNAME statement for specific database references:
libname teralib teradata user=dbuser
password=dbpwd
database=tera_tables
sqlgeneration=dbms;
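Before relying on either form, it can help to confirm which value is actually in effect. The sketch below is one way to check; it assumes the standard PROC OPTIONS procedure, which is not mentioned in the original text, and it only inspects the system option, not a value set on a LIBNAME statement.

```sas
/* Show the SQLGENERATION value currently in effect (written to the log). */
proc options option=sqlgeneration;
run;

/* Turn in-database processing on, as described above. */
options sqlgeneration=dbms;

/* Check again to confirm that the change took effect. */
proc options option=sqlgeneration;
run;
```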
When in-database processing is turned on and an eligible procedure is executed, SAS generates SQL.
This query is the SQL equivalent to the procedure calculations. The query is passed to the database via
pass-through. The database executes the query and returns the result set to SAS. Depending on the
procedure and the statements and options specified, SQL may not be able to perform the entire analysis
and additional analysis may be required by SAS. After the query results are returned to SAS the
procedure completes the remaining analysis and produces the final SAS data set or report.
Since the database is executing the query and returning only the results, the amount of data transferred
is reduced. The reduction in data movement improves the overall performance of the procedure. The
smaller volume of data transferred across the network frees up bandwidth for other uses. The transfer
of the workload to the database hardware rather than the PC running SAS provides a better use of
computing resources.
How do I know the procedure ran in the database?
When looking at the log, the notes after the procedure report the number of records fetched
from the database. If the table holds millions of records and this count is only a few dozen,
the processing was most likely done within the database. Often, though, we want a more
definitive message telling us where the processing took place. Changing the message level to
I prints additional notes about in-database processing.
option msglevel=i;
SASTRACE may also be turned on to display the query passed to the database in the log.
option sastrace=',,,d' sastraceloc=saslog nostsuffix;
Once these options are turned on the log messages for a PROC FREQ executed with in-database
processing will appear as follows:
10   proc freq data=teralib.employee;
11   table storeno;
12   run;
NOTE: SQL generation will be used to construct frequency and
crosstabulation tables.
TERADATA: Executed: on connection 4
select COUNT(*) as "ZSQL1", case when COUNT(*) >
COUNT(TXT_1."storeno") then ' ' else MIN(TXT_1."storeno") end as
"ZSQL2" from "sas"."employee" TXT_1 group by TXT_1."storeno"
TERADATA: trget - rows to fetch: 9
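Because the trace output can make the log very long, it may be convenient to turn the diagnostics on for a single step and off again afterwards. The following is a minimal sketch of that pattern, reusing the TERALIB.EMPLOYEE example; the SASTRACE=OFF and MSGLEVEL=N settings used to restore quieter logging are assumptions that do not appear in the original text.

```sas
/* Turn on the extra notes and the SQL trace described above. */
options msglevel=i sastrace=',,,d' sastraceloc=saslog nostsuffix;

proc freq data=teralib.employee;
   table storeno;
run;

/* Assumption: SASTRACE=OFF stops the trace and MSGLEVEL=N
   returns the log to the standard level of notes. */
options sastrace=off msglevel=n;
```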
PROC MEANS Conventional Processing vs. In-database Processing
Within our program we have a simple PROC MEANS:
proc means data=teralib.callcenter min max;
var sales;
class division;
run;
Execution of the procedure produces the following report:
The MEANS Procedure

Analysis Variable : SALES

              N
DIVISION    Obs       Minimum       Maximum
-------------------------------------------
C            23         98.11       5691.78
H            52        127.52      15689.34
S            48         12.73       2418.96
-------------------------------------------
If this procedure is executed using conventional processing all 123 records are transferred from the
database table CALLCENTER using the following query:
SELECT "sales"
, "division"
FROM sas."callcenter"
The summarization and reporting of the data is completed within SAS.
On the other hand, if this procedure is executed using in-database processing the summarization occurs
within the database. SAS generates the following query for the database to execute:
SELECT COUNT(*) as "ZSQL1"
, MIN(TXT_1."division") as "ZSQL2"
, COUNT(*) as "ZSQL3"
, COUNT(TXT_1."sales") as "ZSQL4"
, MIN(TXT_1."sales") as "ZSQL5"
, MAX(TXT_1."sales") as "ZSQL6"
FROM sas."callcenter" TXT_1
GROUP BY TXT_1."division"
In this instance, the result set is only three records to be transferred back to SAS. Since the
summarization is complete, the SAS processing only needs to format the final report.
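One way to see the two behaviors side by side is to run the same step under each setting with tracing turned on, then compare the generated queries and the fetched row counts in the log. The sketch below makes two assumptions that are not in the original text: that SQLGENERATION=NONE requests conventional processing, and that TERALIB was assigned without its own SQLGENERATION= option on the LIBNAME statement.

```sas
/* Show in-database notes and the SQL passed to the database. */
options msglevel=i sastrace=',,,d' sastraceloc=saslog nostsuffix;

/* Run 1: conventional processing.
   Assumption: NONE tells SAS not to generate SQL for the procedure. */
options sqlgeneration=none;
proc means data=teralib.callcenter min max;
   var sales;
   class division;
run;

/* Run 2: in-database processing. */
options sqlgeneration=dbms;
proc means data=teralib.callcenter min max;
   var sales;
   class division;
run;
```

Both runs should print the same report; only the query sent to the database and the number of rows returned should differ.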
Limitations
In-database processing has a few limitations. Data set options such as RENAME=, OBS=, and
FIRSTOBS= prevent in-database processing. Each procedure may also have its own limitations;
refer to the SAS documentation for that procedure to determine which statements and options
are supported in the database. Even when in-database processing is turned on, a procedure that
encounters an unsupported statement or option executes using conventional processing.
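The fallback can be observed directly by running the same procedure with and without a restricted data set option. This is a sketch under the assumption that the MSGLEVEL=I notes are enough to distinguish the two cases; the OBS=1000 value is arbitrary, and the two steps do not produce the same counts, because the first reads only part of the table.

```sas
options sqlgeneration=dbms msglevel=i;

/* OBS= data set option: per the limitation above, this step is
   expected to fall back to conventional processing. */
proc freq data=teralib.employee(obs=1000);
   table storeno;
run;

/* Same request without the data set option: eligible for
   in-database processing. */
proc freq data=teralib.employee;
   table storeno;
run;
```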
Efficient Use of In-database Processing
While in-database processing adds efficiency to our SAS programs by transferring the workload to the
database, there are still steps we can take to ensure we are writing the most efficient program possible.
Following common efficiency practices such as subsetting the data with the use of WHERE and limiting
the variables selected by using DROP and KEEP will improve the overall performance. Awareness of
what is happening in the database and what prevents in-database operations is important. It may be
necessary to consider procedure level adjustments to the program to fully utilize the technology.
Finally, it is important to recognize when in-database processing should not be used. When the
procedure works with a small volume of data, or the aggregation does little to reduce the data
volume, there is little to gain from in-database processing.
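The subsetting practices mentioned above might look like the following for the earlier PROC MEANS example. This is a sketch only: it assumes DIVISION is a character column with the values shown in the report, and that the KEEP= data set option does not itself prevent in-database processing, which should be confirmed in the log for your release.

```sas
options sqlgeneration=dbms msglevel=i;

/* KEEP= limits the columns requested; WHERE limits the rows.
   Assumption: DIVISION is character, with values such as 'C' and 'H'. */
proc means data=teralib.callcenter(keep=division sales) min max;
   where division in ('C', 'H');
   var sales;
   class division;
run;
```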
Conclusion
In-database processing is a new technology that can add efficiency to SAS programs by reducing the
amount of data transferred between a database and the SAS environment. Use of this technology
allows the programmer to use the power of SAS while leveraging the scalability and processing power of
the database platform.