Tips for Using SAS® to Manipulate Large-scale Data in Databases
Shih-Ching Wu, Virginia Tech Transportation Institute, Blacksburg, VA
Shane McLaughlin, Virginia Tech Transportation Institute, Blacksburg, VA
ABSTRACT
SAS programmers sometimes run into difficulty when transitioning from working with small databases to working
with large data sets. Two issues are common. The first is that the data set is too large to move around during
processing; in this case, the SAS log may show error messages such as “out-of-memory” or “fetch error.” The
second is that the SAS program runs but takes a very long time to finish, such as days, weeks, or longer. A
long-running program exposes the project to risks from network connection failures, database service
interruptions, and power outages, and if the results turn out to be in error, the user must restart an already long
process.
The first part of this paper provides a SAS Macro solution that breaks a large problem into smaller pieces, which
can resolve issues created by code that attempts to handle too much data at once. The second part of the paper
provides tips for monitoring, managing, and timing lengthy SAS programs. It demonstrates how to use the SAS
log to monitor an ongoing program, how to resume an interrupted SAS program, how to manage the progress and
results of a lengthy SAS program running on single or multiple machines against database systems, and how to
identify system bottlenecks and evaluate the performance of SAS programs and the overall computing
infrastructure. The paper provides examples using SAS with a PostgreSQL database system.
INTRODUCTION
The Virginia Tech Transportation Institute (VTTI) has been conducting several multi-terabyte collections of
naturalistic driving study (NDS) data. These studies collect continuous data from vehicles as they are driven by
research participants during their daily activities over one or two years. The variety and volume of this information
have made it an important source of guidance for transportation research. After data are collected in the field,
post-collection processing involves working with large databases, which creates programming challenges and
difficulties. These tasks include data quality checks, mapping GPS traces onto digital maps [1], summarizing trip
statistics, and reducing data into formats suitable for research analyses. To tackle issues brought on by extremely
large data sets, this paper first demonstrates the use of a SAS Macro program to automatically break a large
computational task into smaller pieces. Second, it introduces process management solutions to monitor and time
lengthy SAS programs running on single or multiple machines against databases, identify system bottlenecks,
and evaluate the performance of SAS programs and the overall computing infrastructure.
EXAMPLE OF A POST DATA COLLECTION PROCESS
Throughout this paper, a simplified example is used to illustrate the proposed ideas; readers can apply these
ideas to more complex computational tasks run on single or multiple machines. Consider the three database
tables “tracking_mile_traveled” (Figure 1a), “nds_data” (Figure 1b), and “nds_result” (Figure 1c), shown from left
to right in Figure 1, and an example task: calculating the distance traveled for each trip.
Figure 1. Example Data Sets: (a) tracking_mile_traveled, (b) nds_data, (c) nds_result
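Although Figure 1 shows only images of the tables, schemas consistent with the columns described below can be
sketched for readers who want to recreate the example. The following pass-through sketch is a rough guide: the
column types, and the “ts” and “speed_mph” column names in “nds_data,” are assumptions rather than details
from the paper.

proc sql;
   connect to odbc (dsn='PostgreSQL_Guidance');
   /* list of trips; file_id uniquely identifies a trip */
   execute (
      create table tracking_mile_traveled (
         file_id integer primary key
      )
   ) by odbc;
   /* time series data; ts and speed_mph are assumed names */
   execute (
      create table nds_data (
         file_id       integer,
         ts            timestamp,
         speed_mph     numeric,
         distance_feet numeric
      )
   ) by odbc;
   /* results of the two-step distance calculation */
   execute (
      create table nds_result (
         file_id     integer,
         tot_dist_ft numeric
      )
   ) by odbc;
   disconnect from odbc;
quit;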
Figure 1a, “tracking_mile_traveled,” contains the list of trips from an NDS study; “file_id” is a unique identifier for
each trip. Figure 1b, “nds_data,” contains the collected time series data, including time-stamped speed and
distance. The column “distance_feet” records the distance a vehicle traveled between the previous and current
timestamps. An example post data collection task is to calculate the total distance traveled for every file. The last
table, “nds_result” (Figure 1c), stores the results of this computational task. The following SAS sample code
(Figure 2) demonstrates how to automate this two-step distance calculation with the SAS PROC SQL procedure.
Figure 2. Sample Code for the Two-Step Distance Calculation
/* --- setup odbc connection --- */
libname dbconn2 odbc dsn='PostgreSQL_Guidance';

/* --- example of a post data collection process --- */
proc sql noprint;
   /* --- step 1: calculate distance traveled for every file --- */
   create table result_temp as
   select file_id, sum(distance_feet) as tot_dist_ft
   from dbconn2.nds_data
   group by file_id;

   /* --- step 2: save the results into a database table --- */
   insert into dbconn2.nds_result(file_id, tot_dist_ft)
   select file_id, tot_dist_ft
   from result_temp;
quit;
ISSUE OF LARGE DATA SIZE
In today’s data-centric business and research climate, it is inevitable that programmers will experience challenges
when manipulating large data sets. As an example, the NDS data for one study contain millions of trips, and the
number of records within a single measure ranges from five billion (GPS speed) to 60 billion (longitudinal
acceleration). If a SAS program attempts to query all GPS points at once, the process will fail. The following SAS
logs show error messages caused by attempts to process too much data against the database at once.
Figure 3. Errors from Querying Too Much Data from the Database
Figure 4. Errors from Inserting Too Much Data into the Database
A common programming practice to resolve the problems shown in Figures 3 and 4 is to process smaller pieces
of a data set. This divides one data process into several repetitive processes, which can be automated via SAS
Macro. In the previous two-step distance calculation example (Figure 2), if the whole data set is too large, both
steps will fail, and error messages similar to those in Figures 3 and 4 will appear in the SAS log. The issue can be
solved, however, if the SAS program processes one file at a time and repeats the process until all files are
complete. The following SAS Macro sample code (Figure 5) divides the entire data process up by file, loops
through each, and performs the two-step process.
Figure 5. SAS Macro Sample Code for Processing One File at a Time
dm "out;clear;log;clear;";
/* --- setup odbc connection --- */
%let dbconn = dsn ='PostgreSQL_Guidance'; * (A);
libname dbconn2 odbc dsn = 'PostgreSQL_Guidance'; * (A);
/* --- create macro variables --- */
proc sql noprint;
connect to odbc (&dbconn); * (A);
select count into: file_tot from connection to odbc /* (B) */
(
select count(*) as count
from tracking_mile_traveled;
);
select file_id into: file_id1-:file_id%sysfunc(trim(%sysfunc(left(&file_tot))))
from connection to odbc /* (C) */
(
select file_id
from tracking_mile_traveled;
);
disconnect from odbc;
quit;
/* --- SAS Macro ---*/
%macro large_data;
%do k = 1 %to &file_tot; %* (D);
%* task;
proc sql noprint;
insert into dbconn2.nds_result(file_id, tot_dist_ft)
select file_id, sum(distance_feet) as tot_dist_ft
from dbconn2.nds_data
where file_id = &&file_id&k %* (E);
group by file_id;
quit;
%end; %* (D);
%mend large_data;
%large_data /* (F) */
In the code example above, Part A lets SAS access data stored in the PostgreSQL database system [2] via an
ODBC connection [3]. Part B creates a macro variable, “file_tot,” which records the total number of files to be
processed. Part C dynamically assigns the “file_id” values to multiple macro variables sequentially named
“file_id1,” “file_id2,” “file_id3,” and so on. Parts B and C create the macro variables used by the SAS Macro
program that follows. Part D demonstrates a %DO loop that processes the data one file at a time; within each
iteration of the loop, Part E limits the two-step computational task to a single trip. Part F executes the SAS Macro
program. Figure 6 below shows a sample of the SAS macro variables created by the above sample code.
Figure 6. A Sample of SAS Macro Variables
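A quick way to confirm that Parts B and C populated the macro variables as expected is to write them to the SAS
log with the %PUT statement:

/* list all user-defined macro variables, e.g. file_tot, file_id1, file_id2, ... */
%put _user_;
/* or inspect a single value (assumes at least three files exist) */
%put file_id3 = &file_id3;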
With this dynamic programming technique, as the number of tasks increases, it becomes important to have a
surrounding process that facilitates monitoring of individual tasks or batches of tasks. The next section of this
paper describes these types of techniques.
ISSUE OF LONG PROCESS TIME
In addition to the capability of using SAS to handle extremely large data sets, the ability to manage lengthy SAS
programs is also crucial. Restarting an already long process carries a large penalty, so it is important to confirm
early that output is as expected, rather than waiting until the process ends. Also, if re-processing is needed, it is
best if programmers can easily identify what has been done and what needs to be redone. To reach these goals,
this section focuses on how to use the SAS log and database tables to track a long SAS program. First, writing
user-defined notes to the SAS log helps programmers track the progress of a long process; if errors occur, those
notes become useful clues for identifying where the program went wrong. Second, a database table created
specifically for tracking the status of SAS programs provides a convenient way to manage long processes. The
following tables are revised versions of the “tracking_mile_traveled” (Figure 1a) and “nds_result” (Figure 1c)
tables introduced earlier in this paper.
Figure 7. Additional Attributes for Tracking SAS Programs: (a) tracking_mile_traveled, (b) nds_result
In Figure 7a, “tracking_mile_traveled,” there are four additional attributes. “Process_order” records the sequence
of data processes. “Process_status” denotes whether a file is not yet processed, in process, or done. “Total_time”
records the time elapsed to complete the computation for each file. “Worker” records which machine performed
the processing; another common term for this is “agent.” In Figure 7b, “nds_result,” there is one additional
column, “write_time,” which records the date and time when results were inserted into the table. In both database
tables, the “process_order,” “process_status,” and “write_time” columns allow programmers to track the progress
of data processes. They are especially useful when errors are found in results, because they indicate which
partial data must be re-processed to correct bad results. These columns also capture the time when a SAS
program was interrupted by a network connection failure or another cause; in that case, “process_status” will
remain “in process” and never change to “done.” The “total_time” and “worker” columns help users understand
how efficiently the SAS code completes a task and how specific machines perform. These measures permit the
user to identify system bottlenecks, such as network limits, slow database queries, machine capacity, or memory
leaks. They can also reveal transient performance issues, such as a machine that completes tasks at one rate
after hours and another rate during the workday due to shared use.
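For reference, the following pass-through sketch shows one way to add these attributes to the existing
PostgreSQL tables. The column types, the default of 0 for “process_status” (not yet processed), and the now()
default for “write_time” are assumptions, not specifications from the paper.

proc sql;
   connect to odbc (&dbconn);
   /* tracking attributes for the trip list (types assumed) */
   execute (
      alter table tracking_mile_traveled
         add column process_order  integer,
         add column process_status integer default 0,
         add column total_time     numeric,
         add column worker         varchar(64)
   ) by odbc;
   /* stamp each result row with its insert time (default assumed) */
   execute (
      alter table nds_result
         add column write_time timestamp default now()
   ) by odbc;
   disconnect from odbc;
quit;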
The following SAS Macro sample code (Figures 8 and 9) extends the previous example (Figure 5) by
implementing these tracking techniques in the SAS log and the database tables.
Figure 8. Setup for the Tracking SAS Macro Program
proc printto
   log = 'U:\fac_staff\scwu\sas_tracking.log'; /* (G) */
run;

dm "out;clear;log;clear;";

/* --- setup odbc connection --- */
%let dbconn = dsn='PostgreSQL_Guidance';
libname dbconn2 odbc dsn='PostgreSQL_Guidance';

/* --- count total number of files not yet processed --- */
proc sql noprint;
   connect to odbc (&dbconn);
   select count into: file_tot from connection to odbc /* (H) */
   (
    select count(*) as count
    from tracking_mile_traveled
    where process_status = 0 /* (H) */
   );
   disconnect from odbc;
quit;

%put --- total number of files: %sysfunc(trim(%sysfunc(left(&file_tot)))) ---; /* (I) */
Figure 8 contains the settings that prepare the SAS Macro program to run the repetitive process file by file. Part G
saves the SAS log to a log file. Part H counts the number of files that have not yet been processed and
dynamically assigns the count to a macro variable, “file_tot,” which feeds the following SAS Macro program
(Figure 9) as the SAS DO loop counter. Part I writes a user note to the SAS log based on the information collected
in Part H, as shown in Figure 10.
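One detail worth noting about Part G: PROC PRINTTO keeps redirecting the log until it is reset, so once the long
run finishes, the default log destination can be restored with an empty PROC PRINTTO step:

/* restore the default SAS log destination */
proc printto;
run;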
Figure 9. SAS Macro Sample Code with Tracking Techniques
* --- SAS Macro ---;
%macro large_data;
   %do k = 1 %to &file_tot;

      %* randomly pick a file_id for processing;
      proc sql noprint;
         connect to odbc(&dbconn);
         select file_id into: file_id
         from connection to odbc
         (
          select * from tracking_mile_traveled
          where process_status = 0
          order by random() /* (J) */
          limit 1
         );
         disconnect from odbc;
      quit;

      %* write the current iteration and file to the SAS log;
      %put --- iteration: &k processing file_id:
           %sysfunc(trim(%sysfunc(left(&file_id)))) ---; /* (K) */

      %* assign 1 to "process_status" - in process;
      proc sql noprint;
         update dbconn2.tracking_mile_traveled
         set process_order = &k, process_status = 1,
             worker = "&SYSHOSTNAME" /* (L) */
         where file_id = &file_id;
      quit;

      %* record start time of process;
      %let t_start=%sysfunc(time(), time11.2); /* (M) */

      %* computational task;
      proc sql noprint;
         insert into dbconn2.nds_result(file_id, tot_dist_ft)
         select file_id, sum(distance_feet) as tot_dist_ft
         from dbconn2.nds_data
         where file_id = &file_id
         group by file_id;
      quit;

      %* record end time of process;
      %let t_end=%sysfunc(time(), time11.2); /* (M) */

      %* calculate time elapsed for processing one file;
      data _null_;
         total_time = "&t_end"t - "&t_start"t; /* (N) */
         call symputx('total_time', total_time);
      run;

      %* assign 2 to "process_status" - done;
      proc sql noprint;
         update dbconn2.tracking_mile_traveled
         set total_time = &total_time, process_status = 2 /* (O) */
         where file_id = &file_id;
      quit;

   %end;
%mend large_data;
%large_data
Inside the SAS Macro program (Figure 9), Part J randomly selects a “file_id” to operate on in each iteration of the
process. This PROC SQL step can be modified to prioritize processing; for example, the selection of “file_id”
could start with the files that need to be processed first, such as files from a specific research participant. Once a
file is picked, Part K writes the file information to the SAS log, which separates each iteration of the process and
makes the log more readable (Figure 10). Part L changes the process status attribute to “in process” and records
the machine information. Part M records the start and end times of the task, and based on that information, Part N
calculates the time required for SAS to perform the computation for one file. Last, Part O changes the process
status to “done” for the file, which excludes the file from being picked again in the first step of the SAS Macro
program.
Figure 10. Sample of the SAS Log
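As an illustration of the prioritization mentioned above, the query in Part J could order the candidate files by a
priority column before randomizing. The “priority_flag” column below is hypothetical and would need to be added
to “tracking_mile_traveled”; the rest of the sketch follows the pattern of Part J.

proc sql noprint;
   connect to odbc(&dbconn);
   select file_id into: file_id
   from connection to odbc
   (
    select file_id
    from tracking_mile_traveled
    where process_status = 0
    order by priority_flag desc, random() /* hypothetical priority column */
    limit 1
   );
   disconnect from odbc;
quit;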
TRACKING PROGRESS OF A SAS PROGRAM
Once the SAS program starts running, the SAS log and the database tables record the progress of the process.
The following PROC SQL procedure provides an example of how to track the progress and results of the SAS
program.
Figure 11. Sample Code for Tracking Progress and System Performance
proc sql;
   connect to odbc (&dbconn);

   title 'Status of Process';
   select * from connection to odbc /* (P) */
   (
    select process_status, count(file_id) as count_num_of_files
    from tracking_mile_traveled
    group by process_status
   );

   title 'System Performance';
   select * from connection to odbc /* (Q) */
   (
    select worker, count(file_id), avg(total_time) as avg_process_time_per_file
    from tracking_mile_traveled
    where process_status <> 0
    group by worker
   );

   disconnect from odbc;
quit;
In Figure 11, Part P counts the files that have not yet been processed (process_status = 0), are in process
(process_status = 1), and are done (process_status = 2). Part Q reports the number of files processed and the
average time SAS needed to finish the calculation task on each machine. As discussed previously, this
information helps users evaluate the efficiency of both their code and the overall computing infrastructure. The
SAS output is shown in Figure 12.
Figure 12. Tracking SAS Programs
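If the status report in Figure 12 shows files stuck at “in process” after an interruption, one simple recovery
approach is to reset those rows so that the SAS Macro program in Figure 9 picks them up again. This is a minimal
sketch and should be run only when no workers are active, since it would also reset files that are legitimately still
being processed.

proc sql noprint;
   /* make interrupted files (status 1) eligible for re-processing */
   update dbconn2.tracking_mile_traveled
   set process_status = 0
   where process_status = 1;
quit;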
Other sub-measures, such as query time or insert time as a ratio of file size, can also be incorporated easily. In
Figure 13, Part R queries the results inserted into the database during a specific time period, together with their
corresponding tracking information. If a SAS program was interrupted during that period, this information gives
programmers a starting point for checking which “file_id” values potentially have problems. Similarly,
“process_order” can be another indicator for checking which batch of processes went wrong. The SAS output is
shown in Figure 14.
Figure 13. Sample Code for Querying Results from a Specific Time Period
proc sql;
   connect to odbc (&dbconn);
   title 'Results Generated During Specific Time Period';
   select * from connection to odbc /* (R) */
   (
    select ta.*, tb.tot_dist_ft, tb.write_time
    from tracking_mile_traveled as ta
    left join nds_result as tb on (ta.file_id = tb.file_id)
    where tb.write_time >= '2012-09-06 10:00:00'
      and tb.write_time <= '2012-09-06 15:15:00'
   );
   disconnect from odbc;
quit;
Figure 14. Tracking SAS Programs
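Similarly, if “process_order” points to a suspect batch, the Part R query can be adapted to pull that batch’s
tracking rows and results; the order range of 101 to 200 below is purely illustrative.

proc sql;
   connect to odbc (&dbconn);
   title 'Results from a Suspect Batch';
   select * from connection to odbc
   (
    select ta.*, tb.tot_dist_ft, tb.write_time
    from tracking_mile_traveled as ta
    left join nds_result as tb on (ta.file_id = tb.file_id)
    where ta.process_order between 101 and 200
   );
   disconnect from odbc;
quit;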
CONCLUSIONS
Analyzing large data sets has its challenges, particularly when transitioning from a desktop analysis paradigm.
However, some simple processing and monitoring strategies can make the work manageable. With SAS Macro,
programmers can easily split a large data set into multiple smaller pieces and automate the repetitive processing.
Using a tracking database table and writing user notes to the SAS log helps programmers track the progress of
lengthy programs, identify bottlenecks and problematic infrastructure, and identify and recover from problems
early.
REFERENCES
1. Wu, Shih-Ching and McLaughlin, Shane. 2012. “Creating a Heatmap Visualization of 150 Million GPS Points
   on Roadway Maps via SAS®.” The SouthEast SAS Users Group (SESUG) 2012 Conference.
2. PostgreSQL. http://www.postgresql.org/.
3. SAS Institute Inc. 2011. SAS/ACCESS® 9.3 for Relational Databases: Reference. Cary, NC: SAS Institute Inc.
   http://support.sas.com/documentation/cdl/en/acreldb/63144/PDF/default/acreldb.pdf.
ACKNOWLEDGMENTS
The authors would like to acknowledge the research participants and external contributors who made this work
possible. The methods described in this paper were developed with funds from the National Surface
Transportation Safety Center for Excellence.
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS
Institute Inc. in the USA and other countries. ® indicates USA registration.
Other brand and product names are registered trademarks or trademarks of their respective companies.
CONTACT INFORMATION
Your comments and questions are valued and encouraged. Contact the authors at:
Shih-Ching Wu
Virginia Tech Transportation Institute
3500 Transportation Research Plaza
Blacksburg, VA, 24061
540-231-1091
[email protected]
http://www.vtti.vt.edu
Shane McLaughlin
Virginia Tech Transportation Institute
3500 Transportation Research Plaza
Blacksburg, VA, 24061
540-231-1077
[email protected]
http://www.vtti.vt.edu