NESUG 2012

Tips for Using SAS® to Manipulate Large-scale Data in Databases

Shih-Ching Wu, Virginia Tech Transportation Institute, Blacksburg, VA
Shane McLaughlin, Virginia Tech Transportation Institute, Blacksburg, VA

ABSTRACT

SAS programmers sometimes run into difficulty when transitioning from working with small databases to working with large data sets. Two issues commonly arise. The first is that the data set is too large to move around during processing; in this case, the SAS log might show error messages such as “out-of-memory” or “fetch error.” The second arises when the SAS program will run but takes a very long time to finish, such as days, weeks, or longer. This situation is problematic because it exposes the project to risks from network connection failures, database service interruptions, or power outages, and if the results turn out to be in error, the user must start an already long process over. The first part of this paper provides a SAS Macro solution that breaks a large problem down into smaller pieces, which can resolve issues created by code that attempts to handle too much data at once. The second part provides tips to monitor, manage, and time lengthy SAS programs. It demonstrates how to use the SAS log to monitor an ongoing program, how to resume an interrupted SAS program, how to manage the progress and results of a lengthy SAS program run on single or multiple machines against database systems, and how to identify system bottlenecks and evaluate the performance of SAS programs and the overall computing infrastructure. The paper provides examples using SAS with a PostgreSQL database system.

INTRODUCTION

The Virginia Tech Transportation Institute (VTTI) has been conducting several multi-terabyte collections of naturalistic driving study (NDS) data. These studies collect continuous data from vehicles as they are driven by research participants in their daily activities over one or two years.
The variety of information and large amount of data have become an important source of guidance for transportation research. After collecting data from the field, post-data-collection processes involve working with large databases, which creates programming challenges and difficulties. These tasks include data quality checks, mapping GPS traces onto digital maps [1], summarizing trip statistics, and reducing data into a format suitable for research analyses. To tackle issues brought on by extremely large data sets, this paper first demonstrates use of a SAS Macro program to automatically break a large computational task down into smaller pieces. Second, it introduces process management solutions to monitor and time lengthy SAS programs running on single or multiple machines against databases, identify system bottlenecks, and evaluate the performance of SAS programs and the overall computing infrastructure.

EXAMPLE OF A POST DATA COLLECTION PROCESS

Throughout this paper, a simplified example is used to facilitate description of the proposed ideas; readers can apply the same ideas to more complex computational tasks run on single or multiple machines. Consider the following three database tables, “tracking_mile_traveled” (Figure 1a), “nds_data” (Figure 1b), and “nds_result” (Figure 1c), and an example task: calculating the distance traveled for each trip.

Figure 1. Example Data Sets

Figure 1a, “tracking_mile_traveled,” contains a list of trips from an NDS study; “file_id” is a unique identifier for each trip. Figure 1b, “nds_data,” contains collected time series data, including time-stamped speed and distance data. The column “distance_feet” records the distance a vehicle traveled between the previous and current timestamps. An example post-data-collection task is to calculate the total distance traveled for all files.
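The paper does not give the tables’ definitions. For readers who want to reproduce the example, the three tables might be created in PostgreSQL roughly as sketched below, using the same SQL pass-through facility the paper relies on. Only “file_id,” “distance_feet,” and “tot_dist_ft” are named in the text; every other column name and all data types here are assumptions for illustration.

```sas
/* Hypothetical DDL executed through SQL pass-through; column types
   (and the ts/speed column names) are assumptions, not from the paper. */
proc sql;
  connect to odbc (dsn='PostgreSQL_Guidance');
  execute (
    create table tracking_mile_traveled ( file_id integer primary key )
  ) by odbc;
  execute (
    create table nds_data (
      file_id       integer,           /* trip the sample belongs to      */
      ts            timestamp,         /* assumed name for the timestamp  */
      speed         double precision,
      distance_feet double precision ) /* distance since prior timestamp  */
  ) by odbc;
  execute (
    create table nds_result (
      file_id     integer,
      tot_dist_ft double precision )   /* total distance for the trip     */
  ) by odbc;
  disconnect from odbc;
quit;
```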
The last table, “nds_result” (Figure 1c), stores the results of this computational task. The following SAS sample code demonstrates how to automate this two-step distance calculation via the SAS PROC SQL procedure.

Fig. 2.0

    /* --- setup odbc connection --- */
    libname dbconn2 odbc dsn = 'PostgreSQL_Guidance';

    /* --- example of a post data collection process --- */
    proc sql noprint;
      /* --- step 1 --- */
      /* --- calculate distance traveled for all files --- */
      create table result_temp as
        select file_id, sum(distance_feet) as tot_dist_ft
        from dbconn2.nds_data
        group by file_id;

      /* --- step 2 --- */
      /* --- save previous computational results into a database table --- */
      insert into dbconn2.nds_result(file_id, tot_dist_ft)
        select file_id, tot_dist_ft
        from result_temp;
    quit;

ISSUE OF LARGE DATA SIZE

In today’s data-centric business and research climate, it is inevitable that programmers will face challenges when manipulating large data sets. As an example, the NDS data for one study contains millions of trips, with record counts within a single measure ranging from five billion (GPS speed) to 60 billion (longitudinal acceleration). If a SAS program attempts to query all GPS points, the process will fail. The following SAS logs show error messages produced by attempts to process too much data against the databases at once.

Figure 3. Query Too Large Amount of Data from Databases

Figure 4. Insert Too Large Amount of Data into Databases

A common programming practice to resolve the problems shown in Figures 3 and 4 is to process smaller pieces of the data set. This divides one data process into several repetitive processes, which can be automated via SAS Macro. In the previous two-step distance calculation example (Figure 2), if the whole data set is too large, both steps will fail, and error messages similar to those in Figures 3 and 4 will appear on the SAS log.
However, this issue can be solved if a SAS program processes one file at a time and repeats the process until all files are complete. The following SAS Macro sample code divides the entire data process up by file, loops through each file, and performs the two-step process.

Fig. 5.0

    dm "out;clear;log;clear;";

    /* --- setup odbc connection --- */
    %let dbconn = dsn='PostgreSQL_Guidance';                  * (A);
    libname dbconn2 odbc dsn='PostgreSQL_Guidance';           * (A);

    /* --- create macro variables --- */
    proc sql noprint;
      connect to odbc (&dbconn);                              * (A);
      select count into: file_tot from connection to odbc     /* (B) */
        ( select count(*) as count from tracking_mile_traveled );
      select file_id into: file_id1-:file_id%sysfunc(trim(%sysfunc(left(&file_tot))))
        from connection to odbc                               /* (C) */
        ( select file_id from tracking_mile_traveled );
      disconnect from odbc;
    quit;

    /* --- SAS Macro --- */
    %macro large_data;
      %do k = 1 %to &file_tot;                                %* (D);
        %* task;
        proc sql noprint;
          insert into dbconn2.nds_result(file_id, tot_dist_ft)
            select file_id, sum(distance_feet) as tot_dist_ft
            from dbconn2.nds_data
            where file_id = &&file_id&k                       %* (E);
            group by file_id;
        quit;
      %end;                                                   %* (D);
    %mend large_data;

    %large_data                                               /* (F) */

In the above code example, Part A lets SAS access data stored in the PostgreSQL database system [2] via an ODBC connection [3]. Part B creates a macro variable, “file_tot,” which records the total number of files to be processed. Part C dynamically assigns “file_id” values to multiple macro variables sequentially named “file_id1,” “file_id2,” “file_id3,” and so on. Parts B and C create macro variables in preparation for the SAS Macro program that follows. Part D uses a %DO loop to step through the files one at a time. Within each iteration of the loop, Part E limits the two-step computational task to a single trip. Part F executes the SAS Macro program.
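A note on Part E: the indirect reference “&&file_id&k” is resolved by the macro processor in two passes. On the first pass, “&&” collapses to “&” while “&k” resolves to the iteration number, yielding a token such as “&file_id1”; on the second pass, that token resolves to the stored value. A minimal sketch with made-up values illustrates the behavior:

```sas
%let file_id1 = 1001;   %* hypothetical value, as if assigned by Part C;
%let k = 1;
%put &&file_id&k;       %* pass 1: && -> & and &k -> 1, giving &file_id1;
                        %* pass 2: &file_id1 -> 1001, so the log shows 1001;
```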
Figure 6 below shows a sample of the SAS macro variables created by the above sample code.

Figure 6. A Sample of SAS Macro Variables

With this dynamic programming technique, as the number of tasks increases it becomes important to have a surrounding process that facilitates monitoring of individual tasks or batches of tasks. The next section of this paper describes these techniques.

ISSUE OF LONG PROCESS TIME

In addition to the capability of using SAS to handle extremely large data sets, the ability to manage lengthy SAS programs is also crucial. Restarting an already long process carries a large penalty, so it is important to ensure output is as expected at the beginning, rather than waiting until the process ends. Also, if re-processing is needed, it is best if programmers can easily identify what has been done and what needs to be redone. To reach these goals, this section focuses on how to use the SAS log and database tables to track a long SAS program. First, writing user-developed notes to the SAS log helps programmers track the progress of a long process; if errors occur, those notes are useful clues for identifying where the SAS program went wrong. Also, creating a database table specifically for tracking the status of SAS programs provides a convenient way to manage long processes. The following tables are revised versions of the “tracking_mile_traveled” (Figure 1a) and “nds_result” (Figure 1c) tables introduced previously in this paper.

Figure 7. Additional Attributes for Tracking SAS Programs

In Figure 7a, “tracking_mile_traveled,” there are four additional attributes. “Process_order” records the sequence of data processes. “Process_status” denotes whether a file is not yet processed, in process, or done. “Total_time” records the time elapsed to complete the computation for each file. “Worker” records which machine performed the processing.
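The extra attributes in Figure 7a could be added to the existing table with DDL along the following lines. The same status codes also suggest a simple recovery step for the “resume an interrupted program” scenario mentioned in the abstract: resetting any row left “in process” so it will be picked up again. The column types here are assumptions; the status codes 0/1/2 match their usage later in the paper.

```sas
/* Hypothetical DDL and a recovery step; column types are assumptions. */
proc sql;
  connect to odbc (dsn='PostgreSQL_Guidance');
  execute (
    alter table tracking_mile_traveled
      add column process_order  integer,
      add column process_status integer default 0,
      add column total_time     double precision,
      add column worker         varchar(64)
  ) by odbc;
  /* After an interruption, rows stuck at process_status = 1 ("in
     process") are reset to 0 so the macro will reprocess them. */
  execute (
    update tracking_mile_traveled
      set process_status = 0
      where process_status = 1
  ) by odbc;
  disconnect from odbc;
quit;
```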
Another common term for “worker” is “agent.” In Figure 7b, “nds_result,” there is one additional column, “write_time,” which records the date and time when results were inserted into the database table. Across the two tables, the “process_order,” “process_status,” and “write_time” columns allow programmers to track the progress of data processes. They are especially useful when errors are found in results, providing guidance on where to re-process partial data to correct bad results. The time when a SAS program is interrupted, due to network connection failures or other reasons, is also captured by these three columns: in that case, “process_status” will indicate “in process” and never change to “done.” The “total_time” and “worker” columns help users understand both the efficiency of the SAS code and the performance of specific machines. These measures permit the user to identify system bottlenecks, such as the network, database queries, machine capacity, or memory leaks. They can also reveal transient performance issues, such as a machine that completes tasks at one rate after hours and another rate during the workday due to shared use. The following SAS Macro sample code (Figures 8 and 9) extends the previous example (Figure 5) with tracking implemented in the SAS log and database tables.

Fig. 8.0

    proc printto log = 'U:\fac_staff\scwu\sas_tracking.log';  /* (G) */
    run;

    dm "out;clear;log;clear;";

    /* --- setup odbc connection --- */
    %let dbconn = dsn='PostgreSQL_Guidance';
    libname dbconn2 odbc dsn='PostgreSQL_Guidance';

    /* --- count total number of files --- */
    proc sql noprint;
      connect to odbc (&dbconn);
      select count into: file_tot from connection to odbc     /* (H) */
        ( select count(*) as count from tracking_mile_traveled
          where process_status = 0 );                         /* (H) */
      disconnect from odbc;
    quit;

    %put --- total number of files: %sysfunc(trim(%sysfunc(left(&file_tot)))) ---;  /* (I) */

Figure 8 contains the setup needed before using the SAS Macro program to run repetitive processes file by file. Part G saves the SAS log to a log file. Part H counts the files that have not yet been processed and dynamically assigns the count to a macro variable, “file_tot,” which feeds the SAS Macro program (Figure 9) as a %DO loop counter. Part I writes a user note to the SAS log based on the information collected in Part H, as shown in Figure 10.

Fig. 9.0

    * --- SAS Macro ---;
    %macro large_data;
      %do k = 1 %to &file_tot;

        %* randomly pick a file_id for processing;
        proc sql noprint;
          connect to odbc(&dbconn);
          select file_id into: file_id from connection to odbc
            ( select * from tracking_mile_traveled
              where process_status = 0
              order by random()                               /* (J) */
              limit 1 );
          disconnect from odbc;
        quit;

        %* write to SAS log;
        %put --- iteration: &k processing file_id: %sysfunc(trim(%sysfunc(left(&file_id)))) ---;  /* (K) */

        %* assign 1 to "process_status" - in process;
        proc sql noprint;
          update dbconn2.tracking_mile_traveled
            set process_order = &k,
                process_status = 1,
                worker = "&SYSHOSTNAME"                       /* (L) */
            where file_id = &file_id;
        quit;

        %* record start time of process;
        %let t_start = %sysfunc(time(), time11.2);            /* (M) */

        %* computational task;
        proc sql noprint;
          insert into dbconn2.nds_result(file_id, tot_dist_ft)
            select file_id, sum(distance_feet) as tot_dist_ft
            from dbconn2.nds_data
            where file_id = &file_id
            group by file_id;
        quit;

        %* record end time of process;
        %let t_end = %sysfunc(time(), time11.2);              /* (M) */

        %* calculate time elapsed for processing one file;
        data _null_;
          total_time = "&t_end"t - "&t_start"t;               /* (N) */
          call symputx('total_time', total_time);
        run;

        %* assign 2 to "process_status" - done;
        proc sql noprint;
          update dbconn2.tracking_mile_traveled
            set total_time = &total_time,
                process_status = 2                            /* (O) */
            where file_id = &file_id;
        quit;

      %end;
    %mend large_data;

    %large_data

Inside the SAS Macro program (Figure 9), Part J randomly selects a “file_id” to operate on in each iteration of the process. This PROC SQL step can be modified to prioritize processing; for example, selection of “file_id” could start with files that need to be processed first, such as files from a specific research participant. Once a file is picked, Part K writes the file information to the SAS log, which separates each iteration of the process and gives the SAS log a more readable format (Figure 10).
Part L changes the status attribute to “in process” and records the machine information. Part M records the start and end times of the task. Based on the information from Part M, Part N calculates the time required for SAS to perform the computation for one file. Last, Part O changes the file’s status to “done,” which excludes the file from being picked again in the first step of the SAS Macro program.

Figure 10. Sample of the SAS Log

TRACKING PROGRESS OF A SAS PROGRAM

Once the SAS program starts running, the SAS log and database tables record the progress of the process. The following PROC SQL procedure provides an example of how to track the progress and results of the SAS program.

Fig. 11.0

    proc sql;
      connect to odbc (&dbconn);

      title 'Status of Process';
      select * from connection to odbc                        /* (P) */
        ( select process_status, count(file_id) as count_num_of_files
          from tracking_mile_traveled
          group by process_status );

      title 'System Performance';
      select * from connection to odbc                        /* (Q) */
        ( select worker, count(file_id),
                 avg(total_time) as avg_process_time_per_file
          from tracking_mile_traveled
          where process_status <> 0
          group by worker );

      disconnect from odbc;
    quit;

In Figure 11, Part P counts the files that are not yet processed (process_status = 0), in process (process_status = 1), and done (process_status = 2). Part Q checks the number of files processed and the average time for SAS to finish the calculation task on each machine. As discussed previously, this information helps users evaluate the efficiency of both their code and the overall computing infrastructure. SAS output is shown in Figure 12.

Figure 12. Tracking SAS Programs

Other sub-measures, such as query time or insert time as a ratio of file size, can also be easily incorporated. In Figure 13, Part R queries results inserted into the database during a specific time period, along with their corresponding tracking information.
If a SAS program was interrupted during a time period, this information provides a starting point for programmers to check which “file_id” results potentially have problems. Similarly, “process_order” can be another indicator for identifying which batch of processes went wrong. SAS output is shown in Figure 14.

Fig. 13.0

    proc sql;
      connect to odbc (&dbconn);

      title 'Results Generated During Specific Time Period';
      select * from connection to odbc                        /* (R) */
        ( select ta.*, tb.tot_dist_ft, tb.write_time
          from tracking_mile_traveled as ta
          left join nds_result as tb on (ta.file_id = tb.file_id)
          where tb.write_time >= '2012-09-06 10:00:00'
            and tb.write_time <= '2012-09-06 15:15:00' );

      disconnect from odbc;
    quit;

Figure 14. Tracking SAS Programs

CONCLUSIONS

Analyzing large data sets has its challenges, particularly while transitioning from a desktop analysis paradigm. However, some simple processing and monitoring strategies can make the work manageable. With SAS Macro, programmers can easily split a large data set into multiple smaller pieces and code repetitive processes. Using a tracking database table and writing user notes to the SAS log helps programmers track the progress of lengthy programs, identify bottlenecks and problematic infrastructure, and identify and recover from problems early.

REFERENCES

1. Wu, Shih-Ching and McLaughlin, Shane. 2012. “Creating a Heatmap Visualization of 150 Million GPS Points on Roadway Maps via SAS®.” SouthEast SAS Users Group (SESUG) 2012 Conference.
2. PostgreSQL. http://www.postgresql.org/.
3. SAS Institute. 2011. SAS/ACCESS® 9.3 for Relational Databases: Reference. Cary, NC: SAS Institute Inc. http://support.sas.com/documentation/cdl/en/acreldb/63144/PDF/default/acreldb.pdf.

ACKNOWLEDGMENTS

The authors would like to acknowledge the research participants and external contributions that made this work possible.
The methods described in this paper were developed with funds from the National Surface Transportation Safety Center for Excellence.

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are registered trademarks or trademarks of their respective companies.

CONTACT INFORMATION

Your comments and questions are valued and encouraged. Contact the authors at:

Shih-Ching Wu
Virginia Tech Transportation Institute
3500 Transportation Research Plaza
Blacksburg, VA 24061
540-231-1091
[email protected]
http://www.vtti.vt.edu

Shane McLaughlin
Virginia Tech Transportation Institute
3500 Transportation Research Plaza
Blacksburg, VA 24061
540-231-1077
[email protected]
http://www.vtti.vt.edu