Optimization Techniques for Processing Large Files in an OpenVMS Environment

Adrien Ndikumwami, Regional Economic Studies Institute, Towson, MD

ABSTRACT

Despite the recent technological advances in the computer industry, both in terms of enhanced processing power and larger storage devices, the demand for these resources is increasing considerably. Unlike in the past, large data files spanning several gigabytes are now routinely processed to generate regular and ad hoc reports. Such regular use of large files is forcing programmers to look for techniques that make optimal use of scarce computer resources. While the SAS® System provides advanced capabilities to handle and manipulate large files, an appreciation of the host operating system is essential for using scarce computer resources efficiently. For large data files, the way data are read, processed, and stored can make a great deal of difference in resource utilization. This paper examines the issues that a SAS programmer needs to address while processing large files in an OpenVMS environment. Optimization techniques related to creating datasets, sorting, indexing, compressing, running jobs in batch mode, and certain other host-specific issues are explored as they apply to an OpenVMS environment.

I. INTRODUCTION

It seems paradoxical that despite the availability of higher processing power and larger storage space, there is an ever-growing need for these resources. The reality is that, unlike in the past, large data files spanning several gigabytes are now routinely processed to generate regular and ad hoc reports. Until recently, most of these operational data files could not be used on a regular basis due to computing resource constraints. The proliferation of large data files and the increasing need to access them regularly for ad hoc reporting, micro-simulation, trend analysis, and data mining poses new challenges to the programmer. Programmers constantly look for alternative methods to make optimal use of existing computer resources.

While the SAS System provides advanced capabilities to handle and manipulate large files, an appreciation of the host operating system by the programmer is essential to efficiently manage scarce computer resources. While some of the techniques discussed in this paper are applicable to all operating systems, the focus of this paper is on processing large data files in an OpenVMS environment. Optimization techniques that are relevant to most operating systems are explored first, along with their OpenVMS-specific features. In the later sections, techniques that are either unique or more relevant to an OpenVMS system are discussed. Wherever appropriate, examples are provided to demonstrate the use of these techniques. Costs and benefits with regard to the use of computer resources (I/O, disk storage, CPU, and memory) are also discussed for each optimization technique.

II. OPTIMIZATION TECHNIQUES THAT ARE RELEVANT TO MOST OPERATING SYSTEMS

1. Reducing Dataset Observation Length

Reducing the observation length of a dataset reduces I/O and thereby shortens the elapsed time, since the amount of input data decreases. Several methods help decrease the length of an observation. The more popular techniques are the DROP= and KEEP= dataset options, the LENGTH statement, and the data compression feature.

a. KEEP= and DROP= Dataset Options

When creating a dataset, you should keep only the variables needed for further or future processing. You can accomplish this with the KEEP= or DROP= dataset options. By reducing the number of variables in a large dataset, you reduce the amount of data to be read or written, which results in a significant reduction in I/O and elapsed time. This technique is beneficial when the variables being dropped are no longer needed or can be recreated from the remaining variables. If you are unsure whether to retain or drop certain variables, keep them: the cost of recreating dropped variables usually far exceeds the cost of retaining them. Assuming dataset ONE has variables A, B, C, X, Y, and Z, the following examples demonstrate the use of the DROP= and KEEP= options:

DATA TWO;
  SET ONE (DROP=C Z);
RUN;

or

DATA TWO;
  SET ONE (KEEP=A B X Y);
RUN;

b. The LENGTH Statement

A LENGTH statement may be used in a DATA step to reduce the number of bytes used for storage. The default length of numeric variables in the SAS System is 8 bytes. Often, disk storage is wasted because a smaller length could store the same value that is stored in 8 bytes without compromising precision. See Table 1 for the largest integer that can be represented exactly at each length. Using the LENGTH statement can significantly reduce storage space, although CPU time may increase slightly. This feature comes in handy when a dataset contains numeric variables with small values. The LENGTH statement is not recommended for storing fractional numeric values, because precision may be lost.
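The values in Table 1 are consistent with a simple pattern: a numeric variable of length L bytes represents every integer up to 2**(8*L - 11) - 1 exactly. A quick cross-check of the table's entries (plain Python, shown for illustration only):

```python
# Cross-check of Table 1: largest integer stored exactly by a SAS numeric
# variable of a given length on AXP OpenVMS, following the pattern
# 2**(8*L - 11) - 1 for lengths L = 3..8.
def largest_exact_integer(length_bytes):
    return 2 ** (8 * length_bytes - 11) - 1

for length in range(3, 9):
    print(f"{length} bytes -> {largest_exact_integer(length):,}")
```

For length 3 this gives 8,191 and for length 8 it gives 9,007,199,254,740,991, matching Table 1.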
Table 1: Largest integer that can be stored exactly by a SAS numeric variable on AXP OpenVMS, by variable length

Length (bytes)   Largest integer represented exactly
3                8,191
4                2,097,151
5                536,870,911
6                137,438,953,471
7                35,184,372,088,831
8                9,007,199,254,740,991

The following is an example of how to use the LENGTH statement:

DATA ONE;
  LENGTH X Y Z 4;
  INPUT X Y Z;
RUN;

c. Dataset Compression

The SAS System is equipped with a powerful compression algorithm. It treats an entire observation as a string of bytes, ignoring variable types and boundaries. Data compression is most helpful when a dataset has repeating numeric or character data. Compression reduces I/O but uses more CPU time; in addition, disk storage space is reduced by the compression factor. If your site does not have many users, does not run other CPU-intensive applications, or does not charge for CPU time, then compression is an ideal technique. Over the last decade, the speed of OpenVMS processors has increased by a factor of more than 20, while disk I/O systems have merely doubled in speed; shifting pressure from the I/O subsystem to the CPU is therefore a sensible trade. However, if the dataset does not have many repeating values, you should avoid compressing it, because performance may get worse. Under certain circumstances, SAS compression may actually increase the size of a large dataset; in that case I/O, CPU time, and elapsed time all increase. You can enable compression globally, as in the following example:

OPTIONS COMPRESS=YES;

Alternatively, you can use it as a dataset option to compress a particular dataset, as in the following example:

LIBNAME X '[ ]';
DATA X.BIGDATA (COMPRESS=YES);
  INPUT VAR1-VAR100 $200.;
RUN;

2. Sorting

Sorting may consume a considerable amount of computing resources; in many cases a sort fails because resources have not been allocated properly. For large datasets, it is advisable to use the host sort. Sorting requires work space of about 2 to 3 times the size of the input dataset. By default, sorting on AXP OpenVMS is routed to SYS$SCRATCH, which in turn points to SYS$LOGIN. Typically, SYS$LOGIN resides on a quota-enabled disk that may not have enough space to sort a large dataset. If your login disk has a small quota and you need to process a large file, the best approach is to point SYS$SCRATCH to a disk with more space. The following command defines another sort area and can be included in LOGIN.COM:

$ DEFINE/PROCESS SYS$SCRATCH DKA102:[SORTAREA]

Starting with AXP OpenVMS maintenance release TS048 of the SAS System, a new SORTWORK option directs the SAS System to create up to six SORTWORK logicals to be used by the OpenVMS host sort for temporary work space. The SORTWORK option is used as follows:

LIBNAME SWORK1 DKB2:[SORTAREA1];
LIBNAME SWORK2 DKB2:[SORTAREA1];

OPTIONS SORTWORK=(SWORK1, SWORK2);

The SORTWORK option accepts either SAS librefs assigned in a LIBNAME statement or OpenVMS paths, which must be enclosed in single or double quotes. To prevent SAS from using the defined sort areas, issue the following statement prior to sorting:

OPTIONS NOSORTWORK;

The SAS System is shipped with two command procedures, SORTCHK.COM and SORTSIZE.COM, located in the SAS$ROOT:[USAGE] directory. These procedures can help in gathering information about the system; their usage is explained at the beginning of each file.

3. Indexing

Indexing can enhance performance when a large dataset will be read many times using a WHERE clause or BY group. When an index exists, an observation can be read directly. Without an index, the SAS System starts at the top of the dataset, reads all observations sequentially, and only then applies the WHERE clause or BY-group statement. Avoid indexing a dataset that is constantly rewritten or updated, or one that is usually read in its entirety: when a dataset is read without subsetting, indexing consumes more resources than sequential reading. However, if you are subsetting with a WHERE clause or BY-group statement, both I/O and elapsed time are reduced.
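The difference between an indexed read and a sequential scan can be sketched abstractly: a sorted key list (a stand-in for a simple index) lets a matching observation be located in O(log n) probes instead of touching every observation and filtering afterwards. A minimal sketch with hypothetical keys and records — this illustrates the principle only, not SAS internals:

```python
import bisect

# Hypothetical sorted index keys and their records; a simple SAS index
# similarly maps key values to observation locations.
keys = [1001, 1007, 1013, 1031, 1045, 1077]
records = {1001: "A", 1007: "B", 1013: "C", 1031: "D", 1045: "E", 1077: "F"}

def indexed_lookup(key):
    """Binary search on the sorted keys: O(log n) probes."""
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return records[key]
    return None

def sequential_lookup(key):
    """What a scan without an index does: read every observation, then filter."""
    for k in keys:
        if k == key:
            return records[k]
    return None
```

Both functions return the same record; the indexed version simply touches far fewer entries as the dataset grows.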
The following is an example of indexing:

LIBNAME EMP '[ ]';
PROC DATASETS LIBRARY=EMP;
  MODIFY EMPLOYEE;
  INDEX CREATE SSNO;
  INDEX CREATE BTHDATE;
RUN;

The above code creates two simple indexes on the EMPLOYEE dataset, one on the social security number and one on the birth date.

4. Dataset Space Preallocation

Starting with the second 6.09 maintenance release for OpenVMS on AXP, the SAS System initially allocates 129 disk blocks for a dataset. This initial allocation is called ALQ, or allocation quantity. Each time the dataset is extended, another 384 blocks are allocated on the disk; this allocation is called DEQ, or default file extension quantity. OpenVMS maintains a bit map file on each disk (BITMAP.SYS) that identifies the blocks available for use. When a dataset is written and then extended, OpenVMS alternates between scanning the bit map to locate free blocks and actually writing the dataset. If the dataset is instead written with larger initial and extent allocations, writes proceed uninterrupted for longer periods of time, and at the hardware level the movement of the disk heads between the bit map and the dataset is minimized. The result is a reduction in I/O and elapsed time. Because large initial extents preallocate the space reserved for a dataset, fragmentation is also reduced, thereby cutting the time needed to read the dataset.

The recommended ALQ= value is the size of the dataset. In cases of uncertainty, an underestimated ALQ= can be used, with the DEQ= value covering the extents. The value of ALQ= ranges from 0 to 2,147,483,647 and the value of DEQ= ranges from 0 to 65,535. For example:

DATA X.BIGFILE (ALQ=750000 DEQ=25000);
  INPUT VAR1-VAR100 $200.;
RUN;

Caution must be exercised not to use ALQ= or DEQ= values that are incompatible with the dataset size, for they may result in performance degradation. The size of the dataset may be estimated from the number of observations, the number of variables, and the variable lengths.

III. OPTIMIZATION TECHNIQUES MORE RELEVANT OR UNIQUE TO OPENVMS

1. Redirecting the SAS Work Library

By default, the SAS work library is created in the directory where the program is running. Due to disk storage limitations, there may not be enough space there. There are two ways to tackle this problem. The first is to use the WORK= option at SAS invocation. For example:

$ SAS /WORK=DKA102:[WORKAREA] MYPROG.SAS

The other is to define the logical SAS$WORKROOT to point to the directory to be used for the SAS work library. It is recommended that the system manager set up a disk for that purpose and define a system-wide logical pointing to the work area. The logical should be defined as follows:

$ DEFINE/SYSTEM SAS$WORKROOT DKA102:[WORKAREA]

This disk must be defragmented periodically and the work files cleaned regularly. To use the cleanup utility, first declare a DCL symbol CLEANUP as follows:

$ CLEANUP == "$SAS$ROOT:[PROCS]CLEANUP.EXE"

To use it, just issue the following command:

$ CLEANUP SAS$WORKROOT

2. Removing Unnecessary Datasets from the Work Library

If the SAS work space is limited in disk allocation, any temporary dataset that is created but not used later in the program takes up disk storage unnecessarily. You can remove unused datasets with PROC DATASETS. The following is a basic example:

PROC DATASETS MEMTYPE=(CATALOG,DATA);
  DELETE ONE;
RUN;

In this example, the catalog and dataset ONE are permanently removed from the WORK library even before the program terminates.

Many OpenVMS novices make the mistake of using the same dataset name over and over. Unlike other platforms, OpenVMS does not overwrite the previous dataset; it creates a new version each time. Although this feature can be turned off, most sites prefer to keep it because it instantly backs up old files. Using the DATASETS procedure to remove a dataset deletes all of its versions. When processing large files on OpenVMS, it is recommended to use different names for different datasets or to use DCL within the SAS program to purge the work library regularly.
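The version mechanics can be sketched abstractly: every rewrite of a file produces a new version NAME.EXT;n, and a purge keeps only the highest-numbered version of each name. A rough illustration (plain Python; the file names are hypothetical):

```python
def purge(filenames):
    """Mimic DCL PURGE: keep only the highest version of each NAME.EXT;n entry."""
    latest = {}
    for entry in filenames:
        name, _, version = entry.rpartition(";")
        latest[name] = max(latest.get(name, 0), int(version))
    return sorted(f"{name};{version}" for name, version in latest.items())

# Three rewrites of ONE left three versions on disk; a purge keeps only ;3.
work_library = ["ONE.DAT;1", "ONE.DAT;2", "ONE.DAT;3", "TWO.DAT;1"]
print(purge(work_library))
```

Note that this models DCL PURGE only; as stated above, deleting a dataset through PROC DATASETS removes all of its versions, not just the older ones.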
For example, if the WORK library is in the directory where the SAS program runs, the following statement may be included in the program:

X 'PURGE [.SASW*]';

4. Disabling Disk Volume Highwater Marking

Highwater marking (HWM) is an OpenVMS security attribute which guarantees that users cannot read data they have not written: the system erases the previous contents of disk blocks for files that are opened for random access. This creates extra overhead every time a dataset is created or extended. Since all SAS datasets are random-access files, there is a performance penalty in pre-zeroing, increased I/O, and increased elapsed time. The following are examples of DCL commands to turn off highwater marking:

$! USE AT INITIALIZATION TIME
$ INITIALIZE/NOHIGHWATER DKA100 USERDISK1
$! USE FOR AN ACTIVE DISK
$ SET VOLUME/NOHIGHWATER DKA102

Turning off highwater marking can significantly reduce elapsed time and I/O, especially for programs that are write-intensive. The only cost of turning off this attribute is that some OpenVMS sites may require the highwater marking feature to remain enabled for security purposes.

5. Disk Defragmentation

Disk defragmentation is the process that makes files physically contiguous. Contiguous files can be accessed with fewer I/O operations than non-contiguous files: on a defragmented disk, datasets are kept contiguous, so after one I/O operation the disk head is well positioned for the next. It is recommended to maintain frequently accessed datasets on a defragmented disk. Running a SAS program on a defragmented disk can decrease I/O and elapsed time. However, defragmenting can prove costly because of the time and effort needed to defragment disks regularly, or the expense of acquiring additional disk drives. The two ways to defragment a disk are to do an image BACKUP and RESTORE to the target disk, or to use a commercially available disk defragmentation product. Caution should be exercised with commercial defragmentation products, because they may corrupt datasets that are in use during defragmentation. Fragmentation may also be reduced by performing a disk-to-disk image backup without using the /SAVE_SET qualifier.

6. Caching and Buffering Datasets for Sequential Writes and Reads

When your programs constantly perform sequential I/O operations, the CACHESIZ= and BUFSIZE= options may be beneficial. The host CACHESIZ= option controls the buffering of dataset pages during I/O operations. The BUFSIZE= option sets the dataset's page size when the dataset is created. Note that BUFSIZE= can be set only at file creation: it sets the SAS internal page size for the dataset and, once set, becomes a permanent attribute of the file that cannot be changed. The CACHESIZ= option, on the other hand, can be changed any time a file is opened and is in effect only for the life of the currently open file. While BUFSIZE= can appear only as a dataset option, CACHESIZ= can appear as a dataset option or on a LIBNAME statement that uses the Base engine. If appropriate values are chosen for a particular dataset, elapsed time and I/O decrease significantly. When your observation size is large, you may also waste a great deal of space in the dataset if you do not choose an appropriate BUFSIZE=: if, for example, BUFSIZE= is set to 51,200 and the last page contains only 5,000 bytes, you waste over 45,000 bytes, or 90 disk blocks. The following are examples of the use of the BUFSIZE= and CACHESIZ= options:

LIBNAME X '[ ]';
DATA X.BIGFILE (BUFSIZE=63488);
  SET ONE;
RUN;

or

LIBNAME CACHE '[ ]';
DATA CACHE.BIGFILE (CACHESIZ=65024);
RUN;

7. Installing SAS System Images

SAS images are collections of procedures and data bound together by the Linker utility to form executable programs. Installing SAS images can conserve memory, because only one copy of the code needs to be in memory at any time and many users can access it concurrently. Installing SAS known images can also significantly decrease the elapsed time of SAS startup. Installing images is most effective on systems where two or more users run SAS concurrently. For example, the following commands install the core set of SAS images for Release 6.12 of SAS for AXP OpenVMS:

$ INSTALL :== $SYS$SYSTEM:INSTALL/COMMAND
$ INSTALL ADD/OPEN/HEADER/SHARE SAS$ROOT:[IMAGE]SAS612.EXE
$ INSTALL ADD/OPEN/HEADER/SHARE SAS$LIBRARY:SABXSPH.EXE
$ INSTALL ADD/OPEN/HEADER/SHARE SAS$LIBRARY:SASDS.EXE
$ INSTALL ADD/OPEN/HEADER/SHARE SAS$LIBRARY:SABMOTIF.EXE
$ INSTALL ADD/OPEN/HEADER/SHARE SAS$LIBRARY:SASMSG.EXE

8. Increasing the Page File Quota

To process a large file, the page file quota (PGFLQUOTA) of your OpenVMS account needs to be high, because the page file quota determines the virtual memory available to your SAS process. If your quota is too low, your program will run out of memory; depending on the dataset size, the quota may need to be increased further. SAS Institute recommends an initial page file quota of 150,000. To check how your program is using the page file, insert the following statement in your programs:

X '@SAS$ROOT:[USAGE]MEMCHK.COM';

This command file, provided by SAS Institute, reports the current values of the various parameters and quotas and the levels of memory that have been used. It may be used in conjunction with the OpenVMS accounting utility to determine the optimal quota.
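The page-waste arithmetic from the caching discussion above is easy to verify: with BUFSIZE= set to 51,200 and only 5,000 bytes used on the last page, 46,200 bytes — about 90 of OpenVMS's 512-byte disk blocks — go unused. A back-of-envelope check (plain Python, illustration only):

```python
BLOCK_BYTES = 512  # size of an OpenVMS disk block

def last_page_waste(bufsize, bytes_used_on_last_page):
    """Unused bytes on a dataset's final page, and the whole disk blocks
    that unused space represents."""
    wasted = bufsize - bytes_used_on_last_page
    return wasted, wasted // BLOCK_BYTES

wasted_bytes, wasted_blocks = last_page_waste(51_200, 5_000)
print(wasted_bytes, wasted_blocks)  # 46200 bytes, 90 whole blocks
```

The same arithmetic can be run against any candidate BUFSIZE= value before creating a large dataset, since the page size cannot be changed afterwards.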
9. Batch Processing

If a SAS program takes several hours to run and is started from an interactive terminal session (e.g., $ SAS MYPROG.SAS), any problem with your terminal (a power failure, a frozen screen) can stop the program after several hours of execution. To avoid this problem, run the program in batch mode. After you submit a batch job, your terminal session is free for further programming, and you can request that OpenVMS notify you when the program finishes. In most organizations, CPUs and I/O subsystems are almost idle at night and under intense pressure during the day; batch processing can reduce this imbalance by scheduling SAS jobs at night when the system is not busy. Batch is not suitable if the program requires user input during execution. To run SAS in batch mode, edit a DCL command file (e.g., MYJOB.COM) that includes all the programs you need to run, and issue the following command:

$ SUBMIT/NOTIFY/AFTER=08-OCT-1997:01:30 MYJOB.COM

The programs in the DCL command file MYJOB.COM are scheduled to run on October 8, 1997 at 1:30 a.m. Running a program in batch mode is an efficient way to use system resources; the elapsed time depends on the number of other jobs running at the same time.

IV. CONCLUSION

Processing large files is becoming increasingly common as more organizations look for new ways to leverage existing data, and programmers are being asked to process large files more frequently than in the past. The optimization techniques discussed in this paper will help programmers get more out of existing computer resources.

V. BIBLIOGRAPHY

SAS Institute references:

Installation Instructions and System Manager's Guide, Release 6.12 of the SAS System under OpenVMS for AXP Systems.
SAS Language Reference, Version 6, First Edition.
SAS Companion for the VMS Environment, Version 6, Second Edition.
SAS Programming Tips: A Guide to Efficient SAS Processing.

Digital Equipment Corporation references:

Guide to OpenVMS File Applications, March 1994.
OpenVMS System Manager's Manual: Essentials, March 1994.
OpenVMS System Manager's Manual: Tuning, Monitoring, and Complex Systems, March 1994.
OpenVMS DCL Dictionary, A-M, March 1994.
OpenVMS DCL Dictionary, N-Z, March 1994.

ACKNOWLEDGMENTS

The author wishes to thank Linda McGrillies, Rama Jampani, and Guy Noce for their invaluable assistance.

TRADEMARKS

SAS is a registered trademark of SAS Institute Inc. in the USA. ® indicates registration. AXP and OpenVMS are registered trademarks of Digital Equipment Corporation. Other brand and product names are registered trademarks of their respective companies.

AUTHOR INFORMATION

The author may be contacted by e-mail at: [email protected]