Optimization Techniques for Processing Large Files in an OpenVMS
Environment
Adrien Ndikumwami, Regional Economic Studies Institute, Towson, MD
ABSTRACT
Despite recent technological advances in the computer
industry, in terms of both enhanced processing power and
larger storage devices, the demand for these resources
continues to grow considerably. Unlike in the past, large
data files spanning several gigabytes are now routinely
processed to generate regular and ad hoc reports. Such
regular use of large files is forcing programmers to look for
new techniques that enable optimal use of scarce computer
resources. While the SAS® System provides advanced
capabilities for handling and manipulating large files, an
appreciation of the host operating system is essential if the
programmer is to use scarce computer resources efficiently.
For large data files, the way data are read, processed, and
stored can make a great deal of difference in the optimal
utilization of resources. This paper examines the issues that
a SAS programmer needs to address while processing large
files in an OpenVMS environment. Optimization techniques
related to creating and subsetting datasets, sorting,
indexing, compressing, and running jobs in batch mode, as
well as certain other host-specific issues, are explored as
they apply to an OpenVMS environment.

I. INTRODUCTION
It seems paradoxical that despite the availability of higher
processing power and larger storage space, there is an
ever-growing need for these resources. The reality is that,
unlike in the past, large data files spanning several
gigabytes are now routinely processed to generate regular
and ad hoc reports. In the past, most of these operational
data files could not be used on a regular basis due to
computing-resource constraints. The proliferation of large
data files and the increasing need to access such files on a
regular basis for ad hoc reporting, micro-simulation, trend
analysis, and data mining poses new challenges to the
programmer.

Programmers constantly look for alternative methods to
make optimal use of existing computer resources. While
the SAS System provides advanced capabilities for handling
and manipulating large files, an appreciation of the host
operating system is essential if the programmer is to
manage scarce computer resources efficiently. While some
of the techniques discussed in this paper are applicable to
all operating systems, the focus of this paper is on
processing large data files in an OpenVMS environment.

Optimization techniques that are relevant to most operating
systems are explored first, along with their OpenVMS-specific
features. In the later sections, techniques that are either
unique to or more relevant in an OpenVMS system are
discussed. Wherever appropriate, examples are provided to
demonstrate the use of these techniques. The costs and
benefits with regard to the use of computer resources (I/O,
disk storage, CPU, and memory) are also discussed for each
optimization technique.

II. OPTIMIZATION TECHNIQUES THAT ARE
RELEVANT TO OTHER OPERATING
SYSTEMS

1. Reducing dataset observation length
Reducing the observation length of a dataset reduces I/O
and thereby shortens the elapsed time, as the amount of
input data decreases. There are several methods that help
decrease the length of an observation. The more popular
techniques are the DROP= and KEEP= dataset options,
the LENGTH statement, and the data compression feature.

a. The KEEP= and DROP= Dataset Options
When creating a dataset, you should keep only the
variables needed for further or future processing. You can
accomplish this by using the KEEP= or DROP= dataset
options. By reducing the number of variables in a large
dataset, you reduce the amount of data to be read or
written, which results in a significant reduction in I/O and
elapsed time. This technique is beneficial when the
variables being dropped are no longer needed or can be
recreated from the remaining variables. If you are unsure
whether to retain or drop certain variables, keep them: the
cost of recreating dropped variables far outweighs the cost
of retaining them. Assuming dataset ONE has variables A,
B, C, X, Y, and Z, the following examples demonstrate the
use of the DROP= and KEEP= options:

DATA TWO;
SET ONE (DROP=C Z);
RUN;

or

DATA TWO;
SET ONE (KEEP=A B X Y);
RUN;

b. The LENGTH statement
LENGTH statements may be used in a data step to reduce
the number of bytes used for storage. The default length of
numeric variables in the SAS System is 8 bytes. Often, disk
storage is wasted because a smaller length could be used
to store the same value that is stored in 8 bytes, without
compromising precision. See Table 1 for the largest integer
that can be represented exactly at each length. Using the
LENGTH statement can significantly reduce storage space,
although CPU time may increase slightly. This feature
comes in handy when a dataset contains numeric variables
with small values. Using the LENGTH statement is not
recommended for storing fractional numeric values,
because the precision may be lost.
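The limits in Table 1 follow a simple pattern: an 8-byte SAS numeric can hold integers exactly up to 2^53 − 1, and each byte trimmed from the length gives up 8 bits of that range, so the largest exact integer at length L is 2^(8L − 11) − 1. The following sketch (not part of the paper; Python used purely for illustration) reproduces the table's values:

```python
# Largest integer stored exactly by a SAS numeric variable of a given
# length on AXP OpenVMS, per Table 1: each byte below the 8-byte
# default removes 8 bits of precision, so the limit is 2**(8*L - 11) - 1.
def largest_exact_integer(length_bytes: int) -> int:
    if not 3 <= length_bytes <= 8:
        raise ValueError("SAS numeric lengths on OpenVMS range from 3 to 8 bytes")
    return 2 ** (8 * length_bytes - 11) - 1

if __name__ == "__main__":
    for length in range(3, 9):
        print(f"{length} bytes: {largest_exact_integer(length):,}")
```

Running the loop prints exactly the six values listed in Table 1, from 8,191 at 3 bytes to 9,007,199,254,740,991 at 8 bytes.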
Table 1: Largest integer that can be stored exactly by SAS
variables on AXP OpenVMS, by variable length

Length (bytes)    Largest integer represented exactly
3                 8,191
4                 2,097,151
5                 536,870,911
6                 137,438,953,471
7                 35,184,372,088,831
8                 9,007,199,254,740,991

The following is an example of how to use the LENGTH
statement:

DATA ONE;
LENGTH X Y Z 4;
INPUT X Y Z;
RUN;

c. Dataset Compression
The SAS System is equipped with a powerful compression
algorithm. This function treats an entire observation as a
string of bytes, ignoring variable types and boundaries.
Data compression is most helpful when a dataset contains
repeating numeric or character data. Compression reduces
I/O but uses more CPU time. In addition, disk storage
space is reduced by the compression factor. If your site
does not have many users, does not run other CPU-intensive
applications, or CPU time is free (no monetary charge for
using the CPU), then compression is an ideal technique.
Over the last decade, the speed of OpenVMS processors
has increased by a factor of more than 20, while disk I/O
systems have merely doubled in speed. Shifting the I/O
pressure to the CPU is therefore a sensible way to address
this imbalance. However, if the dataset does not have many
repeating values, you should avoid compression, because
performance may get worse. Under certain circumstances,
SAS compression may actually increase the size of a large
dataset; in that case, I/O, CPU time, and elapsed time all
increase. You can use SAS compression as a global option,
as in the following example:

OPTIONS COMPRESS=YES;

Alternatively, you can use it as a dataset-specific option to
compress a particular dataset, as in the following example:

LIBNAME X '[ ]';
DATA X.BIGDATA (COMPRESS=YES);
INPUT VAR1-VAR100 $200.;
RUN;

2. Sorting
Sorting may consume a considerable amount of computing
resources, and in many cases a sort fails because work
space has not been allocated properly. By default, sorting
on AXP OpenVMS is routed to SYS$SCRATCH, which in
turn points to SYS$LOGIN. Typically, SYS$LOGIN resides
on a quota-enabled disk that may not have enough space to
sort a large dataset. For large datasets, it is advisable to
use the host sort. Sorting requires scratch space of about
2 to 3 times the size of the input dataset. If your LOGIN
disk has a small quota and you need to process a large file,
the best approach is to point SYS$SCRATCH to a disk with
more space. The following command defines another sort
area and should be included in LOGIN.COM:

$ DEFINE/PROCESS SYS$SCRATCH DKA102:[SORTAREA]

Beginning with maintenance release TS048 of the SAS
System for AXP OpenVMS, a new SORTWORK option
directs the SAS System to create up to six SORTWORK
logical names that are used by the OpenVMS host sort for
temporary work space. The SORTWORK option is used as
follows:

LIBNAME SWORK1 'DKB2:[SORTAREA1]';
LIBNAME SWORK2 'DKB2:[SORTAREA2]';
OPTIONS SORTWORK=(SWORK1, SWORK2);

The SORTWORK option accepts either SAS librefs
assigned in a LIBNAME statement or OpenVMS paths,
which must be enclosed in single or double quotes. To
prevent SAS from using the SORTWORK areas, use the
following statement prior to sorting:

OPTIONS NOSORTWORK;

The SAS System ships with two command procedures,
SORTCHK.COM and SORTSIZE.COM, located in the
SAS$ROOT:[USAGE] directory. These procedures can
help in gathering information about the system; their usage
is explained at the beginning of each file.

3. Indexing
Indexing can enhance system performance when a large
dataset is created once and then read many times using a
WHERE clause or BY-group processing. When an index
exists, an observation can be read directly. Without an
index, the SAS System starts from the top of the dataset
and reads all observations sequentially. Only then will it
apply the WHERE clause or BY-group statement. It is
recommended to avoid indexing a dataset that is constantly
rewritten or updated, or one that needs to be read in its
entirety: indexing consumes more resources than sequential
reading when a dataset is read without subsetting.
However, if you subset using a WHERE clause or a
BY-group statement, both I/O and elapsed time are
reduced. The following is an example of indexing:
LIBNAME EMP '[ ]';
PROC DATASETS LIBRARY=EMP;
MODIFY EMPLOYEE;
INDEX CREATE SSNO;
INDEX CREATE BTHDATE;
RUN;

The above code generates two simple indexes on the
EMPLOYEE dataset, one for each of two variables: social
security number and birth date.

4. Dataset Space Preallocation
Starting with the second 6.09 maintenance release for
OpenVMS on AXP, the SAS System initially allocates 129
disk blocks for a dataset. This initial allocation is called the
ALQ, or allocation quantity. Each time the dataset is
extended, another 384 blocks are allocated on the disk.
This allocation is called the DEQ, or default file extension
quantity. OpenVMS maintains a bit map file on each disk
(BITMAP.SYS) that identifies the blocks available for use.
When a dataset is written and then extended, OpenVMS
alternates between scanning the bit map to locate free
blocks and actually writing the dataset. If the dataset is
instead written with larger initial and extent allocations,
writes to the dataset proceed uninterrupted for longer
periods of time; at the hardware level, movement of the disk
heads between the bit map and the dataset is minimized.
The result is a reduction in I/O and elapsed time. Because
large initial extents preallocate the space reserved for a
dataset, disk fragmentation is also reduced, thereby cutting
the time needed to read the dataset. The recommended
ALQ= value is the size of the dataset. When the size is
uncertain, an underestimated ALQ= can be used, with the
DEQ= value covering the extents. The value of ALQ=
ranges from 0 to 2,147,483,647 and the value of DEQ=
ranges from 0 to 65,535. For example:

DATA X.BIGFILE (ALQ=750000 DEQ=25000);
INPUT VAR1-VAR100 $200.;
RUN;

Caution must be exercised not to use ALQ= or DEQ=
values that are incompatible with the dataset size, for they
may result in performance degradation. The size of the
dataset may be estimated from the number of observations,
the number of variables, and the variable lengths.

III. OPTIMIZATION TECHNIQUES MORE
RELEVANT OR UNIQUE TO OPENVMS

1. Redirecting the SAS Work Library
By default, the SAS WORK library is created in the
directory where the program is running. Due to disk storage
limitations, there may not be enough space there. There are
two ways to tackle this problem. The first is to use the
WORK= option at SAS invocation. For example:

$ SAS/WORK=DKA102:[WORKAREA] MYPROG.SAS

The other way is to define the logical name
SAS$WORKROOT and point it to the directory to be used
for the SAS WORK library. It is recommended that the
system manager set up a disk for this purpose and define a
system-wide logical name pointing to the work area. The
logical name should be defined as follows:

$ DEFINE/SYSTEM SAS$WORKROOT DKA102:[WORKAREA]

This disk must be defragmented periodically and the work
files cleaned regularly. To use the cleanup utility, first
declare a DCL symbol CLEANUP as follows:

$ CLEANUP == "$SAS$ROOT:[PROCS]CLEANUP.EXE"

To use it, just issue the following command:

$ CLEANUP SAS$WORKROOT

2. Removing Unnecessary Datasets from the
Work Library
If the SAS workspace has a limited disk allocation, any
temporary dataset that is created but not used later in the
program takes up disk storage unnecessarily. The SAS
System gets rid of unused datasets through PROC
DATASETS. The following is a basic example of its use:

PROC DATASETS MEMTYPE=(CATALOG DATA);
DELETE ONE;
QUIT;

In this example, the catalog and dataset ONE are
permanently removed from the WORK library even before
the program terminates. Many OpenVMS novices make the
mistake of using the same dataset name over and over.
Unlike other platforms, OpenVMS does not overwrite the
previous dataset; it creates a new version of the file for each
repeated dataset. Although this feature can be turned off,
most sites prefer to keep it because it provides an instant
backup of old files. Using the DATASETS procedure to
remove a dataset deletes all of its versions. When
processing large files on OpenVMS, it is recommended
either to use different names for different datasets or to use
DCL within the SAS program to purge old versions
regularly. For example, if the WORK library is in the
directory where the SAS program is run, the following
statement may be included in the program:
X 'PURGE [.SASW*]';

4. Disabling Disk Volume Highwater Marking
Highwater marking (HWM) is an OpenVMS security
attribute that guarantees users cannot read data they have
not written. The system erases the previous contents of disk
blocks for files that are opened for random access, which
creates more overhead every time a dataset is created or
extended. Since all SAS datasets are random-access files,
there is a performance penalty in the form of pre-zeroing,
increased I/O, and increased elapsed time. The following
are examples of DCL commands to turn off highwater
marking:

$! USE AT INITIALIZATION TIME
$ INITIALIZE/NOHIGHWATER DKA100: USERDISK1

$! USE FOR AN ACTIVE DISK
$ SET VOLUME/NOHIGHWATER DKA102:

Turning off highwater marking can significantly reduce
elapsed time and I/O, especially for programs that are
write-intensive. The only cost of turning off this attribute is
that some OpenVMS sites may require the highwater-marking
feature to remain enabled for security purposes.

5. Disk Defragmentation
Disk defragmentation is the process that makes files
physically contiguous. Contiguous files can be accessed
with fewer I/O operations than non-contiguous files. On a
defragmented disk, datasets are kept contiguous; after one
I/O operation, the disk head is well positioned for the next.
It is recommended to maintain frequently accessed datasets
on a defragmented disk. Running a SAS program on a
defragmented disk can decrease I/O and elapsed time.
However, defragmenting can prove costly because of the
time and effort needed to defragment disks regularly, or the
expense of acquiring additional disk drives. The two ways to
defragment a disk are to do an image BACKUP and
RESTORE to the target disk, or to use a commercially
available disk defragmentation product. Caution should be
exercised in using commercial defragmentation products
because they may corrupt datasets that are open
concurrently. Fragmentation may also be reduced by
performing a disk-to-disk image backup without using the
/SAVE_SET qualifier.

6. Caching and Buffering Datasets for
Sequential Writes and Reads
When your programs constantly perform sequential I/O
operations, the CACHESIZ= and BUFSIZE= options may
be beneficial. The host-specific CACHESIZ= option
controls the buffering of dataset pages during I/O
operations. The BUFSIZE= option sets the dataset's page
size when the dataset is created; it is important to note that
BUFSIZE= can be set only at file creation. The CACHESIZ=
option, on the other hand, can be changed any time a file is
opened and is in effect only for the life of the currently open
file. While the BUFSIZE= option can appear only as a
dataset option, CACHESIZ= can appear as a dataset option
or on a LIBNAME statement that uses the Base engine. If
appropriate values are chosen for a particular dataset, there
is a significant decrease in elapsed time and I/O. When
your dataset's observation size is large, you may waste a
great deal of space in the dataset if you do not choose an
appropriate BUFSIZE=. Before you set BUFSIZE=, you
should know that it sets the SAS internal page size for the
dataset and, once set, becomes a permanent attribute of
the file that cannot be changed. If, for example, BUFSIZE=
is set to 51,200 and the last page contains only 5,000 bytes,
you could be wasting over 45,000 bytes, or 90 disk blocks.
The following are examples of the use of the BUFSIZE=
and CACHESIZ= options:

LIBNAME X '[ ]';
DATA X.BIGFILE (BUFSIZE=63488);
SET ONE;
RUN;

or

LIBNAME CACHE '[ ]';
DATA CACHE.BIGFILE (CACHESIZ=65024);
SET ONE;
RUN;

7. Installing SAS System Images
SAS images are collections of procedures and data bound
together by the Linker utility to form an executable program.
Installing SAS images can conserve memory because only
one copy of the code needs to be in memory at any time,
and many users can access that code concurrently.
Another benefit of installing SAS known images is that the
elapsed time of system startup may decrease significantly.
Installing images is most effective on systems where two or
more users run SAS concurrently. For example, the
following commands install the core set of SAS images for
Release 6.12 of SAS for AXP OpenVMS:

$ INSTALL :== $SYS$SYSTEM:INSTALL/COMMAND
$ INSTALL ADD/OPEN/HEADER/SHARE SAS$ROOT:[IMAGE]SAS612.EXE
$ INSTALL ADD/OPEN/HEADER/SHARE SAS$LIBRARY:SABXSPH.EXE
$ INSTALL ADD/OPEN/HEADER/SHARE SAS$LIBRARY:SASDS.EXE
$ INSTALL ADD/OPEN/HEADER/SHARE SAS$LIBRARY:SABMOTIF.EXE
$ INSTALL ADD/OPEN/HEADER/SHARE SAS$LIBRARY:SASMSG.EXE
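The page-size waste described in the caching discussion above is simple arithmetic: every page of the dataset occupies the full BUFSIZE= on disk, so a partially filled last page wastes the difference. The following sketch (illustrative only, not from the paper; it assumes the 512-byte OpenVMS disk block) checks the figures given for the BUFSIZE= example:

```python
# Estimate the space wasted by a partially filled last page of a SAS
# dataset, given its page size (BUFSIZE=) and the bytes actually used
# on that last page. OpenVMS disk blocks are 512 bytes.
BLOCK_SIZE = 512

def last_page_waste(bufsize: int, bytes_used_on_last_page: int) -> tuple[int, int]:
    wasted_bytes = bufsize - bytes_used_on_last_page
    wasted_blocks = wasted_bytes // BLOCK_SIZE
    return wasted_bytes, wasted_blocks

if __name__ == "__main__":
    # The paper's example: BUFSIZE=51,200 with only 5,000 bytes on the last page.
    print(last_page_waste(51200, 5000))  # → (46200, 90)
```

This confirms the figures quoted above: 46,200 wasted bytes, or 90 full disk blocks.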
8. Increasing the Page File Quota
To process a large file, the page file quota (PGFLQUOTA)
of your OpenVMS account needs to be high, because the
page file quota determines the virtual memory allocated to
your SAS process. Depending on the dataset size, this
quota may need to be increased; if your quota is too low,
your program will run out of memory. SAS Institute
recommends an initial page file quota of 150,000. To check
how your program is using the page file, just insert the
following statement in your programs:

X '@SAS$ROOT:[USAGE]MEMCHK.COM';

This command file, provided by SAS Institute, reports the
current values of the various parameters and quotas, and
the levels of memory that have been used. It may be used
in conjunction with the OpenVMS accounting utility to
determine the optimal quota.

9. Batch Processing
If a SAS program takes several hours to run and is run
interactively from your terminal (e.g., $ SAS MYPROG.SAS),
any problem with your terminal (power failure, frozen
screen) can cause the program to stop after several hours
of execution. To avoid this problem, you should run the
program in batch mode. After submitting a batch job, your
terminal session is free for further programming, and you
can request that OpenVMS notify you when the program
finishes. In most organizations, the CPU and I/O subsystems
are almost idle at night and under intense pressure during
the day; batch processing can reduce this problem by
scheduling SAS jobs at night when the system is not busy.
Batch is not suitable if the program requires user input
during execution. To run SAS in batch mode, edit a DCL
command file (e.g., MYJOB.COM) that includes all the
programs you need to run, and issue the following
command:

$ SUBMIT/NOTIFY/AFTER=08-OCT-1997:01:30 MYJOB.COM

The programs in the DCL command file MYJOB.COM are
scheduled to run on October 8, 1997 at 1:30 a.m. Running
a program in batch mode is an efficient way of using system
resources; the elapsed time depends on the number of
other jobs running at the same time.

IV. CONCLUSION
Processing large files is becoming increasingly common as
more organizations look for new ways to leverage existing
data. Programmers are now asked to process large files
more frequently than in the past. The optimization
techniques discussed in this paper will help programmers
get more out of existing computer resources.

V. BIBLIOGRAPHY

SAS Institute references:
Installation Instructions and System Manager's Guide,
Release 6.12 of the SAS System under OpenVMS for AXP
Systems.
SAS Language Reference, Version 6, First Edition.
SAS Companion for the VMS Environment, Version 6,
Second Edition.
SAS Programming Tips: A Guide to Efficient SAS
Processing.

Digital Equipment Corporation references:
Guide to OpenVMS File Applications, March 1994.
OpenVMS System Manager's Manual: Essentials, March
1994.
OpenVMS System Manager's Manual: Tuning, Monitoring,
and Complex Systems, March 1994.
OpenVMS DCL Dictionary, A-M, March 1994.
OpenVMS DCL Dictionary, N-Z, March 1994.

ACKNOWLEDGMENTS
The author wishes to thank Linda McGrillies, Rama
Jampani, and Guy Noce for their invaluable assistance.

TRADEMARKS
SAS is a registered trademark of SAS Institute Inc. in the
USA. ® indicates USA registration.
AXP and OpenVMS are registered trademarks of Digital
Equipment Corporation.
Other brand and product names are registered trademarks
of their respective companies.

AUTHOR INFORMATION
The author may be contacted by e-mail at:
[email protected]