Some Considerations in Designing SAS®-Based Data Management Systems
for Large International Epidemiologic Studies
C. Johnson: Merck Research Laboratories, Epidemiology
M. Arrighi: University of North Carolina, Department of Epidemiology
P. Van Grevenhof: Mayo Clinic, Information Services, Statistical Systems
R. Mueller: Mayo Clinic, Information Services, Statistical Systems
L. Panser: Mayo Clinic, Health Sciences Research, Clinical Epidemiology Section
R. Lee: University of Edinburgh, Department of Public Health Sciences, Scotland
C. Chute: Mayo Clinic, Medical Information Resources
I. Abstract
Special considerations apply when processing large
databases in a PC SAS® environment. A SAS®-based
data management system was developed to handle data
stemming from a large international epidemiologic study
to investigate the natural history of benign prostatic
hyperplasia.
Over 4000 men aged 40-79 are participating from communities in Olmsted County, MN and the Stirling district of Scotland.
Follow-up visits are
planned. An eventual goal of combining data from
these transatlantic sites focused initial efforts on
comparability of databases, resulting in the exclusive use
of SAS® for both data storage and data processing. The
massive amount of data collected via self-administered
questionnaires, urinary flow rate devices and clinical
evaluation along with the desire to retain processing on
microcomputers forced considerations of efficiency in
designing databases. Procedures for data flow, including
data transfer, backup and error correction were
developed along with approaches for developing
summary analysis files. This paper will summarize the
procedures developed to process data from such a large
study where diverse data entry procedures at each site
were used. The design of comparable databases and
data flow structure will be reviewed, with focus on
approaches to increase efficiency and handle difficulties
encountered in the PC SAS environment.
II. Introduction
Benign prostatic hyperplasia (BPH) is a common
condition in older men in the U.S. and worldwide (Barry
1990). BPH has been defined as enlargement of the
prostate, causing varying degrees of bladder outlet
obstruction (Merck manual, Guess), with associated
urinary symptoms.
Community-based studies were
initiated in Olmsted County, MN and the Stirling district
of Scotland in late 1989 and early 1990 to study the natural history of BPH in men aged 40-79. Baseline data collected for both the U.S. and Scotland studies consisted of urinary flow rates (ml/second) and questionnaire information collected from eligible men in the community (in-home component), and clinical evaluation for a subset of participants (clinic cohort).

The objective of both investigations was to study the natural history of BPH in terms of symptomatology, objective urinary flow capacity and prostate growth as men age. The studies were designed for long-term follow-up, with information collected approximately every two years. Follow-up consists of mail-based questionnaires and repeat urologic evaluation of the clinic cohort. Since it was known that comparisons of the U.S. and Scotland studies were of interest, combined or at least comparable databases were necessary to allow summarization and statistical analyses by a statistician at Merck Research Laboratories. Such considerations were important in designing the database and developing data processing procedures.

This paper will focus on the practical strategies used in designing a common database as well as procedures for transferring and processing data from these large studies in a microcomputing environment. The review will be more practical than technical in focus, and will conclude with a discussion of planned improvements based on experience with baseline data collection procedures.

III. Design of Studies
Olmsted County Community Study
Identification of more than 96% of eligible men was
possible from records of three major medical facilities
through the Rochester Project (Kurland), allowing
stratified random sampling of the Olmsted County
population based on age and residency. Residents
without conditions interfering with assessment of BPH
were invited by letter to participate in the study.
Participants were visited in their home where urinary
flow rate was measured, a validated questionnaire
(Epstein) was administered, and additional information
pertaining to family history of urologic disease and daily
medication use was collected. A total of 2119 men (55% of
eligible men) completed the baseline community aspect
of this study. A random subset of men participating in
the above in-home component was identified for more
thorough clinical evaluation (in-clinic component),
including examinations shown in Table 1.
Stirling District of Scotland Study
In the Scotland Stirling district, subjects comprised all
men between the ages of 40 and 79 drawn from the age-sex registers of three group general medical practices.
Subjects identified as eligible were seen at their health
center or visited in their home, and urinary flow rate
was measured. Symptomatology was solicited using a
short screening questionnaire at the time of the in-home
visit, and a separate questionnaire soliciting quality of
life and health-care seeking behavior information was
left with the participant to be returned by mail. An
estimated 3200 men were eligible to participate in the
community aspect of the Scotland study.
In two of the Stirling health centers, men who screened
positive for BPH according to predefined criteria were
brought into the clinic for urological examination
(Garraway), while all men from the third practice were
scheduled for clinical evaluation. Clinical information
collected was similar to that collected by Mayo Clinic in
the in-clinic component, as shown in Table 1.
IV. Designing Common Data Structures
The enumeration of the population and random
selection of eligible men in Olmsted County was
accomplished at Mayo Clinic. A comprehensive system
for generating appropriate invitation and reminder
letters, tracking status of participants in terms of
telephone calls, appointments and receipt of mail-based
information was developed and implemented by
Information Services, Statistical Systems and
Department of Health Sciences Research at Mayo
Clinic. Since this paper will focus on the development
of analysis files and a combined database, these
elaborate procedures will not be reviewed. Description
of corresponding procedures in Scotland will also be
omitted. Instead, we describe procedures for designing
a common database, individual site data entry, data
transfer and further processing necessary for building a
combined database and analysis files used by a Merck
statistician at University of North Carolina (UNC).
At the time of designing a common database, the primary statisticians had little experience with SAS/ACCESS® and limited experience with microcomputer-based databases. The hardware environments included VMS and a DOS-based microcomputer (80386) in Scotland, an IBM® 3090/600J mainframe at Mayo Clinic, and DOS-based microcomputers (80386) in North Carolina. Since data entry was being accomplished at Mayo Clinic using SAS/FSP®, and the statistician in Scotland planned to store data in SAS, the decision was made to design common SAS datasets to store data for these studies. Despite diverse data entry procedures at the two sites, and a variety of platforms, SAS provided a commonality among all sites. In addition, SAS was being used for the majority of the statistical analyses and summarizations. The portability and wide knowledge base of SAS, and the availability of SAS on DOS-based personal computers at all sites, further supported the use of PC SAS datasets. Transferring data to another database accessible by SAS at a later time was always considered an alternative.

Data collected in the Olmsted County and Stirling district studies were comparable with the exception of certain design characteristics. Table 1 displays the general data components collected by the two studies, and clearly indicates the comparability of the components of data collection. However, when specific data elements were examined, three levels of comparability were apparent: a) identical information collected in exactly the same way, b) similar but not identical information, and c) information collected using different scales or methodology and considered incomparable. In order to design a common database, levels of comparability were reviewed, resulting in the generation of a detailed listing of data elements including the variable name, variable type and length, scale or format, and field content for each site. Although tedious, such enumeration allowed identifying discrepancies and helped ensure comparable information was stored in an identical format (same variable name, type and format) for later combined analyses.

Likewise, protection against inadvertent combination of unlike data elements was accomplished by ensuring that incomparable elements were stored as separate variables. An example was that some questions in the Olmsted County and Stirling district questionnaires had scaled responses which were reversed, or not identical. Data were initially stored in the way they were collected to avoid problems with data entry and updating, but as separate variables for Mayo Clinic and for Stirling. A conversion (of the Scotland questions) took place before data were combined in a common data file. SAS variable labels were used to store warnings that scales
are reversed at individual sites, and these labels were
permanently stored with the SAS dataset, and are listed
on any PROC CONTENTS output generated.
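A minimal sketch of this labeling convention (the dataset and variable names here are hypothetical, not the study's actual ones) shows a warning embedded in a permanent variable label and surfaced through PROC CONTENTS:

   data scot.sympt;
      set work.sympt;
      /* permanent label carries a warning about the reversed scale */
      label bother = 'Bother item - SCALE REVERSED relative to Mayo version';
   run;

   proc contents data=scot.sympt;
   run;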
Use of SAS provided a commonality among sites, was
easily implemented and allowed simple access for data
retrieval, summarization and analysis. A tradeoff was a
lack of certain built-in database tools common in other
database management system software, and certain
inefficiencies particular to SAS storage and processing.
A. Review of Flow of Data

A flowchart depicting data as received from both sites is shown in Figure 1. Errors identified after data entry through range and consistency checking were resolved at the respective site. Data were then downloaded if necessary, and diskettes were sent to UNC for further processing, ad hoc data checking and eventual incorporation into a combined database. A brief review of the varying procedures at each site is warranted before discussion of that processing may proceed.

1. Data Entry & Error Resolution at Mayo Clinic

The database at Mayo Clinic was initially designed under stringent time constraints, and has been subsequently modified. Our review will describe the initial database and its impact on subsequent procedures, although improvements in database design will be reviewed in a later section.

Data for the Olmsted County study were entered at Mayo Clinic using SAS/FSP on an IBM 3090/600J mainframe (Figure 1), with range checks at data entry. A random sample was identified for double key entry, but error rates were too low to warrant double entry on an ongoing basis. Consistency checks were initially designed by statisticians and programmers in conjunction with Mayo Clinic's project coordinator and data entry supervisor to flag questionable but not necessarily impossible data values for additional verification. These consistency checks were implemented at Mayo Clinic after data entry to allow resolution closer to the data source. A Boolean flag stored in each file indicated whether such questionable data for a participant had been verified or corrected.
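A minimal sketch of this kind of post-entry consistency check, with hypothetical variable names and cutoffs rather than the study's actual rules, flags questionable but possible values and carries a verification flag with each record:

   data suspect;
      set mayo.uroflow;
      /* unusually high peak flow or low voided volume: possible, but query it */
      if peakflow > 50 or voidvol < 50 then flag = 1;
      else flag = 0;
      /* verified = 1 once the site has confirmed or corrected the value */
      if flag = 1 and verified ne 1 then output;
   run;

   proc print data=suspect;
      var id peakflow voidvol verified;
      title 'Questionable uroflow values awaiting verification';
   run;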
2. Data Entry & Error Resolution in Scotland

In Scotland, data were key-entered by a data preparation firm into separate flat files with 100% key verification. No range or consistency checking was done at data entry. Data were transferred from diskette to a microcomputer hard disk for storage. Five PC SAS data files were created from these ASCII files to store information from both the in-home and in-clinic components. The content of these five files differed slightly from the five files created at Mayo Clinic. Range and consistency checks comparable to those used by Mayo Clinic were then performed, and errors were resolved. Once data were considered clean, the SAS files were sent to UNC on diskette.
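A minimal sketch of the step from key-entered ASCII files to PC SAS datasets (the file name, record layout and variable names are hypothetical):

   data scot.homeq;
      infile 'a:\homeq.dat';
      /* fixed-column layout from the data preparation firm */
      input id 1-5 @6 centre $1. @7 (q01-q10) ($1.) flowrate 17-21 .1;
   run;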
B. Procedures to Build Combined Database

Data were exchanged to test procedures as early in the project as possible to help identify and resolve problems. Each data shipment included the entire database (all complete files) to help ensure that the databases at UNC agreed with those at the data source (Mayo Clinic, Stirling), and to avoid potential errors with partial updating. Data were received from both Scotland and Mayo Clinic in the form of PC SAS datasets on diskettes.

At Mayo Clinic, data were entered using SAS/FSP into five separate files pertaining to distinct components of data collected during the study. Each file was generally structured as one observation or record per participant, although multiple records were applicable for some subjects in two of the files. One Mayo Clinic file contained information pertaining to all enumerated men (more than 14,000) originally thought to be eligible for the study, but only data for eligible study participants were downloaded for further processing and analysis. A random number assignment was given to each potential subject in order to protect the confidentiality of patients, and no confidential patient information was available to personnel outside of Mayo Clinic.

Some additional safeguards against errors were used. For example, a file consisting of all eligible identification numbers was generated, and all files were checked against this file to ensure that ineligible men were not included in any data files.
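A minimal sketch of this eligibility check (dataset and variable names are hypothetical): any record whose identification number is not on the eligibility file is written out for investigation:

   proc sort data=mayo.homeq; by id; run;
   proc sort data=comb.elig;  by id; run;

   data badid;
      merge mayo.homeq(in=inq) comb.elig(in=inelig);
      by id;
      /* present in the data file but not on the eligible ID list */
      if inq and not inelig;
   run;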
After data were received and further screened for validity, data conversions were necessary to handle multiple questionnaire versions used at Mayo Clinic and some scale differences between the Mayo Clinic and Scotland questionnaires. Data from several files received from Mayo Clinic were normalized by restructuring each single observation per subject into multiple observations per subject. This reduced the size of the files dramatically, particularly for the file containing family history and daily medication, where the original file used arrays allowing storage of information for up to 10 brothers and up to 15 medications. Most of the original file consisted of missing data and was more efficiently stored in a normalized structure, using only 21% of the original space required.
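A minimal sketch of this normalization, with hypothetical variable names and only the family history portion shown: the wide array of brother fields becomes one observation per reported brother, and empty slots are simply not output:

   data comb.family(keep=id brothnum relage reldx);
      set mayo.family;
      array ages{10} bage1-bage10;
      array dxs{10}  bdx1-bdx10;
      do brothnum = 1 to 10;
         if ages{brothnum} ne . then do;
            relage = ages{brothnum};
            reldx  = dxs{brothnum};
            output;
         end;
      end;
   run;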
Analysis data files were created after necessary
conversions and normalization were performed. Such
files were needed to avoid error in summarization and
analysis of data from these studies.
Additional
processing before creating such files included the
calculation of age and numerous summary score indices
as well as selection of the relevant data item for analysis
(questionnaire or uroflow) where multiple data items
were available for a single participant.
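A minimal sketch of the kind of derivation done at this stage (variable names are hypothetical, and the score shown is a simple illustrative sum, not the study's validated index):

   data comb.analysis;
      set comb.homeq;
      /* age at interview from SAS date values */
      age = int((visitdt - birthdt) / 365.25);
      /* additive symptom score across seven questionnaire items */
      sympscor = sum(of q01-q07);
      keep id site age sympscor flowrate;
   run;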
V. Documentation

As with any study, the importance of documentation cannot be underemphasized in large studies involving multi-national sites. Processing data would be impractical without such documentation, and the potential for new personnel joining a project in progress further demands attention to detail. Good communication between sites is essential to achieve comparability.

Efficient and structured use of sub-directories, as well as file naming conventions for data and program files, was necessary due to the number of personnel involved in the project. Program and data files were organized under separate subdirectories that were linked to the first six characters of their names. Along with good program documentation, this approach aided in locating programs and data when necessary. The source of data (Mayo Clinic, Scotland, combined data), along with the type of data component, was also indicated by the name of the dataset through a filename coding technique. Such a convention allowed indication of both the general data or program type and its physical location. The DOS limitation of 8-character names restricted more meaningful file names.

Files pertaining to Mayo Clinic and to Scotland were stored in separate sub-directories with a similar naming convention. Files for both studies were nearly identical in structure and variables included, and all files contained a variable to distinguish sites.

In addition, program filenames were linked to output through the use of titles and footnotes, so that programs were easily located. An index of program filenames linked with program purpose was maintained by all personnel involved in data processing. These indices were stored on-line in subdirectories specific to each analyst. The documentation block of each program provided additional information.

Back-up procedures

Data were periodically backed up on 3-1/2" diskettes at UNC. Once a clean file was reached at UNC, backups of the SAS datasets were made each time any change was made, and those backups were kept along with the previous backup. Thus, the current and the previous versions of the UNC SAS files were always available. During screening procedures in Scotland, backups were made at least once a week, or more often if numerous changes were being made. Backups of Mayo Clinic data were made as data were received, with retention of both the previous and current backup versions. Programs were backed up on a reasonable basis, but at least once a month. Independent backup procedures were implemented on a regular schedule by personnel at each primary data source (Mayo Clinic, Stirling).

VI. Efficiency Considerations in Micro-Computer Environment

All data processing at UNC was performed on IBM PS/2 Model 70 (80386) personal computers (PCs) with math co-processors. Two PCs had 120 MB hard disks while one had only a 60 MB hard disk. Third-party software was not used to overcome the DOS 3.3 limitation of 33 MB hard disk partitions, although this placed constraints on the maximum amount of work space available to SAS.

Various techniques for improving SAS efficiency have been discussed extensively (Howard, Horwitz, Birkett, Davis), both in terms of software and hardware configuration. Efficiency in data processing was not only desirable in this project, but necessary due to the substantial amount of data collected in these large studies. Original data files received from Mayo Clinic for only the baseline data collection took up about 7.5 MB of storage space, and individual files ranged from 0.5 to 5.2 MB.
Much of the data collected consisted of categorical responses to questionnaire items, and the maximum possible response to individual questions was generally less than 6. Storing such responses as character variables with a length of 1 byte, instead of as numeric variables at the default length of 8, greatly reduced the size of the files, conserving disk space. The LENGTH statement could have been used with numeric variables as an alternative, but the minimum length for storing numeric variables without loss of precision is 3 bytes, so this approach would have been less efficient considering the large number of variables where only low integer values were relevant.

The use of SAS permanent datasets in analysis-ready format greatly reduced the input/output (I/O) intensive processing that would otherwise have been necessary. Storing data as character instead of numeric where possible was one example of these attempts at space efficiency. Also, LENGTH, LABEL and FORMAT statements were used when saving the SAS permanent datasets, and LENGTH statements for newly created variables, as well as KEEP or DROP statements, were used when processing data.
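A minimal sketch of this storage trimming (dataset and variable names are hypothetical, and the step assumes the items arrive as default 8-byte numerics): each small code is converted to a 1-byte character value and only the variables needed downstream are kept:

   data comb.quest;
      set work.quest(keep=id site nq1-nq30);
      length q1-q30 $ 1;
      array nq{30} nq1-nq30;
      array q{30}  $ q1-q30;
      do i = 1 to 30;
         if nq{i} ne . then q{i} = put(nq{i}, 1.);
      end;
      drop i nq1-nq30;
   run;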
The use of SAS variable labels greatly aided documentation efforts, at the added cost of increasing the space required to store the SAS dataset. This practice was well worth the additional storage cost, however, due to the numerous variables involved and the potential differences between sites.

Formats stored in PC SAS format libraries were permanently associated with variables in the SAS datasets. These permanent format libraries were shared among sites, and helped reduce the chance of mistranslating codes when processing the data. Permanent libraries and labels helped provide more readable reports and standardized textual translation of codes and variable content among analysts. Formats were stored in a single library, and only one analyst was needed to develop the code to create the library and labels. This reduced programming time as well as lines of code, and eliminated redundant creation of temporary formats in every program.
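A minimal sketch of a shared permanent format library (the path, format names and codes are hypothetical):

   libname library 'c:\bph\formats';

   proc format library=library;
      value  yesno  1='Yes'  2='No';
      value  $site  'M'='Mayo Clinic'  'S'='Stirling';
   run;

   data comb.homeq;
      set comb.homeq;
      format q01 q02 yesno. site $site.;
   run;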
Macro processing was also used to enhance efficiency in processing data received from both Mayo Clinic and Stirling, Scotland. Conversion programs to incorporate data into the common database structure were executed each time data were received. The use of the SAS macro facility improved flexibility and allowed the same programs to be used for both studies through passed parameters. Macros can increase code portability and flexibility, but can be much more difficult to develop, test and debug once invoked. This concern increases with the complexity of the macro, and macros can quickly become complex, with only a few lines of macro code producing a massive amount of SAS executable code.

Macros were used in three distinct ways: 1) to encapsulate critical SAS code and allow its repeated use in various circumstances, 2) to pass critical parameters to program code in a manner similar to subroutines of lower generation languages, and 3) to standardize code used for the Mayo and Scotland studies through conditional branching in execution. Some macros used the file naming convention described earlier to allow branching or conditional execution of code pertaining to certain components in a study.
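A minimal sketch of the parameter-passing approach (macro, library and variable names are hypothetical, as is the scale conversion shown): one conversion macro serves both sites, with conditional code executed only for the Scotland data:

   %macro convert(site=, lib=);
      data comb.hq&site;
         set &lib..homeq;
         %if &site = scot %then %do;
            /* reverse the 1-4 scale on the Scotland bother item */
            if bother in (1,2,3,4) then bother = 5 - bother;
         %end;
         length src $ 4;
         src = "&site";
      run;
   %mend convert;

   %convert(site=mayo, lib=mayo)
   %convert(site=scot, lib=scot)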
The balance between PC and mainframe processing was periodically evaluated, but movement to a mainframe environment was not deemed necessary during processing of data from the baseline visit. In the future, as follow-up information is received, it may be necessary to move to a mainframe or UNIX environment, but retaining control in the microcomputer environment is much preferred due to accessibility and cost savings. At a minimum, program development could be implemented on the PC and uploaded to the mainframe for execution, if such a move is warranted.

During program development, test runs were executed using the OBS= option to process 0 or 50 observations.
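A minimal sketch of such a test run (the program name is hypothetical):

   options obs=50;                      /* develop against 50 observations */
   %include 'c:\bph\prog\convert.sas';
   options obs=max;                     /* restore full processing         */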
The PC SAS environment was customized through the use of CONFIG.SAS and AUTOEXEC.SAS files stored in the appropriate project directory. LIBNAME statements for the Olmsted County and Scotland data files, as well as for the permanent format library, were included in the AUTOEXEC.SAS, which is executed automatically when SAS is invoked. %INCLUDE statements in the AUTOEXEC.SAS file ensured that some general pre-programmed macros were available for execution during any SAS session. Also included were standard titles, options for hardware (e.g., pagesize=59), and display manager defaults. Such an approach customized the environment and avoided the redundancy of including such statements in every program. Through CONFIG.SAS, the SASWORK and SASUSER directories were re-routed from the default directory to the disk partition with the most available storage space. A change to the size of the file buffers increased the amount of data available in RAM (an optimal setup was quoted as 5 buffers of 1024 bytes by Birkett). A 1 MB disk cache allowed faster access by placing data in memory in anticipation of its use. The remaining expanded memory was made accessible to SAS through use of the EMS ALL option. SAS 6.04 used a maximum of 2 MB, with the first 640K for executing procedures or DATA steps and the additional 1 MB of expanded memory used to queue additional code. SAS display manager windows were redesigned for optimal performance, and function keys were redefined by each user.
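A minimal sketch of the two start-up files (all paths and names are hypothetical, and the CONFIG.SAS option spellings should be checked against the installed release):

   /* AUTOEXEC.SAS */
   libname mayo    'c:\bph\mayo';
   libname scot    'c:\bph\scot';
   libname comb    'c:\bph\comb';
   libname library 'c:\bph\formats';
   options pagesize=59 nodate;
   title1 'BPH Natural History Study';
   %include 'c:\bph\macros\convert.sas';

CONFIG.SAS might then re-route work space and set the buffer options along these lines:

   -work d:\saswork
   -sasuser d:\sasuser
   -bufno 5
   -bufsize 1024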
Special consideration was also given to efficient programming logic. Any necessary sorting of the data was generally done at the beginning of a program series, and re-sorting was avoided; reducing the number of times the data were sorted improved performance. The TAGSORT option of PROC SORT reduced the work disk space required by sorting only the key fields; sorting the larger files might have been impractical in our microcomputing environment, with its limited disk space, without this option. The SORTSIZE option allowed sorting files with larger record lengths by increasing the available memory. Temporary datasets were deleted using the DATASETS procedure immediately after their use to free work disk space, and PROC APPEND was used to quickly transfer one initially large file (5.2 MB) from diskette.
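A minimal sketch of these housekeeping steps (dataset names are hypothetical):

   proc sort data=comb.homeq out=comb.homeq tagsort;
      by site id;
   run;

   /* append a newly received shipment to the existing file */
   proc append base=comb.uroflow data=work.newflow;
   run;

   /* free work space as soon as intermediate files are no longer needed */
   proc datasets library=work nolist;
      delete temp1 temp2;
   quit;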
Batch processing was used as much as possible, freeing
nearly 180 kB of conventional memory compared to SAS
display manager mode. Batch processing has a slight
performance advantage over the display manager mode
as well. Lack of work space was another source of
difficulty since disk space can be depleted relatively
quickly in the PC environment. Only critical files and
software were stored on the PC with less space.
Additional areas of concern exist in the microcomputer
environment as compared to the mainframe or
networked processing environment. With different users
on physically separate machines, ensuring that all SAS
data files and format libraries are exactly identical, and
that such files are not modified locally can be
challenging, particularly in the absence of a network.
Development of electronic routines for such checking
would have been costly. An efficient system of communication can help achieve this assurance, and is essential regardless, particularly in an environment where the staff may have a wide range of working hours and separate physical locations.
VII. Advantages and Disadvantages in Using SAS
The use of SAS in processing these studies was
particularly advantageous due to the complexity and the
changes that occurred during data collection. Using
SAS for both storage and programming, such changes
and complexity were easily incorporated into the data
processing system. The power of SAS in easily sorting,
manipulating and combining datasets must be weighed
against the speed of processing in the microcomputer
environment, however. SAS may not be as efficient as
other software for large scale permanent systems. The
advantage of decreased system and program development time could be offset by increased processing time in some situations. However, in our case, once analysis files were created, little further data processing was necessary, and efficient programming offset some of the inefficiencies associated with SAS.
SAS can be inefficient in processing large datasets, both in terms of work (disk) space and memory. The default sorting routine requires disk space equivalent to twice the size of the file due to the sort-in-place algorithm used, and SAS does not use expanded memory well. SAS/GRAPH® can be difficult to use when large files or complex graphs are involved, due to insufficient memory, even when executing in batch mode. Excessive processing time in executing certain procedures can be frustrating without some type of multi-tasking capability.

A further disadvantage in using SAS instead of a "true" database management system is the lack of standard database tools and capabilities, such as data dictionaries and audit trails. Although query and retrieval capabilities are available through SAS/ACCESS, other features are lacking, and developing such procedures can be costly.

A foreseeable problem in the use of SAS is the impact of version changes. Since SAS is an interpreted language, version changes in which procedures are no longer supported could result in lost capabilities and the need to reprogram parts of the system. Storing compiled SAS code would reduce the impact of changing versions, but was not practical in our case.

VIII. Present and Future Changes to Data Structures and Procedures

In processing data from the baseline visit for this study, many possibilities for improvement were recognized. Through the expertise of systems analysts and medical information resources at Mayo Clinic, major changes to the Olmsted County database design have been made, and conversion of baseline data to the new data structure is in progress. Basically, the new design eliminates redundant storage of similar information, normalizes all data structures, and simulates a more relational database structure. Smaller files containing similar information regardless of source (in-home vs. in-clinic data collection) will be used in storing data, and should increase the efficiency of storage as well as processing. With long-term follow-up visits planned, such an approach was mandatory.

In addition, an on-line SAS data dictionary is being developed by Mayo Clinic. Information pertaining to each individual variable will be available through this facility. Details of variable information such as length, type, label and format will be stored, in addition to a brief textual description and history of the variable. The data dictionary will also be used to document "key" variables, which may be used to "join" tables of information. Such on-line documentation, available both at Mayo Clinic and in North Carolina, will be valuable during the data entry, processing and summarization stages.

The use of database management tools such as SQL will aid in efficient data retrieval for statistical analysis. Since the new data structures simulate a relational database, the use of relational database concepts and tools is a simple extension. The advantage of creating and storing analysis views through PROC SQL is that even the novice user can join data without having to know the details necessary for merging data. In addition, the code to "join" data is executed consistently through SQL, avoiding potential errors when several analysts are merging data. Once a view is implemented, analysis files are always available and up-to-date even though data may have been recently updated. Although the SQL code defining the "view" is executed at run time, such code is more efficient than merging, particularly when several analysts may be implementing similar code.
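A minimal sketch of such a stored view (library, dataset and variable names are hypothetical): the join is defined once, and any analyst can then summarize from the always-current view:

   proc sql;
      create view comb.anal1 as
      select q.id, q.site, q.age, q.sympscor, u.peakflow
         from comb.quest as q, comb.uroflow as u
         where q.id = u.id and q.site = u.site;
   quit;

   proc means data=comb.anal1;
      class site;
      var sympscor peakflow;
   run;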
The above changes, when combined with creating analysis file views, will eliminate essentially all of the data processing previously necessary for incorporating Olmsted County data into final analysis files. In the final combined database structure stored in North Carolina, redundant storage will be eliminated where feasible, and the database structure will mimic that at Mayo Clinic. Data from Scotland will undergo some minor translation of scale discrepancies and will be converted to this same structure. The new data structure is projected to be much more efficient in terms of both the disk space and the memory required for processing.

More standardized macros will be developed in the future to handle the more common tabular and graphical summarizations and analyses. In addition, a menu-driven system for data retrieval, summarization and analysis is being investigated. A system developed in SAS/AF®, or potentially SAS/ASSIST®, would give better data access to non-programmers and non-statisticians. Pre-programmed tabulations and a limited choice of statistically valid analyses may be offered as options in such a system.

IX. Summary

The design of databases for large epidemiologic studies is challenging, particularly where multi-national sites are involved. The authors have found that performing more error resolution and data processing closer to the source of the data promotes accuracy and data integrity. Processing such large-scale studies in a microcomputing environment is feasible through normalized and non-redundant database design, more efficient storage and optimized use of PC SAS.

Acknowledgements

The authors would like to acknowledge the BPH Natural History Study Group, particularly Dr. Harry Guess for his leadership in instigating the project and chairing the above group, and Ms. Rebecca Nelson for her diligent supervision of data collection and entry at Mayo Clinic to ensure high quality data. The Rochester Project was supported by NIH grant AR30582.

SAS, SAS/ACCESS, SAS/AF, SAS/ASSIST, SAS/FSP, SAS/GRAPH, and SAS/MACRO are registered trademarks or trademarks of SAS Institute, Inc. in the USA and other countries. IBM is a registered trademark of International Business Machines, Inc.

References

Barry, Michael (1990). Epidemiology and Natural History of Benign Prostatic Hyperplasia. Urol. Clin. North Am. 17: 495-507.

The Merck Manual (1987). Berkow, Robert (editor-in-chief). Rahway, NJ: Merck & Co. p. 1635-1636.

Guess, H.A. Benign Prostatic Hyperplasia: Antecedents and Natural History. Unpublished manuscript.

Kurland, L.T., Molgaard, C.A. (1981). The Patient Record in Epidemiology. Scientific American 245(4): 54-63.

Epstein, Robert S., Deverka, Patricia A., Chute, Christopher G., Lieber, Michael M., Oesterling, Joseph E., Panser, Laurel, Schwartz, Skai W., Guess, Harry A., Patrick, Donald (1991). Urinary Symptoms and Quality of Life Questions Indicative of Obstructive Benign Prostatic Hyperplasia. Supplement to Urology 38(1): 20-26.

Garraway, W.M., Collins, G.N., Lee, R.J. (1991). High Prevalence of BPH. Lancet 338: 469-471.

Birkett, Thomas (1988). Dealing with Limitations of SAS/STAT® Software on a PC. Proceedings of the Thirteenth SAS Users Group International Conference, Cary, NC: SAS Institute, Inc., 691-694.

Horwitz, Lisa and Thompson, Tim (1988). Efficient Programming with the SAS® System for Personal Computers: Save Time, Save Space. Proceedings of the Thirteenth SAS Users Group International Conference, Cary, NC: SAS Institute, Inc., 625-630.

Howard, Neil (1991). Efficiency Techniques for Improving I/O and Processing Time in the DATA Step. Proceedings of the Sixteenth SAS Users Group International Conference, Cary, NC: SAS Institute, Inc., 284-289.

Davis, Gordon and Sweetland, Scott (1987). Efficiency and Performance Considerations for the SAS® System Under PC DOS. Proceedings of the Twelfth SAS Users Group International Conference, Cary, NC: SAS Institute, Inc., 779-787.
Table 1
Overview of BPH Natural History Community Studies in Men Aged 40-79

                                     Olmsted County            Scotland
Component/Measurement                (Mayo Clinic)             First 2 Centers        Third Health Center

Community visits (in-home component)
  Sample description                 Random sample             Full community         Full community
  Number of participants             n=2119                    n=1627                 n=331
  Measurements: urinary flow rate; questionnaire (symptoms*, quality of life/other*); family history/daily medication

Clinical evaluation
  Sample characteristics             Random sample of          Participants who       Full community
                                     community participants    screened positive
  Number of participants             n=471                     n=492                  n=331
  Measurements: ultrasound; digital rectal examination; symptoms; urinary flow rate; physical exam; urinalysis; bloodwork

* In Scotland, symptoms were solicited via a screening questionnaire completed at the in-home visit; other information was collected in a mail-based questionnaire.
Figure 1. Flow of data for the BPH natural history studies: at Mayo Clinic, data entry with SAS/FSP and range checks; in Scotland, keypunch data entry with subsequent screening and range checks; error resolution, consistency checks and backup to diskette at each site; transfer of PC SAS files to UNC for further screening and consistency checks; and file conversion (conversions, normalization, translation) into combined databases and summary and analysis files (questionnaire, uroflow, daily medication, family history).