Some Considerations in Designing SAS®-Based Data Management Systems for Large International Epidemiologic Studies

C. Johnson: Merck Research Laboratories, Epidemiology
M. Arrighi: University of North Carolina, Department of Epidemiology
P. Van Grevenhof: Mayo Clinic, Information Services, Statistical Systems
R. Mueller: Mayo Clinic, Information Services, Statistical Systems
L. Panser: Mayo Clinic, Health Sciences Research, Clinical Epidemiology Section
R. Lee: University of Edinburgh, Department of Public Health Sciences, Scotland
C. Chute: Mayo Clinic, Medical Information Resources

I. Abstract

Special considerations apply when processing large databases in a PC SAS® environment. A SAS®-based data management system was developed to handle data stemming from a large international epidemiologic study to investigate the natural history of benign prostatic hyperplasia. Over 4000 men aged 40-79 are participating from communities in Olmsted County, MN and the Stirling district of Scotland. Follow-up visits are planned. An eventual goal of combining data from these transatlantic sites focused initial efforts on comparability of databases, resulting in the exclusive use of SAS® for both data storage and data processing. The massive amount of data collected via self-administered questionnaires, urinary flow rate devices and clinical evaluation, along with the desire to retain processing on microcomputers, forced considerations of efficiency in designing databases. Procedures for data flow, including data transfer, backup and error correction, were developed along with approaches for developing summary analysis files.

This paper will summarize the procedures developed to process data from such a large study where diverse data entry procedures at each site were used. The design of comparable databases and data flow structure will be reviewed, with focus on approaches to increase efficiency and handle difficulties encountered in the PC SAS environment.

II. Introduction

Benign prostatic hyperplasia (BPH) is a common condition in older men in the U.S. and worldwide (Barry 1990). BPH has been defined as enlargement of the prostate, causing varying degrees of bladder outlet obstruction (Merck Manual; Guess), with associated urinary symptoms.

Community-based studies were initiated in Olmsted County, MN and the Stirling district of Scotland in late 1989 and early 1990 to study the natural history of BPH in men aged 40-79. Baseline data collected for both the U.S. and Scotland studies consisted of urinary flow rates (ml/second) and questionnaire information collected from eligible men in the community (in-home component), and clinical evaluation for a subset of participants (clinic cohort).

The objective of both investigations was to study the natural history of BPH in terms of symptomatology, objective urinary flow capacity, and prostate growth as men age. The studies were designed for long-term follow-up with information collected approximately every two years. Follow-up consists of mail-based questionnaires and repeat urologic evaluation of the clinic cohort. Since it was known that comparisons of the U.S. and Scotland studies were of interest, combined or at least comparable databases were necessary to allow summarization and statistical analyses by a statistician at Merck Research Laboratories. Such considerations were important in designing the database and developing data processing procedures.

This paper will focus on the practical strategies used in designing a common database as well as procedures for transferring and processing data from these large studies in a microcomputing environment. The review will be more practical than technical in focus, and will conclude with a discussion of planned improvements based on experience with baseline data collection procedures.

III. Design of Studies

Olmsted County Community Study

Identification of more than 96% of eligible men was possible from records of three major medical facilities through the Rochester Project (Kurland), allowing stratified random sampling of the Olmsted County population based on age and residency. Residents without conditions interfering with assessment of BPH were invited by letter to participate in the study. Participants were visited in their home, where urinary flow rate was measured, a validated questionnaire (Epstein) was administered, and additional information pertaining to family history of urologic disease and daily medication use was collected. A total of 2119 men (55% of eligible men) completed the baseline community aspect of this study.

A random subset of men participating in the above in-home component was identified for more thorough clinical evaluation (in-clinic component), including the examinations shown in Table 1.

Stirling District of Scotland Study

In the Stirling district of Scotland, subjects comprised all men between the ages of 40 and 79 drawn from the age-sex registers of three group general medical practices. Subjects identified as eligible were seen at their health center or visited in their home, and urinary flow rate was measured. Symptomatology was solicited using a short screening questionnaire at the time of the in-home visit, and a separate questionnaire soliciting quality of life and health-care seeking behavior information was left with the participant to be returned by mail. An estimated 3200 men were eligible to participate in the community aspect of the Scotland study.

In two of the Stirling health centers, men who screened positive for BPH according to predefined criteria were brought into the clinic for urological examination (Garraway), while all men from the third practice were scheduled for clinical evaluation. Clinical information collected was similar to that collected by Mayo Clinic in the in-clinic component, as shown in Table 1.

IV. Designing Common Data Structures

The enumeration of the population and random selection of eligible men in Olmsted County was accomplished at Mayo Clinic. A comprehensive system for generating appropriate invitation and reminder letters and for tracking the status of participants in terms of telephone calls, appointments and receipt of mail-based information was developed and implemented by Information Services, Statistical Systems and the Department of Health Sciences Research at Mayo Clinic. Since this paper will focus on the development of analysis files and a combined database, these elaborate procedures will not be reviewed. Description of corresponding procedures in Scotland will also be omitted. Instead, we describe procedures for designing a common database, individual site data entry, data transfer and the further processing necessary for building a combined database and analysis files used by a Merck statistician at the University of North Carolina (UNC).

At the time of designing a common database, the primary statisticians had little experience with SAS/ACCESS® and limited experience with microcomputer-based databases. The hardware environments included VMS and a DOS-based microcomputer (80386) in Scotland, an IBM® 3090/600J mainframe at Mayo Clinic, and DOS-based microcomputers (80386) in North Carolina. Since data entry was being accomplished at Mayo Clinic using SAS/FSP®, and the statistician in Scotland planned to store data in SAS, the decision was made to design common SAS datasets to store data for these studies. Despite diverse data entry procedures at the two sites and a variety of platforms, SAS provided a commonality among all sites. In addition, SAS was being used for the majority of the statistical analyses and summarizations.

The portability and wide knowledge base of SAS, and the availability of SAS on DOS-based personal computers at all sites, further supported the use of PC SAS datasets. Transferring data to another database accessible by SAS at a later time was always considered an alternative.

Data collected in Olmsted County and the Stirling district were comparable with the exception of certain design characteristics. Table 1 displays the general data components collected by the two studies, and clearly indicates comparability of the components of data collection. However, when specific data elements were examined, three levels of comparability were apparent: a) identical information collected in exactly the same way, b) similar but not identical information, and c) information collected using different scales or methodology and considered incomparable. In order to design a common database, levels of comparability were reviewed, resulting in a detailed listing of data elements including the variable name, variable type and length, scale or format, and field content for each site. Although tedious, such enumeration allowed discrepancies to be identified and helped ensure that comparable information was stored in an identical format (same variable name, type and format) for later combined analyses.

Likewise, protection against inadvertent combination of unlike data elements was accomplished by ensuring that incomparable elements were stored as separate variables. For example, some questions in the Olmsted County and Stirling district questionnaires had scaled responses which were reversed, or not identical. Data were initially stored in the way they were collected to avoid problems with data entry and updating, but as separate variables for Mayo Clinic and for Stirling. A conversion (of the Scotland questions) took place before data were combined in a common data file.
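The conversion of a reversed scale before combining data can be sketched as follows. This is only an illustration under stated assumptions: the libref SCOT, the dataset names QUEST and SCOTCONV, the variable names Q5S and Q5, and a numeric 1-to-5 response scale are all hypothetical and are not taken from the study.

```sas
/* Hypothetical sketch: reverse a 1-5 item collected in Scotland so   */
/* that it runs in the same direction as the Mayo Clinic item.        */
/* SCOT, QUEST, Q5S and Q5 are assumed names, not the study's own.    */
data work.scotconv;
   set scot.quest;
   if q5s ne . then q5 = 6 - q5s;    /* 1<->5, 2<->4, 3 unchanged     */
   label q5s = 'Q5 Scotland original (scale reversed)'
         q5  = 'Q5 converted to common scale';
run;
```

Keeping the original variable (Q5S) next to the converted one (Q5) mirrors the practice of storing data as collected and converting only when the common file is built.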
SAS variable labels were used to store warnings that scales are reversed at individual sites. These labels were permanently stored with the SAS dataset, and are listed on any PROC CONTENTS output generated.

Use of SAS provided a commonality among sites, was easily implemented and allowed simple access for data retrieval, summarization and analysis. A tradeoff was a lack of certain built-in database tools common in other database management system software, and certain inefficiencies particular to SAS storage and processing.

A. Review of Flow of Data

A flowchart depicting data as received from both sites is shown in Figure 1. Errors identified after data entry through range and consistency checking were resolved at the respective site. Data were then downloaded if necessary, and diskettes were sent to UNC for further processing, ad hoc data checking and eventual incorporation into a combined database. A brief review of the varying procedures at each site is warranted before discussion of that processing may proceed.

1. Data Entry & Error Resolution at Mayo Clinic

The database at Mayo Clinic was initially designed under stringent time constraints, and has been subsequently modified. Our review will describe the initial database and its impact on subsequent procedures, although improvements in database design will be reviewed in a later section.

Data for the Olmsted County study were entered at Mayo Clinic using SAS/FSP on an IBM 3090/600J mainframe (Figure 1), with range checks at data entry. A random sample was identified for double key entry, but error rates were too low to warrant double entry on an ongoing basis. Consistency checks were initially designed by statisticians and programmers in conjunction with Mayo Clinic's project coordinator and data entry supervisor to flag questionable but not necessarily impossible data values for additional verification. These consistency checks were implemented at Mayo Clinic after data entry to allow resolution closer to the data source. A Boolean flag stored in each file indicated whether such questionable data for a participant had been verified or corrected.

At Mayo Clinic, data were entered using SAS/FSP into five separate files pertaining to distinct components of data collected during the study. Each file was generally structured as one observation or record per participant, although multiple records were applicable for some subjects in two of the files. One Mayo Clinic file contained information pertaining to all enumerated men (more than 14,000) originally thought to be eligible for the study, but only data for eligible study participants were downloaded for further processing and analysis. A random number was assigned to each potential subject in order to protect the confidentiality of patients, and no confidential patient information was available to personnel outside of Mayo Clinic.

2. Data Entry & Error Resolution in Scotland

In Scotland, data were key-entered by a data preparation firm into separate flat files with 100% key verification. No range or consistency checking was done at data entry. Data were transferred from diskette to a microcomputer hard disk for storage. Five PC SAS data files were created from these ASCII files to store information from both the in-home and in-clinic components. The content of these five files differed slightly from the five files created at Mayo Clinic. Range and consistency checks comparable to those used by Mayo Clinic were then performed, and errors were resolved. Once data were considered clean, the SAS files were sent to UNC on diskette.

B. Procedures to Build Combined Database

Data were exchanged to test procedures as early in the project as possible to help identify and resolve problems. Each data shipment included the entire database (all complete files) to help ensure that the databases at UNC agreed with those at the data source (Mayo Clinic, Stirling), and to avoid potential errors with partial updating. Data were received from both Scotland and Mayo Clinic in the form of PC SAS datasets on diskettes.

Some additional safeguards against errors were used. For example, a file consisting of all eligible identification numbers was generated, and all files were checked against this file to ensure that ineligible men were not included in any data files.

After data were received and further screened for validity, data conversions were necessary to handle multiple questionnaire versions used at Mayo Clinic and some scale differences between the Mayo Clinic and Scotland questionnaires. Original data files received from Mayo Clinic for baseline data collection alone took up about 7.5 MB of storage space, with individual files ranging from 0.5 to 5.2 MB. Data from several files received from Mayo Clinic were normalized by restructuring each single observation per subject into multiple observations per subject. This reduced the size of the files dramatically, particularly for the file containing family history and daily medication, where the original file used arrays allowing storage of information for up to 10 brothers and up to 15 medications. Most of the original file consisted of missing data and was more efficiently stored in a normalized structure, using only 21% of the original space required.

Analysis data files were created after the necessary conversions and normalization were performed. Such files were needed to avoid error in summarization and analysis of data from these studies. Additional processing before creating such files included the calculation of age and numerous summary score indices, as well as selection of the relevant data item for analysis (questionnaire or uroflow) where multiple data items were available for a single participant.
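The normalization described above, from one wide record per participant to one record per non-missing item, might look like the following sketch. The libref MAYO, the dataset names FAMHX and MEDLONG, and the variables ID, MED1-MED15, MEDSEQ and MED are hypothetical; the medication variables are assumed to be character.

```sas
/* Hypothetical sketch: restructure one record per participant        */
/* (MED1-MED15) into one record per participant per medication.       */
/* All dataset and variable names are assumed for illustration.       */
data work.medlong (keep=id medseq med);
   set mayo.famhx;
   array meds{15} $ med1-med15;      /* assumed character variables   */
   length med $ 8;
   do medseq = 1 to 15;
      if meds{medseq} ne ' ' then do;   /* skip empty array slots     */
         med = meds{medseq};
         output;                     /* one observation per medication */
      end;
   end;
run;
```

Because empty array slots are never written, a participant with two medications contributes two short observations rather than one observation carrying thirteen missing fields, which is the source of the space savings.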
V.

As with any study, the importance of documentation cannot be overemphasized in large studies involving multi-national sites. Processing data would be impractical without such documentation, and the potential for new personnel joining a project in progress further demands attention to details. Good communication between sites is essential to achieve comparability.

Efficient and structured use of sub-directories as well as file naming conventions for data and program files were necessary due to the number of personnel involved in the project. Program and data files were organized under separate subdirectories that were linked to the first six components of their name. Along with good program documentation, this approach aided in locating programs and data when necessary. The source of data (Mayo Clinic, Scotland, Combined data), along with the type of data component, was also indicated by the name of the dataset through a filename coding technique. Such a convention allowed indication of both the general data or program type and its physical location. The DOS limitation of only 8 character names restricted more meaningful file names. In addition, program filenames were linked to output through the use of titles and footnotes, so that programs were easily located. An index of program filenames linked with program purpose was maintained by all personnel involved in data processing. These indices were stored on-line in subdirectories specific to each analyst. The documentation block of each program provided additional information.

Files pertaining to Mayo Clinic and to Scotland were stored in separate sub-directories with a similar naming convention. Files for both studies were nearly identical in structure and variables included, and all files contained a variable to distinguish sites.

The use of SAS labels greatly aided in documentation efforts, with the added cost of increasing the space required for storage of the SAS dataset. This practice was well worth the additional storage cost, however, due to the numerous variables involved and potential differences between sites. Formats stored in PC SAS format libraries were permanently associated with variables in the SAS datasets. These permanent format libraries were shared among sites, and helped reduce the chance of mistranslating codes when processing the data. Permanent libraries and labels helped provide more readable reports and standardized textual translation of codes and variable content among analysts. Formats were stored in a single library, and only one analyst was needed to develop code to create the library and labels. This reduced programming time as well as lines of code, and eliminated redundant creation of temporary formats in every program.

Back-up procedures

Data were periodically backed up on 3-1/2" diskettes at UNC. Once a clean file was reached at UNC, backups of the SAS datasets were made each time any change was made, and those backups were kept along with the previous backup. Thus, the current and the previous versions of UNC SAS files were always available. During screening procedures in Scotland, backups were made at least once a week, or more often if numerous changes were being made. Backups of Mayo Clinic data were made as data were received, with retention of both the previous and current backup versions. Programs were backed up on a reasonable basis, but at least once a month. Independent backup procedures were implemented on a regular schedule by personnel at each primary data source (Mayo Clinic, Stirling).

VI. Efficiency Considerations in Micro-Computer Environment

All data processing at UNC was performed on IBM PS/2 Model 70 (80386) personal computers (PCs) with math co-processors. Two PCs had 120 MB hard disks while one had only a 60 MB hard disk. Third party software was not used to overcome the DOS 3.3 limitation of 33 MB partitions of hard disks, although this limitation placed constraints on the maximum amount of work space available to SAS.

Various techniques for improving SAS efficiency have been discussed extensively (Howard, Horwitz, Birkett, Davis) both in terms of software and hardware configuration. Efficiency in data processing was not only desirable in this project, but necessary due to the substantial amount of data collected in these large studies.

Much of the data collected consisted of categorical responses to items in the questionnaire, and the maximum possible response to individual questions was generally less than 6. Storing such responses as character with a length of 1 byte instead of numeric at a default length of 8 greatly reduced the size of the files, conserving disk space. The LENGTH statement could have been used with numeric variables as an alternative, but the minimum length for storing numeric variables without loss of precision is 3. This approach would have been less efficient considering the large number of variables where low integer values were relevant.

Temporary datasets were deleted using the DATASETS procedure immediately after their use to free work disk space. PROC APPEND was used to quickly transfer one initially large file (5.2 MB) from diskette.

Macro processing was also used to enhance efficiency in processing data received from both Mayo Clinic and Stirling, Scotland. Conversion programs to incorporate data into the common database structure were executed each time data were received. The use of SAS macros improved flexibility and allowed the use of the same programs for both studies through passed parameters. Macros can increase code portability and flexibility, but can be much more difficult to develop, test and debug once invoked. This concern increases with complication of the macro, and macros can quickly become complex, with only a few lines of macro code producing a massive amount of SAS executable code.
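A parameterized macro that serves both studies might be sketched as below. This is not the study's code: the macro name CONVERT, the parameters SITE and FILE, the librefs MAYO, SCOT and COMB, and the variables Q5S and Q5 are hypothetical, and the librefs are assumed to be assigned elsewhere and to match the site codes.

```sas
/* Hypothetical sketch: one conversion macro used for both sites.     */
/* Librefs MAYO, SCOT and COMB are assumed to exist already.          */
%macro convert(site=, file=);
   data work.&site;                  /* WORK.MAYO or WORK.SCOT        */
      length site $ 4;
      set &site..&file;              /* libref matches the site code  */
      site = "&site";                /* variable distinguishing sites */
      %if &site = SCOT %then %do;
         /* Scotland-only conversion of a reversed 1-5 scale */
         if q5s ne . then q5 = 6 - q5s;
      %end;
   run;
%mend convert;

%convert(site=MAYO, file=quest)
%convert(site=SCOT, file=quest)

/* Combine the two converted files into the common structure */
data comb.quest;
   set work.mayo work.scot;
run;
```

The %IF block generates the extra assignment only for the Scotland call, so both sites share one program while the site-specific code stays visible in a single place.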
Macros were used in three distinct ways: 1) to encapsulate critical SAS code and allow its repeated use in various circumstances, 2) to pass critical parameters to program code in a manner similar to subroutines of lower generation languages, and 3) to standardize code used for the Mayo and Scotland studies through conditional branching in execution. Some macros used the project's file naming convention to allow branching or conditional execution of code pertaining to certain components in a study.

The PC SAS environment was customized through use of CONFIG.SAS and AUTOEXEC.SAS stored in the appropriate project directory. LIBNAME statements for Olmsted County and Scotland data files as well as for the permanent format library were included in the AUTOEXEC.SAS, which is automatically executed when SAS is invoked. %INCLUDE statements in the AUTOEXEC.SAS file ensured that some general pre-programmed macros were available for execution during any SAS session. Also included were standard titles, options for hardware (e.g., pagesize=59), and display manager defaults. Such an approach customized the environment and avoided the redundancy of including such statements in every program.

Through CONFIG.SAS, the SASWORK and SASUSER directories were re-routed from the default directory to the disk partition with the most available storage space. A change to the size of file buffers increased the amount of data available in RAM (an optimal setup was quoted as 5 buffers of 1024 bytes by Birkett). A 1 MB disk cache allowed faster access by placing data in memory in anticipation of its use. The remaining expanded memory was made accessible to SAS through use of the EMS ALL option. SAS 6.04 used a maximum of 2 MB, with the first 640K for executing procedures or data steps, and the additional 1 MB of expanded memory to queue additional code. SAS display manager windows were redesigned for optimal performance, and function keys were redefined by each user.

The balance between PC and mainframe processing was periodically evaluated, but movement to a mainframe environment was not deemed necessary during processing of data from the baseline visit. In the future, as follow-up information is received, it may be necessary to move to a mainframe or UNIX environment, but retaining control in the microcomputer is much preferred due to accessibility and cost-savings. At a minimum, program development could be implemented on the PC and uploaded to the mainframe for execution, if such a move is warranted.

The use of SAS permanent datasets in analysis-ready format greatly reduced the input/output (I/O) intensive processing which might have been necessary without such a strategy. Storing data as character instead of numeric where possible was one example of attempts at space efficiency. Also, LENGTH, LABEL and FORMAT statements were used in saving the SAS permanent datasets. SAS LENGTH statements for newly created variables as well as KEEP or DROP statements were used when processing data.

During program development, test runs were executed using the OBS= option for processing 0 or 50 observations. Special consideration was given to efficient programming logic. Any necessary sorting of the data was generally done at the beginning of a program series, and re-sorting was avoided. Reducing the number of times data were sorted improved performance. The use of the TAGSORT option with PROC SORT reduced the work disk space required by only sorting the key fields. Sorting the larger files may have been impractical in our microcomputing environments with limited disk space without the use of this option. The SORTSIZE option allowed sorting files with larger record lengths by increasing available memory.
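Several of the techniques in this section can be combined in one short program, sketched here under stated assumptions. The file path, the dataset names SCREEN and SORTED, and the variables ID and Q1-Q10 are hypothetical; only the options and procedures themselves (OBS=, LENGTH, TAGSORT, PROC DATASETS) come from the text.

```sas
/* Hypothetical sketch combining several space-saving techniques.     */
options obs=50;                      /* test run on 50 observations   */

data work.screen;
   infile 'a:\screen.dat';           /* assumed ASCII file on diskette */
   length id $ 5 q1-q10 $ 1;         /* 1-byte character codes rather */
                                     /* than 8-byte numerics          */
   input id $ 1-5 (q1-q10) ($1.);
run;

proc sort data=work.screen out=work.sorted tagsort;
   by id;                            /* TAGSORT sorts only the key    */
run;

proc datasets library=work nolist;
   delete screen;                    /* free work disk space at once  */
quit;

options obs=max;                     /* restore full processing       */
```

Setting OBS=0 instead of OBS=50 checks the syntax of the whole program without processing any data, which suits a first test run.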
Batch processing was used as much as possible, freeing nearly 180 kB of conventional memory compared to SAS display manager mode. Batch processing has a slight performance advantage over the display manager mode as well.

Lack of work space was another source of difficulty, since disk space can be depleted relatively quickly in the PC environment. Only critical files and software were stored on the PC with less space.

VII. Advantages and Disadvantages in Using SAS

The use of SAS in processing these studies was particularly advantageous due to the complexity and the changes that occurred during data collection. Using SAS for both storage and programming, such changes and complexity were easily incorporated into the data processing system. The power of SAS in easily sorting, manipulating and combining datasets must be weighed against the speed of processing in the microcomputer environment, however. SAS may not be as efficient as other software for large scale permanent systems. The advantage of decreased system and program development time could be outweighed by the increased processing time in some situations. However, in our case, once analysis files were created, little further data processing was necessary, and efficient programming offset some of the inefficiencies associated with SAS.

SAS can be inefficient in processing large datasets, both in terms of work (disk) space and memory. The default sorting routine requires disk space equivalent to twice the size of the file due to the sort-in-place algorithm used. SAS does not use expanded memory well. SAS/GRAPH® can be difficult to implement when large files or complex graphs are involved due to insufficient memory, even if executing in batch mode. Excessive processing time in executing certain procedures can be frustrating without some type of multi-tasking capability.

A foreseeable problem in the use of SAS is the impact of version changes. Since SAS is an interpreted language, version changes where procedures are no longer supported could result in lost capabilities and the need to reprogram parts of the system. Storing compiled SAS code would reduce the impact of changing versions, but was not practical in our case.

A disadvantage in using SAS instead of a "true" database management system is the lack of standard database tools and capabilities, such as data dictionaries and audit trails. Although query and retrieval capabilities are available through SAS/ACCESS, other features are lacking, and developing such procedures can be costly.

Additional areas of concern exist in the microcomputer environment as compared to the mainframe or networked processing environment. With different users on physically separate machines, ensuring that all SAS data files and format libraries are exactly identical, and that such files are not modified locally, can be challenging, particularly in the absence of a network. Development of electronic routines for such checking would have been costly. An efficient system of communication can help achieve this assurance, and is essential regardless, particularly in an environment where the staff may have a wide range of working hours and separate physical locations.

VIII. Present and Future Changes to Data Structures and Procedures

In processing data from the baseline visit for this study, many possibilities for improvement were recognized. Through the expertise of systems analysts and medical information resources at Mayo Clinic, major changes to the Olmsted County database design have been made, and conversion of baseline data to the new data structure is in progress. Basically, the new design eliminates redundant storage of similar information, normalizes all data structures, and simulates a more relational database structure. Smaller files containing similar information regardless of source (in-home vs. in-clinic data collection) will be used in storing data, and should increase efficiency of storage as well as processing. With long-term follow-up visits planned, such an approach was mandatory.

In addition, an on-line SAS data dictionary is being developed by Mayo Clinic. Information pertaining to each individual variable will be available through this facility. Details of variable information such as length, type, label and format will be stored in addition to a brief textual description and history of the variable. The data dictionary will also be used to document "key" variables, which may be used to "join" tables of information. Such on-line documentation, available both at Mayo Clinic and in North Carolina, will be valuable during the data entry, processing and summarization stages.

The use of database management tools such as SQL will aid in efficient data retrieval for statistical analysis. Since the new data structures simulate a relational database, the use of relational database concepts and tools is a simple extension. The advantage of creating and storing analysis views through PROC SQL is that even the novice user can join data without having to know the details necessary for merging data. In addition, the code to "join" data is executed consistently through SQL, avoiding potential errors when several analysts are merging data. Once a view is implemented, analysis files are always available and up-to-date even though data may have been recently updated. Although the SQL code defining the "view" is executed at run-time, such code is more efficient than merging, particularly when several analysts may be implementing similar code.
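A stored analysis view of the kind described might be defined as in the following sketch. The libref COMB, the table names QUEST and UROFLOW, the view name V_ANAL, and the variables ID, SITE, Q1, Q2 and PEAKFLOW are hypothetical, with ID and SITE assumed to be the documented "key" variables.

```sas
/* Hypothetical sketch: store a join once as a view so that analysts  */
/* do not each write their own merge. All names are assumed.          */
proc sql;
   create view comb.v_anal as
      select q.id, q.site, q.q1, q.q2, u.peakflow
      from comb.quest as q, comb.uroflow as u
      where q.id = u.id and q.site = u.site;
quit;

/* The view is read like a dataset; the join runs when it is used,    */
/* so results reflect the current contents of the underlying tables.  */
proc means data=comb.v_anal;
   var peakflow;
run;
```

An analyst who references the view needs to know only its name, not the keys or the sort order that a DATA step merge would require.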
IBM is a registered trademark of International Business Machines, Inc.

The above changes, when combined with creating analysis file views, will eliminate essentially all the data processing previously necessary for incorporating Olmsted County data into final analysis files. In the final combined database structure stored in North Carolina, redundant storage will be eliminated where feasible, and the database structure will mimic that at Mayo Clinic. Data from Scotland will undergo some minor translation of scale discrepancies and be converted to this same structure. The new data structure is projected to be much more efficient in terms of both disk space and memory required for processing.

More standardized macros will be developed in the future to handle the more common tabular or graphical summarizations and analyses. In addition, a menu-driven system for data retrieval, summarization and analysis is being investigated. A system developed in SAS/AF®, or potentially SAS/ASSIST®, would give better data access to non-programmers and non-statisticians. Pre-programmed tabulations and a limited choice of statistically valid analyses may be allowed as options in such a system.

IX. Summary

The design of databases for large epidemiologic studies is challenging, particularly where multi-national sites are involved. The authors have found that more error resolution and data processing closer to the source of the data promotes accuracy and data integrity. Processing such large-scale studies in a microcomputing environment is feasible through normalized and nonredundant database design, more efficient storage and optimized use of PC SAS.

The Rochester Project was supported by NIH grant AR30582.

References

Barry, Michael (1990). Epidemiology and Natural History of Benign Prostatic Hyperplasia. Urol. Clin. North Am. 17: 495-507.

Berkow, Robert (editor-in-chief) (1987). The Merck Manual. Rahway, NJ: Merck & Co. p. 1635-1636.

Birkett, Thomas (1988). Dealing with Limitations of SAS/STAT® Software on a PC. Proceedings of the Thirteenth SAS Users Group International Conference, Cary, NC: SAS® Institute, Inc., 691-694.

Davis, Gordon and Sweetland, Scott (1987). Efficiency and Performance Considerations for the SAS® System Under PC DOS. Proceedings of the Twelfth SAS Users Group International Conference, Cary, NC: SAS® Institute, Inc., 779-787.

Epstein, Robert S., Deverka, Patricia A., Chute, Christopher G., Lieber, Michael M., Oesterling, Joseph E., Panser, Laurel, Schwartz, Skai W., Guess, Harry A., Patrick, Donald (1991). Urinary Symptoms and Quality of Life Questions Indicative of Obstructive Benign Prostatic Hyperplasia. Supplement to Urology 38(1): 20-26.

Garraway, W.M., Collins, G.N., Lee, R.J. (1991). High Prevalence of BPH. Lancet 338: 469-471.

Guess, H.A. Benign Prostatic Hyperplasia: Antecedents and Natural History. Unpublished manuscript.

Horwitz, Lisa and Thompson, Tim (1988). Efficient Programming with the SAS® System for Personal Computers: Save Time, Save Space. Proceedings of the Thirteenth SAS Users Group International Conference, Cary, NC: SAS® Institute, Inc., 625-630.

Howard, Neil (1991). Efficiency Techniques for Improving I/O and Processing Time in the Data Step. Proceedings of the Sixteenth SAS Users Group International Conference, Cary, NC: SAS® Institute, Inc., 284-289.

Kurland, L.T., Molgaard, C.A. (1981). The Patient Record in Epidemiology. Scientific American 245(4): 54-63.
Table 1. Overview of BPH Natural History Community Studies in Men Aged 40-79

Component/Measurement       Olmsted County             Scotland                   Scotland
                            (Mayo Clinic)              (First 2 Centers)          (Third Health Center)

Community Visits (In-home Component)
  Sample description        Random sample              Full community             Full community
  Number of participants    n=2119                     n=1627                     n=331
  Measurements              Urinary flow rate
                            Questionnaire¹: symptoms; quality of life/other; family history/daily medication

Clinical Evaluation
  Sample characteristics    Random sample of           Participants who           Full community
                            community participants     screened positive
  Number of participants    n=471                      n=492                      n=331
  Measurements              Ultrasound; digital rectal examination; symptoms; urinary flow rate;
                            physical exam; urinalysis; bloodwork

[The per-site check marks for individual measurements are not legible in the source.]

¹ Symptoms solicited via screening questionnaire completed at in-home visit; other information in a mail-based questionnaire.

Figure 1. [Data flow diagram. Legible labels: Mayo Clinic (random sample, medical screening; data entry in SAS/FSP with range checks; resolve errors); Scotland (screening, medical history; keypunch data entry; files on diskettes; PC screening, consistency and range checks); backup data on diskette; load data on hard disks; UNC (screening and consistency checks; file conversion, normalization and translation; clean reports); combined databases (uroflow, daily, family history, summary and analysis).]