Download Data Transfers Across Diverse Platforms

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Expense and cost recovery system (ECRS) wikipedia , lookup

Big data wikipedia , lookup

Data Protection Act, 2012 wikipedia , lookup

Data model wikipedia , lookup

Data center wikipedia , lookup

Forecasting wikipedia , lookup

Data analysis wikipedia , lookup

Information privacy law wikipedia , lookup

Data vault modeling wikipedia , lookup

3D optical data storage wikipedia , lookup

Business intelligence wikipedia , lookup

Transcript
Data Transfers Across Diverse Platforms
Denis Michel, PharmaNet Inc., Princeton, NJ
ABSTRACT
SAS® programs written on one system often have to be
run on a different system. This issue has mandated that
programmers write portable SAS code that can easily be
modified for different platforms. Data and output are also
transferred routinely. In the pharmaceutical industry, we
transfer various file types, including data, programs, and
output, among global affiliates, other companies (e.g.
contract research organizations) and regulatory agencies
(e.g. FDA). The systems and policies in place at the
organizations sending and receiving these files are
diverse.
Informats are not required unless the data are to be
updated after being ported.
User-defined formats are to be avoided as much as
possible in portable data sets. While these formats are
valuable in entry data sets, they may cause problems in
transferred data. By default, SAS issues error messages
and stops processing if it encounters variables with
unknown formats associated with them. In order to read
the data and change the error to a note in the SAS log,
use the following statement.
options nofmterr;
This paper focuses on the issues involved in transferring
files across these diverse platforms from a pharmaceutical
SAS programming perspective. These issues include SAS
transport files, FDA Guidelines for electronic submissions,
converting SAS files to other formats, documentation of
data contents, transfer media, and writing SAS programs
for portability.
If user-defined formats must be stored as variable
attributes, either the SAS format catalog or the SAS PROC
FORMAT code is provided with the data sets. It is
preferable to use the PUT function to assign values to
character variables based on user-defined formats.
INTRODUCTION
•
•
•
Why do SAS programmers routinely transfer files? There
are several benefits of data transfers over sending printed
output to our customers. Electronic files facilitate sharing
of work. Pharmaceutical companies are able to outsource
parts of large projects by transferring data and programs
to and from contract research organizations, laboratories,
and global affiliates.
Electronic data transfers also enhance data review.
Regulatory agencies, such as the FDA, accept electronic
submissions so that the reviewers can analyze data. This
may or may not reduce the review time, but it certainly
eliminates the “dead” time in waiting for the sponsor to
provide paper answers to requests.
The most important reason for electronic transfers is that
our customers have become sophisticated in their use of
software, including MS Excel, MS Access, and SAS/JMP.
They prefer to do their own analysis of data using their
favorite tools.
THE STARTING POINT
The starting point of a data transfer is a SAS data library
containing organized SAS files. The SAS data sets to be
transferred should have attributes that are consistent with
portability. The order of the variables should be in logical
sequence. For example, a data set whose observations
are uniquely identified by patient, visit, and date should
have these three variables first. Each variable has a
descriptive label. SAS date and time variables are
formatted using SAS standard formats (e.g. DATE. or
MMDDYY. for dates, TIME. or HHMM. for times).
In summary, portable SAS data sets have the following
characteristics.
•
•
Variables in logical order
Variables labeled clearly
Date and time variables are numeric with standard
SAS formats
No informat attributes
No user-defined format attributes
Adherence to these standards can easily be verified using
PROC CONTENTS. The POSITION option is useful to
confirm the order of variables.
SAS TRANSPORT FILES
SAS files (data sets and catalogs) need to be copied to
transport format in order to be read by a different operating
system. There are two different methods of creating SAS
transport files, the XPORT engine and the CPORT
procedure.
The XPORT engine was introduced in SAS version 6 and
is used with the COPY procedure. PROC COPY can be
used to copy individual data sets to transport files as
discussed in detail below in the context of FDA guidelines.
The code below copies all data sets in a permanent
SAS data library to a SAS transport file.
consequences of different decimal precision across
platforms.
*** data library reference;
FDA GUIDELINES
libname saslib ‘\\server\folder\data subfolder’;
Clinical programmers in the pharmaceutical industry
routinely transfer data to FDA. Guidance documents are
available to download as PDF files from the web site
www.fda.gov. The Center for Drug Evaluation and
Research (CDER) and Center for Biologics Evaluation and
Research (CBER) co-published the document “Guidance
for Industry: Providing Regulatory Submissions in
Electronic Format - General Considerations” in January,
1999.
*** transport file name;
libname xlib xport ‘\\server\\folder\transport
subfolder\xport.xpt’;
***copy entire data library to transport file;
proc copy in=saslib out=xlib memtype=data;
run;
SAS issues the following informational note when copying
data to the XPORT engine.
NOTE: BUFSIZE is not cloned when copying across
dissimilar engines.
System Option for BUFSIZE was used.
A separate PROC COPY statement should be used to
copy SAS catalogs, such as format libraries. That is, SAS
data sets and SAS catalogs cannot be copied into the
same SAS transport file using the XPORT engine.
PROC CPORT can be used to copy both SAS data sets
and SAS catalogs into a transport file. The syntax is
different from the XPORT engine method as shown by the
code below.
*** data library reference;
libname saslib ‘\\server\folder\data subfolder’;
*** transport file name;
filename tranfile
‘\\server\\folder\transport subfolder\xport.xpt’;
*** Copy the data library to transport format;
proc cport library=saslib file=tranfile;
run;
Note that PROC CPORT uses a FILENAME statement to
define the transport file instead of a LIBNAME statement.
Listed below are important issues concerning SAS
transport files.
•
•
•
The files must be moved using a binary transfer
protocol. An ASCII transfer would corrupt the data.
SAS increases the lengths of all numeric variables
with lengths of less than 8 bytes by 1 byte.
Numeric variables containing decimal values are
subject to losing precision in transport format. The
ROUND function can be used to minimize the
Another important CDER document for clinical
programmers is “Providing Regulatory Submissions in
Electronic Format - NDAs” published in January 1999. The
corresponding CBER document, revised in November
1999, is “Providing Regulatory Submissions to the Center
for Biologics Evaluation and Research (CBER) in
Electronic Format – Biologics Marketing Applications”. The
submission of SAS data sets is discussed in both of these
documents under Item 11 – Case Report Tabulations
(CRTs). CBER has another important document, “CBER
Computer
Assisted
License
Application
(CALA)
Questionnaire”, also available on the web site.
The FDA guidance documents provide the framework for
interactions with the agency reviewers. Some of the
specific guidelines related to SAS data transfers are
summarized below.
•
•
•
•
•
•
•
SAS System XPORT Transport Format (also known
as Version 5 SAS Transport Format)
PROC COPY using XPORT engine
File extension should be .XPT
Do not use PROC CPORT
Do not compress
Single .XPT file for each SAS data set
Generally, files should be maximum 25 MB, with large
data sets divided logically
The last point above concerns large data sets. If the
reviewer requests dividing a large data set, this should be
done in a way that is logical to the reviewer. For example,
if the data set contains laboratory data, a logical division
may be to separate hematology, blood chemistry, and
urinalysis. Additional divisions can be discussed with the
reviewer to maximize the efficiency of the review.
The data provided to FDA should be accompanied by data
definition tables, which document the data contents. The
data definition tables are provided as a single document
named define.pdf. The following columns are contained in
the tables.
•
•
•
•
•
Variable (variable name up to 8 characters)
Label (descriptive label up to 32 characters)
Type (number, character or date)
Codes (description of codes and decodes used)
Comments (further description of variable)
Typically, the data provided contain raw variables
(collected on the case report form, e.g. dose date), and
derived variables (calculated, e.g. duration of treatment).
The comments column would provide the case report form
location for raw variables, or the method of calculation for
derived variables.
The
guidance
documents
provide
media
recommendations. Small submissions (less than 10 MB)
can be provided on 3.5-inch floppy disks. Submissions
less than 3.25 GB can be provided on CD-ROM. Large
submissions should be provided on digital tape.
AUTOMATED TRANSPORT FILE MACRO
The SAS macro below automates the creation of individual
SAS transport files for all data sets contained in a SAS
data library. The macro reads the SAS data set names
from PROC SQL using DICTIONARY.TABLES and then
uses CALL EXECUTE to write the individual transport
files.
%***FDA format - 1 transport file per data set;
%***Specify: libref (SAS data library reference);
%***
xpref (Directory location for xport files);
%macro mcopyfda (libref=, xpref=);
%***Get the data set names;
proc sql;
create table dsets as
select memname
from dictionary.tables
where libname=%upcase("&libref");
quit;
%***Write the transport files;
data _null_;
set dsets;
call execute("libname " || memname || " xport ' "
|| "&xpref.\" || trim(memname)
|| ".XPT'; "
|| "proc copy in=&libref out= "
|| memname || "; select "
|| memname || "; run; ");
run;
%mend mcopyfda;
The macro call below copies all data sets in the SAS data
library rawdata to the specified location on the server.
%mcopyfda(libref=rawdata, xpref=\\server\folder);
DATA CONVERSION
Clinical programmers often convert data between SAS and
other formats. Popular file formats include Excel
workbooks (.xls files) and ASCII files (.txt or comma
separated .csv files).
Excel files are very popular and easy to use, but they can
cause data conversion problems. Values can be lost or
truncated because each cell in a spreadsheet is
independent, so the Excel columns that convert to SAS
variables may contain a mix of numbers, dates, and long
character strings. The number of rows is limited. Excel 5
has a maximum of 16,384 rows. The maximum rows in
Excel 97 are 65,536. So, a SAS data set with 70,000
observations would lose records in a data conversion to
Excel. Comma separated ASCII or database files (.dbf)
are preferred.
Data conversion software includes the SAS Import/Export
Facility Wizard. This wizard has been provided in Base
SAS for Windows since version 6.12. The wizard supports
limited file formats such as .CSV and ASCII. If
SAS/Access to PC File Formats is licensed, additional file
formats are supported including .XLS and .DBF. The
wizard is run interactively, but the code generated can be
recalled in the program window and saved for modification
and submission.
Two software products are widely used in the
pharmaceutical industry for data conversion. DBMS/COPY
from Conceptual Software, Inc. and STAT/Transfer from
Circle Systems, Inc. provide data conversion between
SAS files and many file formats.
DATA DOCUMENTATION
Documentation of SAS data sets can be easily created
using PROC CONTENTS to write the variable name, label,
and type (character or numeric). Date and time variables
can be identified from the format attribute. This information
can be written to a text file with delimiters, and Microsoft
Word can read the text file. Once in Word, the “convert
text to table” tool can be used to generate nicely formatted
tables.
The following SAS code writes the text file using “$” as a
delimiter for data definition tables documenting the
contents of a SAS data library.
***File references;
libname sasdata ‘\\server\sasfolder’;
filename out ‘\\server\folder\define.txt’;
***Type of variables format;
proc format;
value ctype
1='number'
2='character'
3='date'
4='time';
run;
***Read data library contents and write a
***temporary data set;
proc datasets library=sasdata nolist;
contents data=_all_ noprint
out=work.tempcont
(keep=memname varnum name type format label);
quit;
***Sort by variable order within data sets;
***Variables should be in logical order;
proc sort data=tempcont;
by memname varnum;
run;
***Write text file with delimiter;
data _null_;
length ctype $10;
file out;
set tempcont;
by memname varnum;
if type=1 and (index(format,'DATE')>0 or
index(format,'MMDDYY')>0)
then type=3;
else
if type=1 and (index(format,'TIME')>0 or
index(format,'HHMM')>0)
then type=4;
ctype=put(type,ctype10.);
***Header for each data set;
if first.memname
then put
/ @1 memname ' Dataset Variables'
/ @1 ‘Variable $ Label $ Type $ Codes $ Comments';
***Upper case for first character of label;
***other text lower case;
label=upcase(substr(label,1,1)) ||
lowcase(substr(label,2));
***Write the delimited text;
***2 blank columns at end;
***for codes and comments;
put @1 name +1 '$' +1 label +1 '$'
+1 ctype $10. +1 '$ $ ';
run;
This program creates the text file, which is read into Word,
and then the table menu selection for converting text to
table is used to generate the data definition tables.
TRANSFER MEDIA
The transfer media are decided based on the policies of
the parties involved. Email is used for data transfers, but
some corporate policies do not allow data transferred over
the Internet due to security issues. If data are transferred
via email, the files should be compressed (e.g. using
WinZip) to prevent long download times. An alternative
approach to electronic data transfer is the use of a secure
dedicated line to transfer files. This method of data
transfer eliminates the security issues of email transfers
over the Internet, but involves additional cost and
maintenance by the information technology department.
Magnetic media can be used for data transfers. 3.5 inch
disks offer low cost, low capacity, and usually require data
compression. High capacity magnetic disks (Zip and Jaz)
are also available at higher cost. CDs are now common
data transfer media, offering high capacity (650 MB) and
low cost.
PORTABLE SAS PROGRAMS
In addition to the transfer of data, most customers require
that SAS programs are transferred in order to verify results
and modify analyses. It is therefore important to write SAS
programs that are portable to diverse systems.
Fortunately, SAS code is very portable. Statements
defining file locations (FILENAME, LIBNAME) are not
portable and should be avoided in report and analysis
programs. Instead, these operating system specific
statements can be coded in a single file for a project and
the file called with %INCLUDE statements. Alternatively,
these statements can be coded into an AUTOEXEC.SAS
program for each project.
COMMON PREVENTABLE MISTAKES
There are some common mistakes made in data transfers
that can be prevented. One common problem is sending a
virus to a customer, easily prevented using anti-virus
software. Moving SAS transport files across platforms
using an ASCII transfer corrupts the file. SAS transport
files should be moved using a binary transfer.
If sending a customer a compressed file, it is important to
assure that the customer has decompression software. If
not, a self-extracting file should be sent. Another problem
occurs if the method of creating the SAS transport file is
not specified (PROC CPORT or XPORT engine).
The final media should always be validated to assure the
transfer occurred according to specifications.
CONCLUSION
Data transfers continue to be an important part of clinical
data programming in the pharmaceutical industry.
Understanding SAS transport files, FDA guidelines, data
conversion tools, and data documentation are necessary to
be an effective SAS programmer in the current environment.
There are many opportunities for error in the data transfer
process. It is important to document and follow standard
procedures to avoid problems. Manual processes should be
automated to the extent possible, such as using PROC
CONTENTS instead of manually typing data definition
tables.
REFERENCES
Michel, Denis, “Portable Data Processing”, Proceedings of
the Tenth Annual NorthEast SAS Users Group
Conference, 1997.
U.S. Department of Health and Human Services, Food
and Drug Administration, Center for Drug Evaluation and
Research (CDER) and Center for Biologics Evaluation and
Research (CBER), “Guidance for Industry: Providing
Regulatory Submissions in Electronic Format General Considerations”, January 1999.
U.S. Department of Health and Human Services, Food
and Drug Administration, Center for Drug Evaluation and
Research (CDER), “Guidance for Industry: Providing
Regulatory Submissions in Electronic Format - NDAs”,
January 1999.
U.S. Department of Health and Human Services, Food
and Drug Administration, Center for Biologics Evaluation
and Research (CBER), “Guidance for Industry: Providing
Regulatory Submissions to the Center for Biologics
Evaluation and Research (CBER) in Electronic Format Biologics Marketing Applications”, November 1999,
Revised.
SAS Institute Inc., SAS® Companion for the Microsoft
Windows Environment, Version 6, First Edition, Cary, NC:
SAS Institute Inc., 1993.
SAS Institute Inc., SAS/ACCESS® Interface to PC File
Formats: Usage and Reference, Version 6, First Edition,
Cary, NC: SAS Institute Inc., 1993.
SAS is a registered trademark or trademark of SAS
Institute Inc. in the USA and other countries. ® indicates
USA registration.
Other brand and product names are registered trademarks
of their respective companies.
CONTACT
Denis Michel
PharmaNet, Inc.
504 Carnegie Center
Princeton, NJ 08540
Email: [email protected]