PREPARING DATA FOR
STATISTICAL ANALYSIS



Data Cleaning
Dataset Preparation
Documentation
9 September 2008
Beverly Musick
Indiana University
Raw Data Cleaning
For data stored in Access, Excel, or text files, data cleaning should
begin with the original table, spreadsheet, or file.

Back up the original data files.

Eliminate blank records and any records used for testing.

Locate duplicate records and resolve.

For numeric variables, identify outliers by sorting and reviewing the overall
minimum and maximum. This is particularly useful for continuous variables
such as dates, ages, weights etc.

For categorical variables such as gender or travel time to clinic, sorting will
reveal invalid response codes or use of mixed case (f, F, m, M for gender).

Sorting also shows how much data is missing. Does it make sense that
x records have no value for variable y?
Raw Data to SAS Datasets
Create a SAS program that converts the database file(s) to permanent SAS
dataset(s).

For Access or Excel files, PROC IMPORT can be used:
PROC IMPORT OUT= WORK.demog
DATATABLE= "tblDEMOG"
DBMS=ACCESS REPLACE;
DATABASE="I:\Projects\Kenya\CFAR\cfar.mdb";
DBPWD="password";
RUN;

For text files, write a specific INPUT statement:
data copd ; infile 'c:\kenya\hiv\copd.txt' ;
input @1 patientid $9. ;  /* further fields follow at their own column positions */
run ;
Raw Data to SAS Datasets (cont.)

Merge or append (concatenate) tables as necessary.

Double-check the merging process by looking at the number of observations in each
dataset before and after the merge.
831 data visit ; set h.hivvisit2(keep=patientid apptdate age weight height bmi cd4) ;
832 if patientid in ('1271BS-1','26277-4','3280CH-4','4709KT-6','625NT-5') ;
833 run ;
NOTE: There were 933654 observations read from the data set H.HIVVISIT2.
NOTE: The data set WORK.VISIT has 71 observations and 7 variables.
843 data vis2 ; set h.hivvisit2(keep=patientid apptdate clinic hgb sao2) ;
844 if patientid in ('13836MT-4','4709KT-6','625NT-5') ;
845 run ;
NOTE: There were 933654 observations read from the data set H.HIVVISIT2.
NOTE: The data set WORK.VIS2 has 46 observations and 5 variables.
846 data bothvis ; merge visit vis2 ;
847 by patientid apptdate ;
848 run ;
NOTE: There were 71 observations read from the data set WORK.VISIT.
NOTE: There were 46 observations read from the data set WORK.VIS2.
NOTE: The data set WORK.BOTHVIS has 83 observations and 10 variables.

The number of records depends on the overlap among the datasets. This relationship
should be known in advance and the expected outcome confirmed.
Raw Data to SAS Datasets (cont.)

Confirm that the total number of variables in the merged
dataset is correct.

The number should be the sum of all variables minus the
(number of key fields * (number of datasets in merge
minus 1)).
In the previous example: 7 + 5 – 2*(2-1) = 10

If the number of variables is less than this, then a non-key
variable with the same name exists in more than one of the
datasets, and its values from one dataset will overwrite
those from another. This should be strictly avoided.
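
One way to find shared variable names before merging is to query the SAS dictionary tables. The sketch below assumes the two WORK datasets from the log excerpt (VISIT and VIS2); only the BY variables should be listed. Any other name that appears is a variable that would be silently overwritten in the merge.

```sas
/* Sketch: list variable names present in both datasets to be merged. */
/* Library and member names in dictionary.columns are upper case.     */
proc sql ;
  select upcase(a.name) as shared_variable
    from dictionary.columns as a,
         dictionary.columns as b
    where a.libname = 'WORK' and a.memname = 'VISIT'
      and b.libname = 'WORK' and b.memname = 'VIS2'
      and upcase(a.name) = upcase(b.name) ;
quit ;
```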
Raw Data to SAS Datasets (cont.)

Investigate messages such as
 "NOTE: MERGE statement has more than one data set with repeats of BY values."
 "Variable _____ is uninitialized"
 "Variable _____ has never been referenced"
 "Character values have been converted to numeric…"
 "Variable _____ has been defined as both character and numeric"
 "WARNING: Multiple lengths were specified for the BY variable _____ by input data sets. This may cause unexpected results."
SAS Dataset Creation
To create permanent datasets for analysis:

Recode missing values used in the raw data tables/files to
appropriate SAS missing values. For example, if 9's were used to
indicate missing data for numeric fields in a data table then these
should be converted to .'s.

Calculate appropriate summary scores (e.g., AUDIT-3, BMI).

Calculate differences between dates such as time from enrollment to
ART initiation.

Label all calculated and created variables.

Attach formats to the variable values where necessary.
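
The steps on this slide might look roughly like the following. This is a sketch under stated assumptions, not the author's program: the library path, the output dataset name, and the variables traveltime, gender, weight (kg), height (cm), enrolldate, and artstart are all hypothetical.

```sas
/* Sketch only: path, dataset, and variable names are hypothetical. */
libname h 'c:\kenya\hiv' ;
options fmtsearch=(h work) ;      /* look for formats in library H first */

proc format library=h ;           /* store formats with the datasets */
  value $sexf 'F' = 'Female'
              'M' = 'Male' ;
run ;

data h.analysis ;
  set work.demog ;

  /* Recode the raw missing-value code 9 to SAS missing */
  if traveltime = 9 then traveltime = . ;

  /* Summary score: BMI = weight (kg) / height (m) squared */
  if height > 0 then bmi = weight / (height / 100)**2 ;

  /* SAS dates are day counts, so a difference is a number of days */
  days_to_art = artstart - enrolldate ;

  label bmi         = 'Body mass index (kg/m2)'
        days_to_art = 'Days from enrollment to ART initiation' ;

  format gender $sexf. enrolldate artstart date9. ;
run ;
```

Recoding 9 to missing is only safe for fields where 9 is not a legitimate value, so the recode is written per variable rather than applied to every numeric field.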
Cleaning Data in SAS
Create a cleanup program.

Generate frequencies, means, and univariates to better
understand the dataset and to check for invalid data.

Plot the data.

For the numeric and date fields look at minimums and
maximums to verify all values are within expected range.

Locate duplicate records and resolve.

Compare fields when appropriate (e.g., check dob against age,
confirm date of initial visit < date of follow-up).
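
A cleanup program along these lines could start as follows. It is a minimal sketch that reuses the WORK.VISIT dataset and variables from the earlier log excerpt; the gender variable in WORK.DEMOG is an assumption.

```sas
/* Frequencies, including missing, for categorical fields */
proc freq data=work.demog ;
  tables gender / missing ;       /* gender is a hypothetical variable */
run ;

/* Counts, missing counts, and ranges for numeric fields */
proc means data=work.visit n nmiss min max mean ;
  var age weight height bmi cd4 ;
run ;

/* Distribution details and plots for one variable at a time */
proc univariate data=work.visit plot ;
  var weight ;
run ;

/* Date range, printed as dates rather than day counts */
proc sql ;
  select min(apptdate) as first_appt format=date9.,
         max(apptdate) as last_appt  format=date9.
    from work.visit ;
quit ;

/* Duplicates: extra records per patient and date go to VISIT_DUPS */
proc sort data=work.visit out=visit_sorted
          nodupkey dupout=visit_dups ;
  by patientid apptdate ;
run ;
```

Writing the sorted data to a new dataset with OUT= leaves the input untouched, and VISIT_DUPS can then be printed and reviewed.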
Cleaning Data in SAS (cont.)

Identify important fields such as summary
scores and verify their values.

Merge all longitudinal datasets to identify date
inconsistencies, variable format inconsistencies,
and to locate missing questionnaires.

Merge cross-sectional (demographics) dataset
with longitudinal datasets to identify subjects in
one but not the other.
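
The cross-check between demographics and visit data can be sketched with IN= flags on a MERGE. The example assumes that WORK.DEMOG contains patientid, which the source does not confirm; the output dataset names are illustrative.

```sas
/* One record per patient from each source */
proc sort data=work.demog out=demog_ids(keep=patientid) nodupkey ;
  by patientid ;
run ;

proc sort data=work.visit out=visit_ids(keep=patientid) nodupkey ;
  by patientid ;
run ;

/* IN= flags record which input contributed to each BY group */
data demog_only visit_only ;
  merge demog_ids(in=in_demog)
        visit_ids(in=in_visit) ;
  by patientid ;
  if in_demog and not in_visit then output demog_only ;
  else if in_visit and not in_demog then output visit_only ;
run ;
```

The log then reports how many subjects fall in each group, and either dataset can be printed for follow-up.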
SAS Program Files
Save all logs and outputs from SAS programs,
especially when creating analysis datasets for
publication.
 Naming conventions – studyx.sas, studyx.log,
studyx.lst
 Only the program that generates the permanent
dataset should overwrite it.
 Never overwrite a permanent dataset (even with
a proc sort) from any other program.
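
When a program is run in batch mode, SAS writes studyx.log and studyx.lst automatically. In an interactive session, PROC PRINTTO is one way to save them; the sketch below uses a hypothetical project path, and PRINT= captures listing output only.

```sas
/* Route the log and listing to files; NEW replaces any existing file */
proc printto log='I:\projects\studyname\studyx.log'
             print='I:\projects\studyname\studyx.lst' new ;
run ;

/* ... data steps and procedures of studyx.sas go here ... */

/* Restore the default log and listing destinations */
proc printto ;
run ;
```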

Documentation
Internally document SAS programs. At minimum
include file name, location, purpose, author,
date, and revisions.
 May be helpful to include the names of any
permanent SAS datasets created within the
program.
 All SAS printouts should have at least one title,
which includes the project name. (“title”
statement)
 It’s helpful to use the footnote statement to display
the path and file name of the SAS program on
the listing. [EX: footnote
'I:\alz\clin\cperm.sas' ; ]
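
A program header covering these items might look like the sketch below. The field layout is only a suggestion, and the entries in parentheses are placeholders to be filled in.

```sas
/*--------------------------------------------------------------
  File:      cperm.sas
  Location:  I:\alz\clin\
  Purpose:   (what the program does)
  Author:    (name)
  Date:      (date written)
  Revisions: (date, initials, what changed)
  Creates:   (permanent SAS datasets written by this program)
--------------------------------------------------------------*/

title1 '(project name)' ;                /* appears on every page */
footnote1 'I:\alz\clin\cperm.sas' ;      /* path of this program  */
```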

Documentation (cont.)

If any variable values have been
formatted, include a copy of the “proc
format” section in the documentation.

Generate form keys.

Provide a description of any variables
included in the datasets that are not found
on the form keys.
Documentation (cont.)

Detailed algorithms of how summary scores are calculated should
include the following:
a. which variables are used to calculate which summary scores
b. which variables (if any) are recoded and how
c. what is the minimum number of non-missing items needed
to calculate the score
d. how are missing values addressed. Typically when
calculating a total or sum score the mean should be imputed for
missing data. If the summary score is a mean itself then the
missing data can be ignored. In both of these cases it is essential
that c. above is followed and that summary scores are coded as
missing if there is insufficient data to calculate.
e. what is the meaning of the score and how is it scaled.
Indicate the possible range and how a high score differs from a low
score. For example include something like “Higher score indicates
more depression”.
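
Items c. and d. can be sketched in a data step. The scale here is hypothetical: ten items q1-q10 in a dataset WORK.ITEMS, with at least eight answered items required. "Mean imputed" is taken to mean the mean of the subject's own answered items, which is an assumption about the intended method.

```sas
data scores ;
  set work.items ;                      /* hypothetical input dataset */

  n_answered = n(of q1-q10) ;           /* count of non-missing items */

  if n_answered >= 8 then do ;
    /* Mean score: MEAN ignores missing items */
    mean_score  = mean(of q1-q10) ;
    /* Sum score with each missing item replaced by the mean of the */
    /* answered items, which equals 10 times that mean              */
    total_score = 10 * mean_score ;
  end ;
  else do ;
    /* Too few items answered: score is set to missing */
    mean_score  = . ;
    total_score = . ;
  end ;

  label n_answered  = 'Number of items answered'
        mean_score  = 'Mean item score'
        total_score = 'Total score (mean-imputed)' ;
run ;
```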
SAS General Notes

If the study is longitudinal, at least two datasets are
needed: one containing the demographics and other
information which does not change over time; and one
containing the data for multiple time points.

Never put cross-sectional variables such as gender in the
longitudinal dataset.

Format all date fields with 4-digit year (ddmmyy10. or
date9.)

Choose data type numeric whenever possible.
Distributing SAS Datasets
After a senior data manager has reviewed the datasets and documentation, the
statistician should be given READ ONLY access to:

The form keys

All appropriate SAS datasets (should have the extension .sas7bdat)

A description of any variables included in the datasets that are not found on
the form keys

Notes on calculation of the summary scores

Proc format statements

Any other documents or notes which would further explain the data.
Distributing SAS Datasets (cont.)
Statisticians should not be given, or have access to:
 Any Protected Health Information (PHI) such as study
subject’s name, address, phone numbers, social security
number, hospital ID number. Date of birth should be
included only if absolutely necessary; usually age can be
calculated and given instead.

Your SAS generation programs. These often contain
PHI. If you must share SAS programs with the
statisticians, please carefully review the programs and
then copy to a separate folder to which they have read
access rather than giving access to your main folder.
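
Replacing date of birth with age and dropping identifiers before distribution could be sketched as follows. The variable names (dob, enrolldate, name, address, phone) and the output dataset name are hypothetical, and the 'AGE' basis of YRDIF requires SAS 9.3 or later.

```sas
libname dist 'I:\projects\studyname\Datasets' ;

data dist.demog ;
  set work.demog ;

  /* Age in completed years at enrollment */
  age = floor(yrdif(dob, enrolldate, 'AGE')) ;
  label age = 'Age at enrollment (years)' ;

  /* Remove PHI from the distributed copy */
  drop name address phone dob ;
run ;
```

If a name in the DROP list does not exist in the input, the log shows a "has never been referenced" warning, which is worth checking before release.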
Distributing SAS Datasets (cont.)
For your own records at minimum, you should have:

A copy of everything you give to the statistician and the date given.

A copy of the log of all the SAS programs especially those that create any
permanent SAS datasets which were passed along to others

Grant protocols, meeting notes, scoring algorithms, instructions for data
entry, corrections made, etc.

It may be helpful to maintain a subdirectory that exactly mirrors the
subdirectory of the PC where the data is actually being entered. This
subdirectory would include all the RDMS programs, format files, and tables.

For longitudinal studies in particular, it is important to archive the datasets
and SAS programs/logs that were used for analysis for abstracts, papers,
grant proposals, and other publications.
Organizing Project Folders

Example of folder structure:
– I:\projects\studyname – contains raw data, documentation, SAS
programs, etc.
– I:\projects\studyname\Datasets – stores datasets that have been
approved for distribution. May also include the SAS formats in this
folder. Statisticians should have READ ONLY access to this folder.
– I:\projects\studyname\Keys – stores the form keys, the scoring
algorithms and other data documentation. Statisticians should have
READ ONLY access to this folder.
– I:\projects\studyname\Grant – stores the original grant application,
protocols, papers, etc. All data management staff and statisticians
involved in this project should have full access to this folder.
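
Folder permissions are set at the operating-system level, but a program that only reads the distributed datasets can also protect them by assigning the library as read-only. A minimal sketch, with a hypothetical libref and dataset name:

```sas
/* ACCESS=READONLY prevents this session from writing to the library */
libname study 'I:\projects\studyname\Datasets' access=readonly ;

proc contents data=study.analysis ;   /* hypothetical dataset name */
run ;
```

Any step that tries to write to the STUDY library then fails with an error instead of overwriting a permanent dataset.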
DM Working with Biostatisticians
Attend study meetings
 Date all documents and meeting notes
 Comment on proposed study changes
 Understand the statistical analysis plan
 Review statistical reports (preferably
before they are presented to the research team)
 Review and critique abstracts/manuscripts
