Generation of a Clinical Data Warehouse across Multiple Companies
Sarbjit Rai, Genentech, Inc., South San Francisco, CA
Abstract
This paper describes the generation of a clinical data warehouse and the issues which can arise when collaborating across multiple companies: what a data warehouse is, why we need it, how it is put together, the software and standards used, and the common issues which arose and the solutions which were utilized to successfully generate the final product.
Data Warehouse Charter
As one of their first tasks the team put together a data warehouse charter, which was a Word document used to identify what needed to be done, and how, in order to produce a successful data warehouse product. This charter was a live document which was regularly updated by team members as new ideas, rules, issues or timelines were discussed. All information regarding the structure and content of the data warehouse, and what it would/would not include, was documented in the Charter. This document was constantly reviewed and referenced and proved very useful to the team during the development of the project.
Introduction
The Clinical Data Warehouse project was a joint development
project between three Pharmaceutical and Biotech Companies
from the United States (East and West Coast) and Europe. The
aim of this data warehouse was to enable all three companies to
store and share information and integrate clinical data for
submission purposes. Each company conducted their own
clinical studies on drug X according to their own SOPs and
company processes. Since each company had different clinical
databases in terms of structure, content and format, a Common
Data Model was deemed necessary to enable integration of data
for USA and European submission filings. The common data
model consisted of SAS datasets which were in a standard
format agreed upon up front by all three companies.
This paper will focus on the key elements involved in the development of a clinical data warehouse, the issues and challenges that the team experienced during project development, the tools and processes used, and the lessons learned.

Team Members and Meetings
A data warehouse team was put together, consisting of statisticians, programmers and data management staff from each company. The team initially met at one site for a start-up meeting and then either weekly or monthly via teleconference. The initial face-to-face meeting was set up to allow everyone to meet the team, identify what data was to be collected, put together initial ideas of what the data warehouse would consist of in terms of structure and content, and set initial timelines for completion of the work. At least one programmer and statistician was assigned from each company to work on this project and act as the primary company contact for all queries by the team. This team met regularly to discuss project status, identify the team goals, review timelines for completion of specific tasks and identify any ongoing issues which needed to be resolved. Minutes were taken at each meeting, documenting actions required and decisions made.

Data Warehouse Structure and Content
This data warehouse was essentially a web-based system which resided in one location (company X) but could be accessed by all team members across companies. It consisted of the following three components:
• Description and documentation
• Clinical data (SAS data sets)
• Data displays (listings, tables, figures)
The data warehouse hence consisted of SAS data and supporting documentation from studies conducted by all three companies, which could subsequently be accessed by all three companies and integrated for submission purposes. The data warehouse project itself essentially consisted of three components:
• Development of the data warehouse
• Development of SAS programs and loading of test data to ensure the process actually worked
• Final programming and loading of real data into the data warehouse once studies were complete
The documentation essentially consisted of Excel spreadsheets and Word documents covering:
• The format and content of all the datasets in the common data model
• Supporting documents (protocols, annotated CRFs, statistical analysis plans) for each study
The clinical data consisted of the following three components:
• Company specific study datasets
• The integrated datasets
• The filing datasets
The company specific datasets were the original SAS datasets generated by each individual company according to their own structure, format and SOPs, and used for individual study reporting. These data sets did not require any additional manipulation prior to their transfer into the data warehouse and were copied into the data warehouse (with supporting documentation) on completion of the study.
The integrated datasets were the new datasets which were mapped by each company according to the common data model to allow for integration of the data for submission purposes. Twenty-seven data sets were identified in total by the team (e.g. demographics, labs, medical history) to be mapped from individual study data and included in the common data model. This data consisted of all safety and efficacy data collected within the various studies, plus a large patient dataset (the PAT file) which contained mainly derived variables required for statistical analyses. Where possible, efforts were made to ensure there were minimal differences between the derivations for similar variables across studies and companies.
The filing datasets were the final datasets generated by each company from the integrated datasets for submission purposes. These did not have to have a common structure and could be company specific.
The data displays consisted of listings, tables and graphs from the individual studies and submission files.
All data and documentation was stored in individual study directories within each company directory.
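The flow from company specific datasets to the integrated datasets can be sketched as below. This is an illustrative Python sketch only: the project's actual programming was done in SAS, and every in-house variable name and mapping shown here is hypothetical.

```python
# Illustrative sketch only -- the project's real work was SAS programming,
# and the company-specific variable names below are invented for the example.

# Agreed common data model variables for a demographics dataset
COMMON_DEMOG = ["STUDY", "CENTER", "PATNUM", "AGE", "SEX", "RACE"]

# Hypothetical mappings from each company's in-house names to the common model
COMPANY_A_MAP = {
    "PROT": "STUDY", "SITE": "CENTER", "PT": "PATNUM",
    "AGE": "AGE", "GENDER": "SEX", "RACE": "RACE",
}
COMPANY_B_MAP = {
    "STUDYID": "STUDY", "CTR": "CENTER", "PATNO": "PATNUM",
    "AGEYRS": "AGE", "SEX": "SEX", "RACECD": "RACE",
}

def map_to_common(record: dict, mapping: dict) -> dict:
    """Rename one company's variables to the common data model names."""
    return {common: record[own] for own, common in mapping.items() if own in record}

def integrate(*mapped_studies):
    """Pool mapped records from all studies/companies into one integrated
    dataset: a single list of records sharing the common structure."""
    return [record for study in mapped_studies for record in study]

company_a = [{"PROT": "X01", "SITE": "001", "PT": 101,
              "AGE": 34, "GENDER": "F", "RACE": "ASIAN"}]
company_b = [{"STUDYID": "X02", "CTR": "014", "PATNO": 202,
              "AGEYRS": 51, "SEX": "M", "RACECD": "WHITE"}]

integrated = integrate(
    [map_to_common(r, COMPANY_A_MAP) for r in company_a],
    [map_to_common(r, COMPANY_B_MAP) for r in company_b],
)
print(integrated[0]["STUDY"])  # X01
```

Once every record carries the same variable names, the pooled data can feed the company specific filing datasets without further structural work.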
How the Team Worked
The main bulk of the work involved developing specifications for the twenty-seven datasets going into the common data model. Communication was done mainly via email and phone calls, with teleconferences held bi-monthly. Initially, programmers and statisticians had separate teleconferences to discuss their own specific data issues and develop their own data sets for the common data model (the three main efficacy data sets were developed by the statisticians and the remaining data sets were developed by the programmers). Later, joint meetings were held to discuss the status of the common data model and come to agreement on common issues regarding data format and content. Occasional video conferences and face-to-face meetings were also held once a specific development milestone had been met.

Common Data Model Structure
A number of core variables were agreed by the team to be included in all datasets. These included FDA required variables (study, center, patient, age, sex, race) and some additional variables deemed necessary by the team (e.g. start date of study medication). Where possible, the team tried to map like variables with like across the three companies to save on space and ensure easy integration of the data. The core list included the following variables:

Variable   Type/Length   Label
STUDY      $17           Clinical Study
CENTER     $8            Center Number
PATNUM     8             Subject ID
AGE        8             Age in years
SEX        $6            Sex
RACE       $40           Race
EVALE      $3            Per Protocol Population
EVALS      $3            Safety Evaluable Population
EVALR      $3            … Population
FSTDXDC    $9            Date of First Study Drug
TRTN       8             Treatment Group Number
TRTC       $80           Treatment Group Text

In addition, the following formatting rules were used across all datasets for dates and times:

Variable name   Length   Remark
XxxDC           $9       Character date, format DDMMMYYYY
XxxTC           $8       Character time, format HH:MM:SS

Only the original date that was collected on the CRF was to be stored in these fields. For Genentech this meant stripping out any default days or months before including the date variable in the common data model, since at Genentech all missing days and months are defaulted to 15th June in ORACLE CLINICAL.

Efforts were also made to come up with a standard naming convention for the variables, but it became clear as the data sets were being developed that it would be easier for the programmers from each company to use their original names where possible in at least some of the datasets, since most of the data in the common data model would be converted back to company specific names and formats before submission programming began, to allow each company to use its standard in-house code for submission programming. The naming conventions used across datasets were therefore not always consistent, since some datasets used Genentech variable names for the common data model and others used Company X or Company Y variable names.

The specifications for each common data model dataset were documented in a separate Excel spreadsheet (e.g. one for DEMOG, one for LAB, etc.). All Excel spreadsheets contained company specific information as well as the common data model information. The following information was initially documented in Excel by each company for each data set:
• SAS variable name
• SAS label
• Derivation (if applicable)
The following specifications were then agreed upon by the team for the common data model:
• SAS variable name
• Type/length
• Format
• SAS label
The following standard template was used for all specifications, with one row per variable (e.g. Patnum, Age, Race, Date):
• Common Data Model: variable name, type/length, format, label
• Genentech: name, label, derivation
• Company B: name, label, derivation
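The character-date rule (XxxDC, $9, DDMMMYYYY) can be sketched as below. This is an illustrative Python sketch, not the team's SAS code, and the handling of imputed date parts (blanking them rather than storing the 15 June default) is an assumption based on the stripping rule described above.

```python
from datetime import date

# Assumption from the paper: in Genentech's ORACLE CLINICAL setup,
# missing days and months are imputed to 15 June, and those imputed
# parts must be stripped before loading into the common data model.
DEFAULT_DAY = 15
DEFAULT_MONTH = 6

MONTHS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]

def to_xxxdc(d: date, day_imputed: bool = False, month_imputed: bool = False) -> str:
    """Build the $9 character date (DDMMMYYYY), keeping only the parts
    actually collected on the CRF and blanking any imputed parts."""
    dd = "  " if day_imputed else f"{d.day:02d}"
    mmm = "   " if month_imputed else MONTHS[d.month - 1]
    return f"{dd}{mmm}{d.year:04d}"

# A fully collected date:
print(to_xxxdc(date(1999, 3, 7)))  # 07MAR1999
# Only the year was collected; the defaulted day and month are stripped:
print(to_xxxdc(date(1999, DEFAULT_MONTH, DEFAULT_DAY),
               day_imputed=True, month_imputed=True))
```

Keeping the field a fixed nine characters means partial dates still align with fully collected ones when the datasets are pooled.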
Process for Completion of the Work
Each Excel specification document was started by a programmer in one company, then passed on to a programmer in another company via email to complete his/her company specific sections. Once all three companies had added their individual study specific information, the common data model for that particular dataset was generated. One person was assigned to develop the common data model for a dataset.
The work was divided equally between the three companies, so that each programmer (or statistician) was responsible for generating XXX number of datasets. The other two programmers were then asked to review and comment if they had any queries regarding the common format. Once a specific dataset had been completed by all three companies, the primary programmer and statistician from each company was required to review, approve and sign off the specification via email. The final specification was then posted to the shareweb repository, and mapping programming and testing of the actual process could begin.

Issues and Challenges
The main challenge which the team faced during development of this project was in determining the data warehouse structure and content and coming to agreement on the standard naming conventions, formats, labels and content of the twenty-seven SAS data sets. The aim was to minimize the programming work required by all three companies without impacting quality, and hence produce a viable end-product that could be used by all three companies for exploratory analyses and submission purposes.
Each company had its own CRFs (case record forms) which collected data for each study in a particular format. In addition to the normal differences between studies which a programmer faces when trying to integrate data from many studies within one company, we were faced with the additional challenge of differences between the three companies in:
• Databases used (structure and content)
• Data dictionaries used (COSTART, WHOAE, MedDRA)
• Programming standards, naming conventions, versions of SAS
• Formats, algorithms and derivations used for similar variables
Each company also had several clinical studies from phase I (small scale volunteer studies) to phase III (large scale pivotal trials) which were in various stages of development. Some of these studies were ongoing and others were already complete. For ongoing studies, programmers were already familiar with the study design, content and programming, making the mapping to the common data model a fairly straightforward task. For studies that were already complete there were added complications in becoming familiar with an older study which the programmer may not have originally worked on: familiarizing themselves with the study design, formats, derivations and original programming rules sometimes required investigation and added an additional learning curve to an already complex task.
A decision had to be made regarding which data dictionary we would use to code the medications for all studies going into the data warehouse, to ensure the data could be integrated if required. Numerous discussions were held regarding version control of the dictionaries used to ensure consistency across companies, and whether it would be easier to have one company code all the data or for each company to do their own.
The data warehouse was physically located within a shareweb system at one company only. Access was restricted to a specific group of people in statistics and data management dealing with the exchange and analysis of data (essentially the data warehouse team). Difficulties were initially faced by data warehouse team members in the other two companies in obtaining authorization and access to the system.
Working in three different time zones also made meetings and fast resolution of issues a challenge, since one person would be asleep or at home whilst another was working. However, this sometimes worked in our favor, since it meant different people could work on the same documents at different times and then pass them on to the next person in a logical fashion.
Updating the Excel documents by all three companies was also a challenge, since it involved passing Excel sheets from one company to the next via email and remembering who had the latest version.

Lessons Learned
This project is still ongoing. Initial data specifications have been completed and testing/programming is now underway. Some of the lessons learned so far are documented below.
The importance of documenting the rules and algorithms used in study reporting (particularly for older studies) was clearly shown: missing documentation can hinder development at a later date if the data is required for future analyses (or integrated reporting). The importance of standards and good programming practices was also highlighted. In particular, having industry wide standards for variable names, formats, labels etc. would be helpful for future data warehouse projects across international companies. Efforts are already underway in the industry to collaborate with the FDA to come up with standards in this area; however, this in itself would be another topic to present on another date!
The development and maintenance of the data warehouse charter proved useful throughout this project, acting as a reference and reminding team members of what needed to be done and when.
Having initial face-to-face meetings allowed the team to meet one another and form bonds which provided a useful basis for teamwork and helped build team rapport, which can sometimes be difficult if you are only in touch via email. Also, having bimonthly teleconferences ensured regular communication with team members to review status and upcoming timelines and to discuss any issues which could not be dealt with via email.
In hindsight, use of a standard naming convention for all SAS variables would probably have made this a more portable system which could then be sent to the FDA (or other external customers) if required. Finally, planning upfront and having team collaboration throughout was essential and is important for the successful development and completion of any multi-company project.
Contact Information
Your comments and questions are valued and encouraged.
Contact the Author at:
Sarbjit Rai
Genentech Inc.
1 DNA Way
South San Francisco, CA 94080-4990
E-mail: [email protected]
Phone: (650) 225 4629