Warehousing Clinical Pharmacogenomics Data
Philip M. Pochon, Computer Task Group, Inc., Indianapolis, IN
Julie Heise, Covance CLS, Indianapolis, IN
Abstract
On June 26, 2000 Celera and the Human Genome Project
jointly announced that the first reading of the human
genome was complete. On Feb 13, 2001 the two
presented the first full genotype maps of the genome. For
the pharmaceutical industry there can be little doubt that
pharmacogenomics will be an ongoing revolution; and as
an emerging discipline, clinical pharmacogenomics
provides immense opportunities in the areas of drug
development and medical therapies.
In order to successfully integrate pharmacogenomics into
clinical trials, we must recognize the data management
issues that the nature of pharmacogenomic data and the
state of the science raise. These issues include:
• What types and forms of pharmacogenomic
information must be managed?
• How should we handle changing data in an emerging
discipline?
• How is pharmacogenomic information to be used to
improve clinical trials?
• How should pharmacogenomic data be presented to
clinical trial researchers?
• How can we meet evolving government regulations
that define the patient’s genetic privacy rights?
The focus of this paper is not on resolving the above
issues, but on how the SAS® suite of tools (in particular
the SAS/Warehouse Administrator® tools) can be the
means for developing robust data management solutions
that will support the requirements derived from these
issues and any future pharmacogenomics concerns.
Pharmacogenomics Overview
Pharmacogenomics at its broadest is the application of
genetic information to determine the right dose of the right
drug for the right patient. Pharmaceutical interest in
genetic information has initially focused on two areas:
• Human genetic variability that affects metabolism of
a drug.
• Viral genetic variability that affects resistance to a
drug.
The goal is to take a sample population and group it into
responders, non-responders and adverse responders
(Figure 1).
[Figure 1: Therapeutic Outcome by Patient Genotypes, with
the Responders identified.]
This population sub-setting may be done prospectively
(during patient screening), retrospectively (to explain why
a study’s outcome was less than ideal) or over time (to
monitor viral mutations that confer resistance to the study
drug).
Pharmacogenomics starts with the DNA sequence for a
gene (or sequences for several genes) and selects key
locations that are likely to affect a patient’s response to a
drug. A SNP (Single Nucleotide Polymorphism) is the
term used for such a location. The values of the key SNP
(or SNPs) define the genotypes. Genotypes are then
grouped into phenotype response groups (Figure 2).
[Figure 2: A Sequence (A T T G G C) with two key
locations (SNP 1, SNP 2) yields Genotypes (TG, TC, AC,
AG, GA), which are grouped into Responder,
NonResponder and Adverse Responder.]
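The sequence-to-SNP-to-genotype-to-phenotype chain described above can be sketched in a few lines. The following is a minimal illustration in Python (a stand-in for whatever analysis engine is actually used, not SAS code from this paper). The SNP positions, the function names and the genotype-to-phenotype table are all assumptions made for the example.

```python
# Hypothetical 0-based SNP positions within the example sequence.
SNP_POSITIONS = (2, 3)

# Illustrative grouping only; a real grouping comes from the genotype
# definition, which may change during or after a trial.
PHENOTYPE_GROUPS = {
    "TG": "Responder",
    "AC": "NonResponder",
    "GA": "Adverse Responder",
}


def call_genotype(sequence, snp_positions=SNP_POSITIONS):
    """Concatenate the bases found at the key SNP locations."""
    bases = sequence.replace(" ", "").upper()
    return "".join(bases[pos] for pos in snp_positions)


def classify(sequence):
    """Return (genotype, phenotype group) for one patient sequence."""
    genotype = call_genotype(sequence)
    # Genotypes without a definition yet are reported, not dropped.
    return genotype, PHENOTYPE_GROUPS.get(genotype, "Undefined")


print(classify("A T T G G C"))
```

Returning "Undefined" rather than raising an error reflects the point made in the text: many genotypes of interest are still being defined, so an unclassified genotype is an expected state.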
Many genotypes of interest to pharmacogenomics are still
undefined, or being defined. Independent research may
add a SNP to the genotype at any time during or after the
clinical trial. The basic units that organize
pharmacogenomic data are still changing. Having
multiple analysis steps when definitions are still changing
has three major implications for any pharmacogenomics
data management system:
• All levels of pharmacogenomic information must be
saved.
• The lower levels (raw machine data and sequences)
must be stored in their original formats, so
reinterpretation is possible.
• Completeness checks and version control of
pharmacogenomic information are critical.
Pharmacogenomic Data Types and Formats
DNA sequences are the core of pharmacogenomic data.
Sequencing equipment usually provides a patient’s
sequence in a vector form, and a graph of the observed
values from which the four-character genetic code (A, C,
G, and T) is derived. Patient genotype/phenotype
definitions derived from the sequence may be in a
structured data table, a structured data document or
embedded in formatted reports or graphs.
Any pharmacogenomics data management system must
thus be able to handle:
• Structured data tables (SAS data sets)
• Structured data documents (XML)
• Unstructured documents (reports and graphs)
Due to the variety of file formats, graphs, reports, and
documents generated by lab equipment, warehouses must
have the ability to store unstructured data types. These
are not only the “raw” data for interpretation, but the basis
for reinterpretation should the genotype definition change.
The SAS/Warehouse Administrator Release 2 is ideally
suited to managing pharmacogenomics data.
Conventional databases are table based and require the
data be aggregated within the database itself. A SAS data
warehouse imposes no such restrictions. The key to a SAS
data warehouse is the use of metadata, which is data about
the data (see Pochon and Burger, 2000, for an in-depth
discussion of metadata).
The Operational Data Definition (ODD) metadata layer of
the SAS/Warehouse Administrator is designed to define
and group various data types and storage modes.
ODD metadata defines:
• the structure of inputs for the warehouse (Data Files
and External Files) and the process by which they or
their data will be brought into the warehouse
• the structure of Data Stores (Detail Data and
Summary Data) within the warehouse
Load Steps can be created as a metadata record within the
SAS/Warehouse Administrator. A Load Step is a process
which reads data from one or more Data Files or External
Files, creates an instance of a Data Store and loads the
target Data Store. Import and Export engines for
pharmacogenomics data warehouses have been developed
by the iBiomatics® subsidiary of the SAS Institute.
Data Stores may be within the warehouse structure itself
(Figure 3), or reside outside the warehouse proper. A
SAS/Access® View of an RDBMS table can be as easily
referenced as a SAS table. By explicitly defining the data
type and the location of Data Stores, the administrator
creates a map that defines not only where the information
is located, but also how to read it.
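The idea of a metadata map that records both where a Data Store lives and how to read it can be illustrated with a small sketch. This is Python used purely for illustration; it does not reproduce the SAS/Warehouse Administrator metadata format. The store names, locations, sample contents and reader functions are all hypothetical, and an in-memory dictionary stands in for real file or database storage.

```python
import csv
import io
import xml.etree.ElementTree as ET


def read_table(text):
    """Structured data table: parse into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def read_xml(text):
    """Structured data document: flatten child elements into a dict."""
    root = ET.fromstring(text)
    return {child.tag: child.text for child in root}


def read_raw(text):
    """Unstructured document: hand back the original content untouched."""
    return text


READERS = {"table": read_table, "xml": read_xml, "document": read_raw}

# Stand-in for physical storage (made-up sample content).
STORAGE = {
    "lib/genotypes.csv": "patient,genotype\nP001,TG\n",
    "docs/report.xml": "<report><patient>P001</patient>"
                       "<genotype>TG</genotype></report>",
    "raw/sequence_graph.txt": "raw trace values",
}

# The metadata map: data type plus location for every Data Store.
METADATA = {
    "Genotype Data": {"type": "table", "location": "lib/genotypes.csv"},
    "Genotype Report": {"type": "xml", "location": "docs/report.xml"},
    "Sequence Graph": {"type": "document",
                       "location": "raw/sequence_graph.txt"},
}


def fetch(name):
    """Look up a Data Store by name and read it with the right reader."""
    entry = METADATA[name]
    return READERS[entry["type"]](STORAGE[entry["location"]])


print(fetch("Genotype Data"))
```

A caller asks for a store by name only; the type and location stay in the metadata, so a store can be moved or converted without changing the code that uses it.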
[Figure 3: Warehouse Metadata - Project X, mapping a SAS
data set, an XML markup file, sequence graphs and a
formatted genotype report.]
The explicit grouping capabilities of the ODD layer are
also critical in managing pharmacogenomic data (Figure
4). Pharmacogenomic data is layered data: sequences
provide the basis for SNP definition, SNP variability is
interpreted to provide genotypes, and genotypes are
grouped to define phenotypes.
[Figure 4: Pharmacogenomic Warehouse Metadata Layer.
Project A has an ODD Group for a single SNP genotype
(Sequence Data, Sequence Graph, Genotype Report).
Project B has an ODD Group for a multiple SNP genotype
(Sequence Data and Sequence Graph for each of SNP1,
SNP2 and SNP3, plus Genotype Data).]
The ODD grouping of the data provides a metadata-based
organization and audit trail of the interpretation steps.
The SAS/Warehouse Administrator automatically
documents the data flows so information can be traced
from its source throughout the warehouse. The metadata
thus provides an audit trail of which sequence produced
which SNP(s), which SNP(s) produced which genotype,
and which genotype produced which phenotype.
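As a rough illustration of such an audit trail, the sketch below records, for each derived item, the item it was derived from, and walks the chain back to the raw sequence. It is written in Python for illustration only. The record identifiers and the `derived_from` layout are assumptions, not the metadata structure the SAS/Warehouse Administrator actually keeps.

```python
# Each record names the record it was derived from (None for raw data).
# Identifiers are made up for the example.
DERIVED_FROM = {
    "phenotype:P001": "genotype:P001",
    "genotype:P001": "snp:P001",
    "snp:P001": "sequence:P001",
    "sequence:P001": None,
}


def trace(record, derived_from=DERIVED_FROM):
    """Return the chain from a record back to its original source."""
    path = []
    while record is not None:
        path.append(record)
        record = derived_from[record]
    return path


for step in trace("phenotype:P001"):
    print(step)
```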
Changing Data in an Emerging Discipline
Pharmacogenomic interpretations may be done
immediately, but often are performed against databases on
the web, or at a contract research laboratory. The data
group may thus be incomplete: the sequence has been
determined but the interpretation is not yet available. A
pharmacogenomic data warehouse must be able to track
the completeness of the interpretation stream, and
maintain multiple versions of the interpretation.
Completeness checks can be implemented in SAS data
warehouses by an Online Analytical Processing (OLAP)
summarization process which tracks the data available at
each level and summarizes this for the patient and the
project as a whole (see Wright, 2000, for a discussion of
OLAP in the SAS/Warehouse Administrator Release 2).
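The kind of summary such a completeness check produces can be sketched as follows. This is plain Python rather than an OLAP process. It only shows the logic of tracking which interpretation levels exist for each patient and rolling that up to the project. The level names and sample records are assumptions for the example.

```python
# Interpretation levels, lowest to highest (names are illustrative).
LEVELS = ("sequence", "snp", "genotype", "phenotype")


def completeness(available):
    """Summarize missing levels per patient and for the project.

    `available` is an iterable of (patient, level) pairs, one pair for
    each piece of information already loaded into the warehouse.
    """
    seen = {}
    for patient, level in available:
        seen.setdefault(patient, set()).add(level)
    per_patient = {
        patient: [lvl for lvl in LEVELS if lvl not in levels]
        for patient, levels in seen.items()
    }
    complete = sum(1 for missing in per_patient.values() if not missing)
    project = {"patients": len(per_patient), "complete": complete}
    return per_patient, project


records = [
    ("P001", "sequence"), ("P001", "snp"),
    ("P001", "genotype"), ("P001", "phenotype"),
    ("P002", "sequence"),  # interpretation not yet returned by the lab
]
per_patient, project = completeness(records)
print(per_patient)
print(project)
```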
When a genotype definition changes a different problem
arises. When a SNP is added, the sequence must be
reanalyzed to produce a new genotype/phenotype. This
new interpretation must be stored as a new version of the
patient genotype, with a date stamp to indicate the
effective date of the reinterpretation (Figure 5).
[Figure 5: Pharmacogenomics Data Warehouse Metadata
Layer, Version Control. Elements shown: Sequence Data,
Sequence Export Engine, Sequence File, Genotype
Definition, Genotype Load Engine, Genotype Version 1 and
Genotype Version 2.]
The need to reanalyze sequences requires that a
pharmacogenomics data management system be able to
store the sequence data in its original format so that data
can be exported to the genotype analysis engine in
analysis-ready form. When the new analysis is available, it
must be imported as a new version of the genotype and
linked to its sequence, with the old genotype remaining as
the original version.
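A minimal sketch of this versioning rule, in Python, is shown below. It assumes that each reinterpretation is appended as a new, date-stamped version linked to the same sequence and that earlier versions are never overwritten. The class names, field names and sample values are hypothetical and are not taken from the SAS/Warehouse Administrator.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class GenotypeVersion:
    version: int
    genotype: str
    effective: date      # effective date of the (re)interpretation
    sequence_id: str     # link back to the stored sequence


class GenotypeHistory:
    """All interpretations of one stored sequence, oldest first."""

    def __init__(self, sequence_id):
        self.sequence_id = sequence_id
        self.versions = []

    def add(self, genotype, effective):
        """Append a new version; existing versions are left untouched."""
        new = GenotypeVersion(len(self.versions) + 1, genotype,
                              effective, self.sequence_id)
        self.versions.append(new)
        return new

    def as_of(self, when):
        """Return the version in effect on a given date, if any."""
        candidates = [v for v in self.versions if v.effective <= when]
        if not candidates:
            return None
        return max(candidates, key=lambda v: (v.effective, v.version))


history = GenotypeHistory("seq-001")
history.add("TG", date(2001, 1, 15))    # original two-SNP definition
history.add("TGA", date(2001, 6, 1))    # after a third SNP was added
print(history.as_of(date(2001, 3, 1)))
```

Querying by date lets a report be reproduced with the genotype definition that was in force when the report was first run.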
Integrating Pharmacogenomics into Clinical Trials
Pharmacogenomic sequences, SNPs, genotypes and
phenotypes are results in and of themselves, and require
concise data views and reports. But the real utility of
pharmacogenomics in clinical trials resides in linking its
results to other clinical trial data. In this view, genotypes
and phenotypes serve as population markers (responder vs.
non-responder, susceptible strain vs. resistant strain),
rather than as results. The implication is that the ability to
link the pharmacogenomics warehouse to other clinical
trial warehouses is essential (see Koprowski and Fowler,
2000, for a similar situation with pharmacokinetic data).
One means of linking multiple sources of data is the Data
Mart. A Data Mart is a limited data warehouse or data
group created within the metadata of the warehouse and
designed to meet a specific need. Multiple Data Marts
can be created for different needs. The processes that push
warehouse data to a Data Mart select only the data
elements or types needed and can restructure tabular data
for easy merging and use in reporting. The
SAS/Enterprise Information System® (EIS) or
SAS/Enterprise Reporter® can draw data from the Data
Marts.
A second means is to link warehouses into a larger
structure. The SAS/Warehouse Administrator allows a
parent data warehouse to be constructed from child data
warehouses.
The parent warehouse metadata map does not always
point directly to data stores; it can point to ODD elements
within the child warehouses’ metadata (Figure 6). Reports
and data views that require only one class of data are
based on the child warehouses. Reports and data views
that must combine data are based upon the parent
warehouse, which provides the directions to the child
warehouses, which in turn provide the directions to the
specific Data Stores.
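The two-step lookup, parent to child warehouse and child warehouse to Data Store, can be illustrated with a short sketch. It is written in Python for illustration. The dictionary layout, element names and function names are assumptions, and the two sample patients simply echo rows of the paper's Figure 7.

```python
# Child warehouses, each holding its own Data Stores.
CHILD_WAREHOUSES = {
    "demographics": {
        "patients": [
            {"patient": 437218, "sex": "F", "age": 37},
            {"patient": 437271, "sex": "M", "age": 29},
        ],
    },
    "pharmacogenomics": {
        "genotypes": [
            {"patient": 437218, "hiv_genotype": "Type B"},
            {"patient": 437271, "hiv_genotype": "Type B"},
        ],
    },
}

# Parent metadata: points at elements of the child warehouses,
# not at the data itself.
PARENT_MAP = {
    "Patient Demographics": ("demographics", "patients"),
    "HIV Genotype": ("pharmacogenomics", "genotypes"),
}


def resolve(element):
    """Follow the parent map to a child warehouse, then to its store."""
    warehouse, store = PARENT_MAP[element]
    return CHILD_WAREHOUSES[warehouse][store]


def combined_report(elements, key="patient"):
    """Merge rows from several child warehouses on the patient key."""
    rows = {}
    for element in elements:
        for record in resolve(element):
            rows.setdefault(record[key], {}).update(record)
    return [rows[k] for k in sorted(rows)]


report = combined_report(["Patient Demographics", "HIV Genotype"])
for row in report:
    print(row)
```

Because the merge key is the patient identifier, any report built this way is subject to the patient privacy requirements discussed in the paper.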
[Figure 6: Parent Warehouse Metadata pointing to the
metadata of the child warehouses: Patient Demographics,
Pharmacogenomics Warehouse and Clinical Chemistry
Warehouse.]
Pharmacogenomic Data in Clinical Trials
In the example shown in Figure 7, a patient screening
report would draw its data from the parent warehouse,
which would direct a query to a patient data store to
retrieve patient demographics, to a pharmacogenomics
warehouse to retrieve the genotype and to a hematology
warehouse to retrieve the biomarkers for a report.
Figure 7
Multi-Warehouse Report
______Patient______    HIV
Number    Sex    Age   Genotype    CD4    Lymphocytes
437218    F      37    Type B      483    2.46
437271    M      29    Type B      509    2.68
437303    F      24    Type A      958    4.03
437361    M      32    Type C      783    3.79
Metadata-driven analysis and reporting engines (see
Burger and Pochon, 2000) work particularly well with
data warehouses, for the operational metadata that drives
these engines is stored in the warehouse’s ODD.
The summarization capabilities that are designed into the
SAS/Warehouse Administrator are a key feature for
presenting pharmacogenomic data. In Release 2, summary
data is stored in OLAP Tables or MDDB. OLAP data
stores may be organized into OLAP Groups.
OLAP data stores can work directly with the SAS/EIS
reporting tool, for SAS/Warehouse Administrator
metadata can easily be exported and automatically
registered in the SAS/EIS repository.
Population profiles for genotypes are a natural fit with
OLAP processing. Such profiles are particularly useful in
AIDS trials, where virus mutation to a resistant strain must
be monitored over time. The patient HIV genotypes over
time are the detail data. Time, treatment, genotype and
resistance are the dimensions and the OLAP Table
summarizes the population and its changes over time.
Patient Anonymity
The multi-warehouse report shown in Figure 7 rests upon
using the patient as the key identifier. A patient’s genetic
information is protected by government regulations that
protect a patient’s privacy (anonymity) and prohibit public
disclosure of genetic information. Any process that draws
data from a pharmacogenomics data warehouse must be
able to provide or hide the patient’s identity.
The SAS/Warehouse Administrator Process Editor
allows the creation of job streams that produce output data
structures. The job stream can contain a User Exit
process, which invokes a user written routine. In the case
of patient anonymity, this standard routine would check
the level of required patient anonymity against the output
data structure’s security metadata, and then either pass the
patient identifiers on, or substitute an untraceable patient
identifier from a randomized lookup table (Figure 8).
[Figure 8: Patient Genotypes are retrieved and the User’s
authorization is checked; either the Patient Identifiers or
the Randomized Identifiers are then retrieved and written
to the Output Table.]
The definition of what identifies a patient is not clear-cut.
Certainly the Patient Number or similar unique identifier
falls within this definition. But certain combinations of
data, such as investigator, patient initials and date of
birth may also be sufficient to identify an individual.
This is another area where change is moving faster than
industry and regulatory standards.
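A routine of the kind the User Exit process would invoke might look like the following sketch. It is written in Python for illustration and is not the SAS routine itself. The anonymity level names, the function names and the row layout are assumptions. Only the patient number is replaced here, so the caveat above about identifying combinations of other fields still applies.

```python
import secrets


def build_lookup(patient_ids):
    """Assign each patient a random token unrelated to the real number."""
    lookup = {}
    used = set()
    for pid in patient_ids:
        token = secrets.token_hex(8)
        while token in used:          # guard against a repeated token
            token = secrets.token_hex(8)
        used.add(token)
        lookup[pid] = token
    return lookup


def user_exit(rows, required_anonymity, lookup):
    """Pass identifiers through or substitute them.

    `required_anonymity` stands in for the security metadata of the
    output data structure; "identified" is a hypothetical level name.
    """
    if required_anonymity == "identified":
        return [dict(row) for row in rows]
    return [{**row, "patient": lookup[row["patient"]]} for row in rows]


rows = [
    {"patient": 437218, "hiv_genotype": "Type B"},
    {"patient": 437271, "hiv_genotype": "Type B"},
]
lookup = build_lookup(row["patient"] for row in rows)
for row in user_exit(rows, "anonymous", lookup):
    print(row)
```

The substituted identifiers are untraceable only for users who cannot see the lookup table, so the table itself would need to be held under the strictest access controls.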
Summary
Pharmacogenomics holds great promise for drug
development. In order to successfully integrate
pharmacogenomics into clinical trials, we must recognize
the data management issues that the nature of
pharmacogenomic data and the state of the science raise.
We have examined five key issues in this paper, and
shown how the SAS product suite, and in particular data
warehouses created with the SAS/Warehouse
Administrator, can provide robust, flexible solutions to
these problems. We hope this paper stimulates further
thought and discussion, for the revolution has just begun.
References
Burger, Thomas H. and Philip M. Pochon. 2000.
Techniques for Warehousing Object-Based Analyses.
Proc. of the 2000 Pharmaceutical SAS Users Group.
Seattle, WA.
Koprowski, S.P. and D.J. Fowler. 2000. Constructing a
Data Warehouse for Pharmacokinetic Data. Proc. of the
Twenty-Fifth Annual SAS Users Group International
Conference. Indianapolis, IN.
Pochon, Philip M. and Thomas H. Burger. 2000.
Warehousing, Metadata, and Object-Based Analysis.
Proc. of the Twenty-Fifth Annual SAS Users Group
International Conference. Indianapolis, IN.
SAS Institute Inc. 1998. SAS Rapid Warehousing
Methodology. SAS Institute White Paper. SAS Institute
Inc., Cary, NC.
SAS Institute Inc. 1999. SAS/Warehouse Administrator®
User’s Guide, Release 2.0, First Edition. SAS Institute
Inc., Cary, NC. 416 pp.
Welbrock, P.R. 1998. Strategic Data Warehousing
Principles Using SAS Software. SAS Institute Inc.,
Cary, NC. 384 pp.
Wright, Ken. 2000. New Features in the SAS/Warehouse
Administrator™. Proc. of the Twenty-Fifth Annual SAS
Users Group International Conference. Indianapolis,
IN.
Trademark Notice
SAS, SAS/Warehouse Administrator, SAS/EIS and SAS
Enterprise Reporter are registered trademarks of SAS
Institute Inc., Cary, NC, in the USA and other countries.
iBiomatics is a registered trademark of iBiomatics LLC,
Cary, NC.
Author Contact
Philip M. Pochon
Computer Task Group, Inc.
Castle Creek IV Suite 208
5875 Castle Creek Parkway
Indianapolis, IN 46250-4344
Phone (317) 578-5100
Julie Hiese
Covance CLS
8211 SciCor Drive
Indianapolis, IN 46214-2985
Phone (317) 273-4755