Warehousing Clinical Pharmacogenomics Data

Philip M. Pochon, Computer Task Group, Inc., Indianapolis, IN
Julie Heise, Covance CLS, Indianapolis, IN

Abstract

On June 26, 2000 Celera and the Human Genome Project jointly announced that the first reading of the human genome was complete. On Feb 13, 2001 the two presented the first full genotype maps of the genome. For the pharmaceutical industry there can be little doubt that pharmacogenomics will be an ongoing revolution; and as an emerging discipline, clinical pharmacogenomics provides immense opportunities in the areas of drug development and medical therapies.

In order to successfully integrate pharmacogenomics into clinical trials, we must recognize the data management issues raised by the nature of pharmacogenomic data and by the state of the science. These issues include:
• What types and forms of pharmacogenomic information must be managed?
• How should we handle changing data in an emerging discipline?
• How is pharmacogenomic information to be used to improve clinical trials?
• How should pharmacogenomic data be presented to clinical trial researchers?
• How can we meet evolving government regulations that define the patient’s genetic privacy rights?

The focus of this paper is not on resolving the above issues, but on how the SAS® suite of tools (in particular the SAS/Warehouse Administrator® tools) can be the means for developing robust data management solutions that support the requirements derived from these issues and from any future pharmacogenomics concerns.

Pharmacogenomics Overview

Pharmacogenomics at its broadest is the application of genetic information to determine the right dose of the right drug for the right patient. Pharmaceutical interest in genetic information has initially focused on two areas:
• Human genetic variability that affects metabolism of a drug.
• Viral genetic variability that affects resistance to a drug.

The goal is to take a sample population and group it into responders, non-responders and adverse responders (Figure 1).

[Figure 1. Labels: Patient Genotypes; Therapeutic Outcome; Responders]

This population sub-setting may be done prospectively (during patient screening), retrospectively (to explain why a study’s outcome was less than ideal) or over time (to monitor viral mutations that confer resistance to the study drug).

Pharmacogenomics starts with the DNA sequence for a gene (or sequences for several genes) and selects key locations that are likely to affect a patient’s response to a drug. Such a location is called a SNP (Single Nucleotide Polymorphism). The values of the key SNP (or SNPs) define the genotypes. Genotypes are then grouped into phenotype response groups (Figure 2).

[Figure 2. Labels: Sequence (A T T G G C); SNP 1; SNP 2; Genotypes (TG, TC, AC, AG, GA); Responder; Non-Responder; Adverse Responder]

Many genotypes of interest to pharmacogenomics are still undefined, or are still being defined. Independent research may add a SNP to the genotype at any time during or after the clinical trial. The basic units that organize pharmacogenomic data are therefore still changing. Having multiple analysis steps while definitions are still changing has three major implications for any pharmacogenomics data management system:
• All levels of pharmacogenomic information must be saved.
• The lower levels (raw machine data and sequences) must be stored in their original formats, so that reinterpretation is possible.
• Completeness checks and version control of pharmacogenomic information are critical.

Pharmacogenomic Data Types and Formats

DNA sequences are the core of pharmacogenomic data. Sequencing equipment usually provides a patient’s sequence in a vector form, together with a graph of the observed values from which the four-character genetic code (A, C, G, and T) is derived.
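The layering described in the overview (sequence, SNP values, genotype, response group) can be illustrated with a minimal Python sketch. It is not part of any SAS product: the SNP positions, the genotype-to-group table and every function and class name below are hypothetical and chosen only for illustration. The second definition adds a SNP, which shows why the stored sequence has to remain available for reinterpretation.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class GenotypeDefinition:
    """Hypothetical genotype definition: a name plus 0-based SNP positions."""
    name: str
    snp_positions: Tuple[int, ...]


def call_genotype(sequence: str, definition: GenotypeDefinition) -> str:
    """Concatenate the bases observed at each SNP position of the definition."""
    bases = []
    for pos in definition.snp_positions:
        base = sequence[pos].upper()
        if base not in ("A", "C", "G", "T"):
            raise ValueError(f"unexpected base {base!r} at position {pos}")
        bases.append(base)
    return "".join(bases)


# Illustrative genotype-to-response-group table; not taken from any real assay.
RESPONSE_GROUPS: Dict[str, str] = {
    "TG": "Responder",
    "AC": "Non-Responder",
    "GA": "Adverse Responder",
}


def classify(sequence: str, definition: GenotypeDefinition,
             groups: Dict[str, str] = RESPONSE_GROUPS) -> Tuple[str, str]:
    """Return (genotype, response group); unknown genotypes are 'Undefined'."""
    genotype = call_genotype(sequence, definition)
    return genotype, groups.get(genotype, "Undefined")


two_snp = GenotypeDefinition("example-v1", (1, 3))
three_snp = GenotypeDefinition("example-v2", (1, 3, 5))  # a SNP added later

stored_sequence = "ATTGGC"
print(classify(stored_sequence, two_snp))
print(classify(stored_sequence, three_snp))
```

Under the first definition the stored sequence yields genotype TG; under the second it yields TGC, which the illustrative table does not yet cover, so it is reported as undefined until the response groups are redefined.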
Patient genotype/phenotype definitions derived from the sequence may be in a structured data table, in a structured data document, or embedded in formatted reports or graphs. Any pharmacogenomics data management system must thus be able to handle:
• Structured data tables (SAS data sets)
• Structured data documents (XML)
• Unstructured documents (reports and graphs)

Because lab equipment generates a variety of file formats, graphs, reports, and documents, warehouses must be able to store unstructured data types. These are not only the “raw” data for interpretation, but also the basis for reinterpretation should the genotype definition change.

The SAS/Warehouse Administrator Release 2 is ideally suited to managing pharmacogenomics data. Conventional databases are table based and require that the data be aggregated within the database itself. A SAS data warehouse imposes no such restrictions. The key to a SAS data warehouse is the use of metadata, which is data about the data (see Pochon and Burger, 2000 for an in-depth discussion of metadata).

The Operational Data Definition (ODD) metadata layer of the SAS/Warehouse Administrator is designed to define and group various data types and storage modes. ODD metadata defines:
• the structure of inputs for the warehouse (Data Files and External Files) and the process by which they or their data will be brought into the warehouse
• the structure of Data Stores (Detail Data and Summary Data) within the warehouse

Load Steps can be created as metadata records within the SAS/Warehouse Administrator. A Load Step is a process that reads data from one or more Data Files or External Files, creates an instance of a Data Store and loads the target Data Store. Import and Export engines for pharmacogenomics data warehouses have been developed by the iBiomatics® subsidiary of the SAS Institute.

Data Stores may be within the warehouse structure itself (Figure 3), or reside outside the warehouse proper.
A SAS/Access® View of an RDBMS table can be referenced as easily as a SAS table. By explicitly defining the data type and the location of Data Stores, the administrator creates a map that defines not only where the information is located, but also how to read it.

[Figure 3. Warehouse Metadata - Project X. Labels: SAS Data set; XML Markup File; Sequence Graphs; Formatted Genotype Report]

The explicit grouping capabilities of the ODD layer are also critical in managing pharmacogenomic data (Figure 4). Pharmacogenomic data is layered data: sequences provide the basis for SNP definition, and SNP variability is interpreted to provide genotypes. Genotypes are in turn interpreted to define phenotypes.

[Figure 4. Pharmacogenomic Warehouse Metadata Layer. Labels: Project A ODD Group, Single SNP Genotype (Sequence Data, Sequence Graph, Genotype Report); Project B ODD Group, Multiple SNP Genotype (SNP1, SNP2 and SNP3 Sequence Data and Sequence Graphs, Genotype Data); Project AAA Folder (Sequence Data, Genotype Report (XML), Sequence Graph)]

The ODD grouping of the data provides a metadata-based organization of, and an audit trail for, the interpretation steps. The SAS/Warehouse Administrator automatically documents the data flows so that information can be traced from its source throughout the warehouse. The metadata thus provides an audit trail of which sequence produced which SNP(s), which SNP(s) produced which genotype, and which genotype produced which phenotype.

Changing Data in an Emerging Discipline

Pharmacogenomic interpretations may be done immediately, but they are often performed against databases on the web or at a contract research laboratory. The data group may thus be incomplete: the sequence has been determined but the interpretation is not yet available. A pharmacogenomic data warehouse must be able to track the completeness of the interpretation stream and maintain multiple versions of the interpretation.
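The two requirements just stated, tracking completeness across interpretation levels and keeping every version of an interpretation, can be sketched in a few lines of Python. This is an illustration only and does not reproduce SAS/Warehouse Administrator behavior; the level names, class, patient identifiers and dates are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List, Tuple

# Interpretation layers, lowest first (hypothetical names for illustration).
LEVELS = ("sequence", "snp", "genotype", "phenotype")


@dataclass
class PatientRecord:
    patient_id: str
    # level -> list of (effective date, value); the last entry is the current version
    versions: Dict[str, List[Tuple[date, str]]] = field(default_factory=dict)

    def add_version(self, level: str, value: str, effective: date) -> None:
        """Append a new version; earlier versions are never overwritten."""
        if level not in LEVELS:
            raise ValueError(f"unknown level: {level}")
        self.versions.setdefault(level, []).append((effective, value))

    def current(self, level: str):
        """Return the most recently added value for a level, or None."""
        history = self.versions.get(level)
        return history[-1][1] if history else None

    def missing_levels(self) -> List[str]:
        """List the levels for which nothing has been loaded yet."""
        return [lvl for lvl in LEVELS if not self.versions.get(lvl)]


def completeness_summary(records: List[PatientRecord]) -> Dict[str, int]:
    """Count, per level, how many patients have at least one version stored."""
    return {lvl: sum(1 for r in records if r.versions.get(lvl)) for lvl in LEVELS}


p1 = PatientRecord("P001")
p1.add_version("sequence", "ATTGGC", date(2001, 1, 15))
p1.add_version("snp", "T,G", date(2001, 1, 20))
p1.add_version("genotype", "TG", date(2001, 1, 20))
p1.add_version("genotype", "TGC", date(2001, 3, 1))  # reinterpretation after a SNP was added

p2 = PatientRecord("P002")
p2.add_version("sequence", "AATCGC", date(2001, 1, 16))  # interpretation still pending

print(p1.missing_levels(), p2.missing_levels())
print(completeness_summary([p1, p2]))
```

Keeping the history as an append-only list means the original genotype stays available beside the reinterpretation, each with its own effective date.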
Completeness checks can be implemented in SAS data warehouses by an Online Analytical Processing (OLAP) summarization process that tracks the data available at each level and summarizes it for the patient and for the project as a whole (see Wright, 2000 for a discussion of OLAP in the SAS/Warehouse Administrator Release 2).

When a genotype definition changes, a different problem arises. When a SNP is added, the sequence must be reanalyzed to produce a new genotype/phenotype. This new interpretation must be stored as a new version of the patient genotype, with a date stamp to indicate the effective date of the reinterpretation (Figure 5).

[Figure 5. Pharmacogenomics Data Warehouse Metadata Layer: Version Control. Labels: Sequence Data; Sequence Export Engine; Sequence File; Genotype Definition; Genotype Load Engine; Genotype Version 1; Genotype Version 2]

The need to reanalyze sequences requires that a pharmacogenomics data management system be able to store the sequence data in its original format, so that the data can be exported to the genotype analysis engine in analysis-ready form. When the new analysis is available, it must be imported as a new version of the genotype and linked to its sequence, with the old genotype remaining as the original version.

Integrating Pharmacogenomics into Clinical Trials

Pharmacogenomic sequences, SNPs, genotypes and phenotypes are results in and of themselves, and require concise data views and reports. But the real utility of pharmacogenomics in clinical trials lies in linking its results to other clinical trial data. In this view, genotypes and phenotypes serve as population markers (responder vs. non-responder, susceptible strain vs. resistant strain) rather than as results. The implication is that the ability to link the pharmacogenomics warehouse to other clinical trial warehouses is essential (see Koprowski and Fowler, 2000 for a similar situation with pharmacokinetic data).

One means of linking multiple sources of data is the Data Mart.
A Data Mart is a limited data warehouse or data group created within the metadata of the warehouse and designed to meet a specific need. Multiple Data Marts can be created for different needs. The processes that push warehouse data to a Data Mart select only the data elements or types needed, and can restructure tabular data for easy merging and use in reporting. The SAS/Enterprise Information System® (EIS) or SAS/Enterprise Reporter® can draw data from the Data Marts.

A second means is to link warehouses into a larger structure. The SAS/Warehouse Administrator allows a parent data warehouse to be constructed from child data warehouses. The parent warehouse metadata map need not point directly to Data Stores; it can point to ODD elements within the child warehouses’ metadata (Figure 6). Reports and data views that require only one class of data are based on the child warehouses. Reports and data views that must combine data are based upon the parent warehouse, which provides the directions to the child warehouses, which in turn provide the directions to the specific Data Stores.

[Figure 6. Labels: Parent Warehouse Metadata; Patient Demographics Metadata; Pharmacogenomics Warehouse Metadata; Clinical Chemistry Warehouse Metadata]

OLAP data stores can work directly with the SAS/EIS reporting tool, because SAS/Warehouse Administrator metadata can easily be exported and automatically registered in the SAS/EIS repository.

Pharmacogenomic Data in Clinical Trials

In the example shown in Figure 7, a patient screening report would draw its data from the parent warehouse, which would direct queries to a patient data store to retrieve patient demographics, to a pharmacogenomics warehouse to retrieve the genotype, and to a hematology warehouse to retrieve the biomarkers for the report. The multi-warehouse report shown in Figure 7 rests upon using the patient as the key identifier.
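The parent-to-child lookup behind such a report can be sketched in Python as follows. The sketch is not SAS/Warehouse Administrator code: the child warehouses are modeled as plain dictionaries keyed by patient number, and the parent map, column names and function name are hypothetical. The sample values echo those shown in Figure 7.

```python
from typing import Dict, List

# Child "warehouses" modeled as dictionaries keyed by patient number.
# Values echo Figure 7; the structure itself is an assumption for illustration.
demographics = {
    437218: {"sex": "F", "age": 37},
    437271: {"sex": "M", "age": 29},
    437303: {"sex": "F", "age": 24},
    437361: {"sex": "M", "age": 32},
}
pharmacogenomics = {
    437218: {"hiv_genotype": "Type B"},
    437271: {"hiv_genotype": "Type B"},
    437303: {"hiv_genotype": "Type A"},
    437361: {"hiv_genotype": "Type C"},
}
hematology = {
    437218: {"cd4": 483, "lymphocytes": 2.46},
    437271: {"cd4": 509, "lymphocytes": 2.68},
    437303: {"cd4": 958, "lymphocytes": 4.03},
    437361: {"cd4": 783, "lymphocytes": 3.79},
}

# Hypothetical parent map: report column -> child warehouse that holds it.
PARENT_MAP: Dict[str, dict] = {
    "sex": demographics,
    "age": demographics,
    "hiv_genotype": pharmacogenomics,
    "cd4": hematology,
    "lymphocytes": hematology,
}


def screening_report(patient_ids: List[int], columns: List[str],
                     parent_map: Dict[str, dict] = PARENT_MAP) -> List[dict]:
    """Build one row per patient by following the parent map to each child."""
    rows = []
    for pid in patient_ids:
        row = {"patient": pid}
        for col in columns:
            child = parent_map[col]
            # None marks a column for which the child warehouse has no data yet.
            row[col] = child.get(pid, {}).get(col)
        rows.append(row)
    return rows


report = screening_report(
    [437218, 437271, 437303, 437361],
    ["sex", "age", "hiv_genotype", "cd4", "lymphocytes"],
)
for row in report:
    print(row)
```

The report code never addresses a child warehouse directly; it asks the parent map where each column lives, which mirrors the indirection described above and makes the patient number the only join key.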
Figure 7: Multi-Warehouse Report

Patient Number   Sex   Age   HIV Genotype   CD4   Lymphocytes
437218           F     37    Type B         483   2.46
437271           M     29    Type B         509   2.68
437303           F     24    Type A         958   4.03
437361           M     32    Type C         783   3.79

Metadata-driven analysis and reporting engines (see Burger and Pochon, 2000) work particularly well with data warehouses, because the operational metadata that drives these engines is stored in the warehouse’s ODD.

The summarization capabilities designed into the SAS/Warehouse Administrator are a key feature for presenting pharmacogenomic data. In Release 2, summary data is stored in OLAP Tables or MDDBs. OLAP data stores may be organized into OLAP Groups.

Population profiles for genotypes are a natural fit with OLAP processing. Such profiles are particularly useful in AIDS trials, where virus mutation to a resistant strain must be monitored over time. The patient HIV genotypes over time are the detail data. Time, treatment, genotype and resistance are the dimensions, and the OLAP Table summarizes the population and its changes over time.

Patient Anonymity

A patient’s genetic information is protected by government regulations that protect the patient’s privacy (anonymity) and prohibit public disclosure of genetic information. Any process that draws data from a pharmacogenomics data warehouse must be able to provide or hide the patient’s identity.

The SAS/Warehouse Administrator Process Editor allows the creation of job streams that produce output data structures. The job stream can contain a User Exit process, which invokes a user-written routine. In the case of patient anonymity, this standard routine would check the level of required patient anonymity against the output data structure’s security metadata, and then either pass the patient identifier on or substitute an untraceable patient identifier from a randomized lookup table (Figure 8).
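The anonymity routine described for the User Exit (Figure 8) can be approximated by the following Python sketch. It is a stand-in under stated assumptions, not the routine itself: the class and parameter names are hypothetical, the authorization check is reduced to two boolean flags, and the randomized lookup table is held in memory, whereas a real system would have to store it securely.

```python
import secrets
from typing import Dict, List


class Anonymizer:
    """Hypothetical stand-in for the user-written routine called from a User Exit."""

    def __init__(self) -> None:
        self._lookup: Dict[int, str] = {}  # real patient id -> randomized id
        self._issued = set()               # randomized ids already handed out

    def _randomized(self, patient_id: int) -> str:
        """Return a stable random identifier that is not derived from the real one."""
        if patient_id not in self._lookup:
            token = secrets.token_hex(8)
            while token in self._issued:  # extremely unlikely, but keep ids unique
                token = secrets.token_hex(8)
            self._issued.add(token)
            self._lookup[patient_id] = token
        return self._lookup[patient_id]

    def apply(self, rows: List[dict], output_requires_anonymity: bool,
              user_is_authorized: bool) -> List[dict]:
        """Return copies of the rows with the identifier passed on or substituted."""
        hide = output_requires_anonymity or not user_is_authorized
        result = []
        for row in rows:
            new_row = dict(row)  # never modify the warehouse rows in place
            if hide:
                new_row["patient"] = self._randomized(row["patient"])
            result.append(new_row)
        return result


anonymizer = Anonymizer()
genotype_rows = [
    {"patient": 437218, "hiv_genotype": "Type B"},
    {"patient": 437303, "hiv_genotype": "Type A"},
    {"patient": 437218, "hiv_genotype": "Type B"},
]
hidden = anonymizer.apply(genotype_rows, output_requires_anonymity=True, user_is_authorized=True)
shown = anonymizer.apply(genotype_rows, output_requires_anonymity=False, user_is_authorized=True)
print(hidden)
print(shown)
```

The same patient always receives the same randomized identifier, so anonymized output can still be grouped by patient. The sketch replaces only the patient identifier column and does nothing about combinations of other fields that might identify an individual.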
[Figure 8. Labels: User; Check Authorization; Patient Genotypes, Retrieve Genotypes; Patient Identifiers, Retrieve Patient Identifiers; Randomized Identifiers, Retrieve Randomized Identifiers; Output Table]

The definition of what identifies a patient is not clear-cut. Certainly the Patient Number or a similar unique identifier falls within this definition. But certain combinations of data, such as investigator, patient initials and date of birth, may also be sufficient to identify an individual. This is another area where change is moving faster than industry and regulatory standards.

Summary

Pharmacogenomics holds great promise for drug development. In order to successfully integrate pharmacogenomics into clinical trials, we must recognize the data management issues raised by the nature of pharmacogenomic data and by the state of the science. We have examined five key issues in this paper, and shown how the SAS product suite, and in particular data warehouses created with the SAS/Warehouse Administrator, can provide robust, flexible solutions to these problems. We hope this paper stimulates further thought and discussion, for the revolution has just begun.

References

Burger, Thomas H. and Philip M. Pochon. 2000. Techniques for Warehousing Object-Based Analyses. Proc. of the 2000 Pharmaceutical SAS Users Group. Seattle, WA.
Koprowski, S.P. and Fowler, D.J. 2000. Constructing a Data Warehouse for Pharmacokinetic Data. Proc. of the Twenty-Fifth Annual SAS Users Group International Conference. Indianapolis, IN.
Pochon, Philip M. and Thomas H. Burger. 2000. Warehousing, Metadata, and Object-Based Analysis. Proc. of the Twenty-Fifth Annual SAS Users Group International Conference. Indianapolis, IN.
SAS Institute Inc. 1998. SAS Rapid Warehousing Methodology. SAS Institute White Paper. SAS Institute Inc., Cary, NC.
SAS Institute Inc. 1999. SAS/Warehouse Administrator® User’s Guide, Release 2.0, First Edition. SAS Institute Inc., Cary, NC. 416 pp.
Welbrock, P.R. 1998. Strategic Data Warehousing Principles Using SAS Software. SAS Institute Inc., Cary, NC. 384 pp.
Wright, Ken. 2000. New Features in the SAS/Warehouse Administrator™. Proc. of the Twenty-Fifth Annual SAS Users Group International Conference. Indianapolis, IN.

Trademark Notice

SAS, SAS/Warehouse Administrator, SAS/EIS and SAS/Enterprise Reporter are registered trademarks of SAS Institute Inc., Cary, NC, in the USA and other countries. iBiomatics is a registered trademark of iBiomatics LLC, Cary, NC.

Author Contact

Philip M. Pochon
Computer Task Group, Inc.
Castle Creek IV, Suite 208
5875 Castle Creek Parkway
Indianapolis, IN 46250-4344
Phone (317) 578-5100

Julie Hiese
Covance CLS
8211 SciCor Drive
Indianapolis, IN 46214-2985
Phone (317) 273-4755