Generation of a Clinical Data Warehouse across Multiple Companies
Sarbjit Rai, Genentech, Inc., South San Francisco, CA

Abstract
Generation of a clinical data warehouse and the issues which can arise when collaborating across multiple companies: what a data warehouse is, why we need it, how it is put together, the software and standards used, and the common issues which arose and the solutions used to successfully generate the final product.

Introduction
The Clinical Data Warehouse project was a joint development project between three pharmaceutical and biotech companies from the United States (East and West Coast) and Europe. The aim of this data warehouse was to enable all three companies to store and share information and to integrate clinical data for submission purposes. Each company conducted its own clinical studies on drug X according to its own SOPs and company processes. Since each company had different clinical databases in terms of structure, content and format, a common data model was deemed necessary to enable integration of data for US and European submission filings. The common data model consisted of SAS datasets in a standard format agreed upon up front by all three companies.

Data Warehouse Charter
As one of its first tasks the team put together a data warehouse charter, a Word document used to identify what needed to be done, and how, in order to produce a successful data warehouse product. This charter was a live document which was regularly updated by team members as new ideas, rules, issues or timelines were discussed. All information regarding the structure and content of the data warehouse, and what it would and would not include, was documented in the charter. This document was constantly reviewed and referenced and proved very useful to the team during the development of the project.
This paper will focus on the key elements involved in the development of a clinical data warehouse, the issues and challenges that the team experienced during project development, the tools and processes used, and the lessons learned.

Team Members and Meetings
A data warehouse team was put together consisting of statisticians, programmers and data management staff from each company. At least one programmer and statistician was assigned from each company to work on this project and act as the primary company contact for all queries by the team. The team initially met at one site for a start-up meeting and then either weekly or monthly via teleconference. The initial face-to-face meeting was set up to allow everyone to meet the team, identify what data was to be collected, put together initial ideas of what the data warehouse would consist of in terms of structure and content, and set initial timelines for completion of the work. The team then met regularly to discuss project status, identify the team goals, review timelines for completion of specific tasks and identify any ongoing issues which needed to be resolved. Minutes were taken at each meeting which documented actions required and decisions made. Occasional video conferences and face-to-face meetings were also held once a specific development milestone had been met.

Data Warehouse Structure and Content
This data warehouse was essentially a web-based system which resided in one location (company X) but could be accessed by all team members across companies. It consisted of the following three components:
• Description and documentation
• Clinical data (SAS datasets)
• Data displays (listings, tables, figures)
The data warehouse hence consisted of SAS data and supporting documentation from studies conducted by all three companies, which could subsequently be accessed by all three companies and integrated for submission purposes.
The documentation essentially consisted of Excel spreadsheets and Word documents covering:
• The format and content of all the datasets in the common data model
• Supporting documents (protocols, annotated CRFs, statistical analysis plans) for each study
The clinical data consisted of the following three components:
• Company specific study datasets
• The integrated datasets
• The filing datasets
All data and documentation was stored in individual study directories within each company directory.
The data warehouse project itself essentially consisted of three phases:
• Development of the data warehouse
• Development of SAS programs and loading of test data to ensure the process actually worked
• Final programming and loading of real data into the data warehouse once studies were complete
The company specific datasets were the original SAS datasets generated by each individual company according to their own structure, format and SOPs, and used for individual study reporting. These datasets did not require any additional manipulation prior to their transfer into the data warehouse and were copied into the data warehouse (with supporting documentation) on completion of the study.
The integrated datasets were the new datasets which were mapped by each company according to the common data model to allow for integration of the data for submission purposes. Twenty-seven datasets in total (e.g. demographics, labs, medical history) were identified by the team to be mapped from individual study data and included in the common data model. This data consisted of all safety and efficacy data collected within the various studies, plus a large patient dataset (the PAT file) which contained mainly derived variables required for statistical analyses. Where possible, efforts were made to ensure there were minimal differences between the derivations for similar variables across studies and companies.
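The mapping of company-specific datasets into the common data model can be sketched as follows. This is an illustrative Python sketch only: the paper's actual work was done in SAS against the agreed Excel specifications, and the variable names and toy specification below (SUBJID to PATNUM, etc.) are invented examples, not the real study variables.

```python
# Illustrative sketch only: the real mapping was done in SAS against the
# agreed Excel specifications. Names below (SUBJID -> PATNUM, etc.) are
# hypothetical examples, not the actual study variables.

# One mapping spec per company dataset: common name -> (source name, transform)
DEMOG_SPEC = {
    "PATNUM": ("SUBJID", int),                  # subject identifier
    "AGE":    ("AGE_YRS", int),                 # age in years
    "SEX":    ("GENDER", lambda v: v.upper()),  # normalize coding
}

def map_to_common(record: dict, spec: dict) -> dict:
    """Map one company-specific record into the common data model layout."""
    out = {}
    for common_name, (source_name, transform) in spec.items():
        out[common_name] = transform(record[source_name])
    return out

company_record = {"SUBJID": "1001", "AGE_YRS": "54", "GENDER": "f"}
print(map_to_common(company_record, DEMOG_SPEC))
# {'PATNUM': 1001, 'AGE': 54, 'SEX': 'F'}
```

Because each company kept a spec of this shape per dataset, the same generic routine can serve all twenty-seven datasets; only the specification changes.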
The filing datasets were the final datasets generated by each company from the integrated datasets for submission purposes. These did not have to have a common structure and could be company specific.
The data displays consisted of listings, tables and graphs from the individual studies and submission files.

How the Team Worked
The main bulk of the work involved developing specifications for the twenty-seven datasets going into the common data model. Communication was done mainly via email and phone calls, with teleconferences being held bi-monthly. Initially programmers and statisticians had separate teleconferences to discuss their own specific data issues and develop their own datasets for the common data model (the three main efficacy datasets were developed by the statisticians and the remaining datasets were developed by the programmers). Later, joint meetings were held to discuss the status of the common data model and come to agreement on common issues regarding data format and content.

Common Data Model Structure
The specifications for each common data model dataset were documented in a separate Excel spreadsheet (e.g. one for DEMOG, one for LAB, etc). All Excel spreadsheets contained company specific information as well as the common data model information. The following information was initially documented in Excel by each company for each dataset:
• SAS variable name
• SAS label
• Derivation (if applicable)
The following specifications were then agreed upon by the team for the common data model:
• SAS variable name
• Type/length
• Format
• SAS label
The following standard template was used for all specifications: the common data model columns (SAS variable name, type/length, format, SAS label) followed by one set of name, label and derivation columns for each company (e.g. Genentech name/label/derivation, Company B name/label/derivation).
A number of core variables were agreed by the team to be included in all datasets. These included FDA required variables (study, center, patient, age, sex, race) and some additional variables deemed necessary by the team (e.g. start date of study medication). Where possible the team tried to map like variables with like across the three companies to save on space and ensure easy integration of the data. The core list included the following variables:

SAS Variable Name   Type/Length   SAS Label
STUDY               $17           Clinical Study
CENTER              $8            Center Number
PATNUM              8             Subject ID
AGE                 8             Age in years
SEX                 $6            Sex
RACE                $40           Race
EVALE               $3            Per Protocol Population
EVALS               $3            Safety Evaluable Population
EVALR               $3            ... Population
FSTDXDC             $9            Date of First Study Drug
TRTN                8             Treatment Group Number
TRTC                $80           Treatment Group Text Name

Common Data Model Format
In addition, the following formatting rules were used across all datasets for dates and times:

Variable name   Length   Remark
XxxDC           $9       Character date, format DDMMMYYYY
XxxTC           $8       Character time, format HH:MM:SS

Only the original date that was collected on the CRF was to be stored in these fields. For Genentech this meant stripping out any default days or months before including the date variable in the common data model, since at Genentech all missing days and months are defaulted to 15th June in ORACLE CLINICAL.
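The date rule above can be illustrated with a short sketch. The 15th-June default and the $9 DDMMMYYYY character format come from the paper; the function itself, and the imputed-part flags it takes, are a hypothetical Python illustration, not Genentech's actual code (which would have been SAS).

```python
# Hypothetical sketch of the rule described above: only the date actually
# collected on the CRF is stored in the $9 XxxDC field (format DDMMMYYYY).
# The imputation flags are an assumption; the real process ran in SAS.
from datetime import date

MONTHS = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
          "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]

def to_xxxdc(d: date, day_imputed: bool = False, month_imputed: bool = False) -> str:
    """Render a date as DDMMMYYYY, blanking any defaulted components."""
    day = "  " if day_imputed else f"{d.day:02d}"
    mon = "   " if month_imputed else MONTHS[d.month - 1]
    # A defaulted month implies the day was defaulted too (the 15th-June rule).
    if month_imputed:
        day = "  "
    return f"{day}{mon}{d.year:04d}"

print(to_xxxdc(date(2001, 3, 7)))                   # '07MAR2001'
print(to_xxxdc(date(2001, 6, 15), day_imputed=True,
               month_imputed=True))                 # '     2001'
```

The result is always nine characters, matching the $9 field length, with blanks standing in for components that were never collected on the CRF.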
Efforts were also made to come up with a standard naming convention for the variables, but it became clear as the datasets were being developed that it would be easier for the programmers from each company to use their original names where possible in at least some of the datasets, since most of the data in the common data model would be converted back to company specific names and formats before submission programming began, allowing each company to use their standard in-house code for submission programming. The naming conventions used across datasets were therefore not always consistent, since some datasets used Genentech variable names for the common data model and others used Company X or Company Y variable names.

Process for Completion of the Work
Each Excel specification document was started by a programmer in one company and then passed on to a programmer in another company via email to complete his or her company specific sections. Once all three companies had added their individual study specific information, the common data model for that particular dataset was generated. One person was assigned to develop the common data model for a dataset, and the work was divided equally between the three companies so that each programmer (or statistician) was responsible for generating XXX number of datasets. The other two programmers were then asked to review and comment if they had any queries regarding the common format. Updating the Excel documents by all three companies was also a challenge, since it involved passing Excel sheets from one company to the next via email and remembering who had the latest version.
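The review step described above, checking that a mapped dataset matches the agreed specification, could in principle be automated. The sketch below is an invented Python illustration of such a conformance check, using the $n character-length notation from the specifications; it is not part of the actual process, which relied on manual review against the Excel documents.

```python
# Illustrative sketch only: checking that a mapped record conforms to the
# agreed specification (variable present, correct type, within length).
# The spec entries and records below are invented examples.

# "$n" = character of length n (as in the SAS specs); plain "8" = numeric.
CORE_SPEC = {"PATNUM": "8", "SEX": "$6", "RACE": "$40"}

def check_record(record: dict, spec: dict) -> list:
    """Return a list of problems found in one record; empty means it conforms."""
    problems = []
    for var, length in spec.items():
        if var not in record:
            problems.append(f"missing variable {var}")
            continue
        value = record[var]
        if length.startswith("$"):      # character variable
            if not isinstance(value, str) or len(value) > int(length[1:]):
                problems.append(f"{var}: expected character of <= {length[1:]} chars")
        elif not isinstance(value, (int, float)):
            problems.append(f"{var}: expected numeric")
    return problems

good = {"PATNUM": 1001, "SEX": "F", "RACE": "ASIAN"}
bad  = {"PATNUM": "1001", "SEX": "FEMALE "}
print(check_record(good, CORE_SPEC))   # []
print(check_record(bad, CORE_SPEC))    # three problems reported
```

A check of this kind would have caught the most common mapping slips (wrong type, over-long text, dropped variables) before the email sign-off round.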
Once a specific dataset had been completed by all three companies, the primary programmer and statistician from each company was required to review, approve and sign off the specification via email. The final specification was then posted to the shareweb repository, and mapping, programming and testing of the actual process could begin.

Issues and Challenges
The main challenge which the team faced during development of this project was in determining the data warehouse structure and content and coming to agreement on the standard naming conventions, formats, labels and content of the twenty-seven SAS datasets. The aim was to minimize the programming work required by all three companies without impacting quality, and hence produce a viable end product that could be used by all three companies for exploratory analyses and submission purposes.
Each company had its own CRFs (case record forms) which collected data for each study in a particular format. In addition to the normal differences between studies which a programmer faces when trying to integrate data from many studies within one company, we were faced with the additional challenge of differences between the three companies in:
• Databases used (structure and content)
• Data dictionaries used (COSTART, WHOART, MedDRA)
• Programming standards, naming conventions, versions of SAS
• Formats, algorithms and derivations used for similar variables
Each company also had several clinical studies from phase I (small scale volunteer studies) to phase III (large scale pivotal trials) which were in various stages of development. Some of these studies were ongoing and others were already complete. For ongoing studies programmers were already familiar with the study design, content and programming, making the mapping to the common data model a fairly straightforward task. For studies that were already complete there were added complications in becoming familiar with an older study which the programmer may not have originally worked on; familiarizing themselves with the study design, formats, derivations and original programming rules sometimes required investigation and added a learning curve to an already complex task.
A decision also had to be made regarding which data dictionary we would use to code the medications for all studies going into the data warehouse, to ensure the data could be integrated if required. Numerous discussions were held regarding version control of the dictionaries used to ensure consistency across companies, and whether it would be easier to have one company code all the data or for each company to do its own.
The data warehouse was physically located within a shareweb system at one company only. Access was restricted to a specific group of people in statistics and data management dealing with the exchange and analysis of data (essentially the data warehouse team). Difficulties were initially faced by data warehouse team members in the other two companies in obtaining authorization and access to the system.
Working in three different time zones also made meetings and fast resolution of issues a challenge, since one person would be asleep or at home whilst another was working. However, this sometimes worked in our favor, since it meant different people could work on the same documents at different times and then pass them on to the next person in a logical fashion.

Lessons Learned
This project is still ongoing. Initial data specifications have been completed and testing and programming are now underway. Some of the lessons learned so far are documented below.
The importance of documenting the rules and algorithms used in study reporting (particularly for older studies) was clearly shown: missing documentation can hinder development at a later date if the data is required for future analyses or integrated reporting.
The importance of standards and good programming practices was also highlighted. In particular, having industry-wide standards for variable names, formats, labels, etc. would be helpful for future data warehouse projects across international companies. Efforts are already underway in the industry to collaborate with the FDA to come up with standards in this area; however, this in itself would be another topic to present on another date!
The development and maintenance of the data warehouse charter proved useful throughout this project, acting as a reference and reminding team members of what needed to be done and when.
Having initial face-to-face meetings allowed the team to meet one another and form bonds which provided a useful basis for teamwork and built team rapport, which can sometimes be difficult if you are only in touch via email. Also, having bi-monthly teleconferences ensured regular communication with team members to review status and upcoming timelines and to discuss any issues which could not be dealt with via email.
In hindsight, use of a standard naming convention for all SAS variables would probably have made this a more portable system which could then be sent to the FDA (or other external customers) if required.
Finally, planning up front and having team collaboration throughout was essential, and is important for the successful development and completion of any multi-company project.

Contact Information
Your comments and questions are valued and encouraged. Contact the author at:
Sarbjit Rai
Genentech Inc.
1 DNA Way
South San Francisco, CA 94080-4990
E-mail: [email protected]
Phone: (650) 225 4629