Informatics Support of Data Management for Multi-Centric Clinical Studies: Integrating Clinical and Genetics/Genomic Data
Prakash M. Nadkarni, Kexin Sun and Cynthia Brandt

Scope and Definitions
Clinical research is increasingly concerned with the influence of inheritable traits on disease.
Genetics: the study of the organization, regulation, function and transmission of heritable information in organisms. (UniGuide Academic Guide to the Internet, www.aldea.com)
Genomics: investigations into the structure and function of very large numbers of genes, undertaken simultaneously. (UC Davis Genome Center, http://genomics.ucdavis.edu)
"The choice of phrase depends on your age group and your agenda." (Dr. Rochelle Long, NIGMS)

Types of Genomic Research
Structural genomics: an initial phase of genome analysis whose end-point is to yield high-resolution genetic and physical maps of an organism. The ultimate physical map of an organism is its complete DNA sequence. The sequence itself can never be known with complete precision, because parts of the sequence vary across individuals.
Functional genomics: the development and application of large-scale (genome-wide or system-wide) and/or high-throughput approaches to assess gene function, using information and reagents provided by structural genomics, combined with statistical/computational analysis of the results.

Variation in Sequence: Mutations and Polymorphisms
Some parts of an organism's sequence appear highly stable, while others show variation. A variation prevalent enough to occur in at least 1% of a population (not necessarily the human race as a whole) is called a polymorphism. The most widely studied polymorphisms are single nucleotide polymorphisms (SNPs), but repeats are also important. (Huntington's chorea is characterized by repeats in part of a gene: the more repeats, the greater the likely severity and the earlier the onset.)

Genotype and Phenotype (I)
An important aspect of functional genomics is that variation in sequence (genotype) leads to variation in function as expressed at the molecular, cellular, organ and system levels (phenotype). Phenotype is "the outward, physical manifestation of internally coded, inheritable information" (Blamire, http://www.brooklyn.cuny.edu/). "Correlation of genotype to phenotype" is one of the goals of several recent cooperative efforts, e.g., the Pharmacogenetics Research Network.

Correlating Genotype with Phenotype
Clinical studies that determine this correlation proceed in two directions.
Genotype to phenotype: screening large numbers of individuals at particular genetic loci (that encode known proteins) identifies particular variations, which may result in a variant end-product. These subjects are persuaded to enroll in studies where, along with controls, they are challenged with particular drugs; variations in response are then measured.

Correlating Genotype with Phenotype (II)
Phenotype to genotype: patients with particular clinical traits (e.g., poor response to standard therapies for specific conditions) are identified. These subjects are then screened at multiple candidate loci (genes known or suspected to be involved in that disease) to identify variations. For conditions such as heart disease or hypertension, hundreds of loci may have to be screened. One practical problem here is statistical significance: if a sufficiently large number of loci are screened, "significance" may be seen even with purely random data sets (see the sketch below).
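To make the multiple-testing problem concrete, here is a minimal Python sketch (ours, not from the talk; the locus count and thresholds are hypothetical). It screens purely random data at many loci: at a nominal p < 0.05 threshold, roughly 5% of loci look "significant" by chance alone, while a Bonferroni-corrected threshold of 0.05/N suppresses these false positives.

```python
# Minimal sketch (hypothetical, not from the talk): why screening many loci
# yields spurious "significance", and how a Bonferroni correction compensates.
import random

random.seed(42)
n_loci = 500   # candidate loci screened in a phenotype-to-genotype study
alpha = 0.05

# Under the null hypothesis (no true association at any locus),
# each locus's p-value is uniformly distributed on [0, 1].
p_values = [random.random() for _ in range(n_loci)]

naive_hits = sum(p < alpha for p in p_values)
bonferroni_hits = sum(p < alpha / n_loci for p in p_values)

print(f"Loci screened: {n_loci}")
print(f"Nominal p < {alpha} hits: {naive_hits} (expect ~{alpha * n_loci:.0f} by chance)")
print(f"Bonferroni p < {alpha}/{n_loci} hits: {bonferroni_hits}")
```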
Representing Genotype Computationally
Genotype is relatively straightforward to represent computationally, as variations from a consensus sequence: substitutions, insertions, deletions, or variations in repeats. Regarding the length of sequence to consider: rather than focusing on individual variants in isolation, it is preferable to consider a set of several such variants that are inherited as a unit (the haplotype). NCBI's dbSNP database supports both genotype and haplotype representation. (A sketch of such a representation follows this section.)

Human Phenotyping Studies and Repositories
"Phenotype" means different things to clinical researchers and to classical human or animal geneticists. To the latter, it has traditionally been a "syndrome" consisting of one or more detectable or visible traits. These days, it is more likely to be defined in terms of variation from the norm (for better or for worse), as characterized by clinical studies. The single most useful catalog of human variation is Online Mendelian Inheritance in Man (OMIM), maintained by Victor McKusick's team at Johns Hopkins and made accessible via NCBI's Web site.

Human Phenotyping Studies and Repositories (II)
The ultimate goal is to create national repositories of phenotypic data that are computable, in that they contain structured data. The purpose of storing "raw" data is to facilitate later mining of the data. OMIM is a text database and, despite its great value, has limited computability.

Challenges in Representing Phenotype
Phenotype is not a single entity: it is a set of parameters. The universe of parameters constituting "phenotype" is highly variable: function can be characterized at the molecular, organelle, cellular, organ-system or whole-organism level. The parameters are specific to the gene or genes being studied. Across all genes or genetic disorders of interest, the total number of parameters would range in the hundreds of thousands.

Creating Databases to Record Phenotype Characterization Studies
The problem of representing phenotypic data is very similar to that of representing clinical patient data in clinical patient record systems. A vast number of clinical parameters can potentially apply to a human subject, but for a given clinical study only a modest number of parameters actually apply. The same modeling approach, Entity-Attribute-Value (EAV), can be used. Historically it was first used in the TMR system (Stead and Hammond) and later in the HELP system at LDS Hospital, and it was put on a firm relational database footing by the Columbia-Presbyterian CDR efforts. (An EAV schema sketch also follows this section.)

Clinical Study Data Management Systems vs. CDRs (I)
In clinical study databases, clinical data gathering is not open-ended. It is typically segregated into events ("visits" at the outpatient level) whose schedule is determined by the study protocol. The parameters recorded at each event are determined in advance; for reasons of patient safety and economy, not all parameters are sampled at all events. In clinical studies, individual response to therapy is less important than how subjects react as a group. Relative time points based on events are therefore more important than the absolute date-time when an event occurred. (This impacts temporal querying of the data.)
[Slide figure: parameter report]

Clinical Study Data Management Systems vs. CDRs (II)
Certain areas, e.g., psychiatry, are characterized by extensive data gathering based on questionnaires. Most questionnaire items do not map to standard controlled vocabularies: each questionnaire in effect constitutes its own vocabulary.
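As a hedged illustration of the variant-based representation described under "Representing Genotype Computationally", here is a minimal Python sketch. The class and field names are our own invention, not dbSNP's actual data model; it records substitutions, insertions, deletions and repeat variations relative to a consensus sequence, and a haplotype as a set of variants inherited together.

```python
# Minimal sketch (hypothetical; not dbSNP's actual schema) of genotype
# represented as variations from a consensus sequence.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Variant:
    """One variation from the consensus sequence at a given position."""
    position: int          # 0-based offset into the consensus sequence
    kind: str              # "substitution" | "insertion" | "deletion" | "repeat"
    ref: str = ""          # consensus bases affected (empty for pure insertions)
    alt: str = ""          # observed bases (empty for pure deletions)
    repeat_count: int = 0  # for kind == "repeat": observed number of repeat units

@dataclass
class Haplotype:
    """A set of variants inherited together as a unit."""
    name: str
    variants: List[Variant] = field(default_factory=list)

# Example: a SNP plus a trinucleotide-repeat expansion on one haplotype.
hap = Haplotype(
    name="example-hap-1",
    variants=[
        Variant(position=1042, kind="substitution", ref="A", alt="G"),
        Variant(position=2310, kind="repeat", ref="CAG", repeat_count=42),
    ],
)
for v in hap.variants:
    print(v)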
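The EAV approach and the event-based organization described above can be sketched concretely. The following Python/SQLite snippet uses illustrative table and column names (not those of TMR, HELP or the Columbia-Presbyterian CDR): sparse clinical parameters are stored as one row per (patient, event, attribute) triple, and querying is by protocol-relative event rather than absolute date-time.

```python
# Minimal EAV sketch (illustrative schema; not the actual TMR/HELP/CDR design).
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE attribute (          -- the dictionary of possible parameters
    attr_id   INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    datatype  TEXT NOT NULL       -- e.g. 'number', 'text', 'date'
);
CREATE TABLE clinical_fact (      -- one row per recorded parameter value
    patient_id INTEGER NOT NULL,
    event_name TEXT    NOT NULL,  -- protocol event, e.g. 'baseline', 'week-4'
    attr_id    INTEGER NOT NULL REFERENCES attribute(attr_id),
    value      TEXT    NOT NULL
);
""")
con.executemany("INSERT INTO attribute VALUES (?, ?, ?)",
                [(1, "systolic_bp", "number"), (2, "adverse_event", "text")])
con.executemany("INSERT INTO clinical_fact VALUES (?, ?, ?, ?)",
                [(101, "baseline", 1, "142"),
                 (101, "week-4",   1, "128"),
                 (101, "week-4",   2, "mild headache")])

# Query by protocol-relative event rather than absolute date-time.
for row in con.execute("""
    SELECT f.patient_id, f.event_name, a.name, f.value
    FROM clinical_fact f JOIN attribute a USING (attr_id)
    WHERE f.event_name = 'week-4'"""):
    print(row)
```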
During data entry for questionnaires, extensive dependencies between individual parameters require support for "skip logic": certain parameters are disabled for entry based on the values entered for previous parameters (a sketch follows this section). In general, automatic generation of Web-enabled forms for robust data entry is a high priority, especially when numerous concurrent studies must be supported with modest human resources.

Clinical Study Data Management Systems vs. CDRs (III)
Often a set of parameters, rather than a single attribute, conveys meaningful information. E.g., to describe an adverse drug reaction:
- the nature of the reaction as described by the patient/clinician;
- the best match to a controlled vocabulary term;
- severity, usually "anchored" to a reference scale (e.g., the NCI common toxicity criteria) where possible;
- whether it responded to treatment;
- whether therapy needed to be stopped.
"Severity" is meaningless in isolation (severity of what?). (A sketch of such a parameter group also follows this section.)

Handling Genetic Data (I)
While the human genome has been sequenced, we don't know what the vast majority of the DNA does, and new genes are still being discovered. Traditional approaches, such as the collection of pedigree data and linkage analysis, still apply. For voluminous data such as mass-spectrometry data (proteomics), or even raw gene expression/microarray data, consider storing the data in its original format for the most part, with the database only tracking the location of the data files. Decomposing such data greatly increases its bulk, with questionable benefit for a stream of X-Y pairs, and many analytical programs have been created to operate on data in their original formats.

Handling Genetic Data (II)
The description of gene array experiments lends itself to attribute-value modeling approaches because, despite efforts to create controlled-vocabulary descriptors, many of the descriptors are specific to the research problem being studied. Certain summary results may be databased. Consider the use of display technologies like Scalable Vector Graphics (SVG) to generate interactive graphics:
- pedigree diagrams;
- summaries of polymorphism data for individual genes.
SVG is based on XML: the Web server generates a stream of XML that a plug-in interprets as a stream of drawing instructions to render a graphic. (A small SVG-generating sketch follows this section.)

Interchanging Data
Be prepared to bulk-import from a variety of formats (e.g., spreadsheets) to bootstrap the database from legacy data. XML is potentially attractive but must be used judiciously: remember that someone must write a program to put the data into XML format. For phenotypic data, creating an endless number of domain-specific tags achieves little beyond full employment for programmers; consider simple formats that are the direct counterpart of attribute-value structures, such as EDSP. Avoid highly nested structures that increase programming effort. XML should follow from the data model; it should not be used to define a data model (a lesson from the MicroArray and Gene Expression Data (MGED) group). (A flat attribute-value XML sketch closes this section.)
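As a hedged illustration of questionnaire "skip logic", here is a minimal Python sketch. All question names and rules are hypothetical; the point is only that later items are disabled based on the values entered for earlier ones.

```python
# Minimal skip-logic sketch (hypothetical questions and rules).
# Each rule disables a question when a predicate on earlier answers holds.

questions = ["smoker", "packs_per_day", "years_smoking", "exercise_hours"]

skip_rules = {
    "packs_per_day": lambda a: a.get("smoker") != "yes",
    "years_smoking": lambda a: a.get("smoker") != "yes",
}

def enabled(question, answers):
    """Return True if the question should be presented for entry."""
    rule = skip_rules.get(question)
    return rule is None or not rule(answers)

answers = {}
scripted = {"smoker": "no", "exercise_hours": "3"}  # simulated user input
for q in questions:
    if enabled(q, answers):
        answers[q] = scripted.get(q, "")
    else:
        print(f"(skipped {q})")
print(answers)
```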
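The adverse-drug-reaction example above can likewise be sketched as a composite record. The following Python dataclass uses illustrative field names of our own; it simply groups the parameters so that "severity" is never stored in isolation.

```python
# Illustrative grouping of adverse-drug-reaction parameters (hypothetical names).
from dataclasses import dataclass

@dataclass
class AdverseDrugReaction:
    description: str        # reaction as described by patient/clinician
    coded_term: str         # best match to a controlled vocabulary term
    severity_grade: int     # anchored to a reference scale, e.g. an NCI CTC grade
    responded_to_treatment: bool
    therapy_stopped: bool

adr = AdverseDrugReaction(
    description="persistent nausea after dosing",
    coded_term="Nausea",
    severity_grade=2,
    responded_to_treatment=True,
    therapy_stopped=False,
)
print(adr)
```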
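As a hedged sketch of server-side SVG generation (the drawing and its layout are invented for illustration, not taken from the talk), the snippet below emits a trivial SVG fragment for a two-generation pedigree, using the usual pedigree convention of squares for males and circles for females. A Web server would stream this XML to the browser for rendering.

```python
# Minimal SVG-generation sketch (hypothetical pedigree; real diagrams need layout logic).

def pedigree_svg():
    parts = ['<svg xmlns="http://www.w3.org/2000/svg" width="220" height="160">']
    # Pedigree convention: square = male, circle = female; filled = affected.
    parts.append('<rect x="40" y="20" width="30" height="30" fill="none" stroke="black"/>')   # father
    parts.append('<circle cx="165" cy="35" r="15" fill="none" stroke="black"/>')              # mother
    parts.append('<line x1="70" y1="35" x2="150" y2="35" stroke="black"/>')                   # mating line
    parts.append('<line x1="110" y1="35" x2="110" y2="90" stroke="black"/>')                  # descent line
    parts.append('<rect x="95" y="90" width="30" height="30" fill="black" stroke="black"/>')  # affected son
    parts.append('</svg>')
    return "\n".join(parts)

with open("pedigree.svg", "w") as f:
    f.write(pedigree_svg())
```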
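Finally, as a sketch of the "simple formats that are the direct counterpart of attribute-value structures" recommended above (element names are our own; this is not a published EDSP or MGED schema), the snippet below serializes EAV rows as flat XML rather than deeply nested domain-specific tags.

```python
# Flat attribute-value XML sketch (illustrative element names; not a published schema).
import xml.etree.ElementTree as ET

rows = [  # (patient_id, event, attribute, value) -- EAV triples as in the earlier sketch
    (101, "baseline", "systolic_bp", "142"),
    (101, "week-4", "systolic_bp", "128"),
    (101, "week-4", "adverse_event", "mild headache"),
]

root = ET.Element("facts")
for patient_id, event, attribute, value in rows:
    fact = ET.SubElement(root, "fact",
                         patient=str(patient_id), event=event, attribute=attribute)
    fact.text = value

ET.indent(root)  # pretty-print; requires Python 3.9+
print(ET.tostring(root, encoding="unicode"))
```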