TOXICOLOGICAL SCIENCES 109(2), 358–371 (2009)
doi:10.1093/toxsci/kfp061
Advance Access publication March 30, 2009

Toward a Public Toxicogenomics Capability for Supporting Predictive Toxicology: Survey of Current Resources and Chemical Indexing of Experiments in GEO and ArrayExpress

ClarLynda R. Williams-Devane,* Maritja A. Wolf,† and Ann M. Richard‡,1

*U.S. EPA/Office of Research and Development (ORD)/National Health & Environmental Effects Research Laboratory (NHEERL), Research Triangle Park, NC 27519; †Lockheed Martin (Contractor to U.S. EPA), Research Triangle Park, NC 27519; and ‡U.S. EPA/Office of Research and Development (ORD)/National Center for Computational Toxicology (NCCT), Research Triangle Park, NC 27519

Received January 18, 2009; accepted March 23, 2009

A publicly available toxicogenomics capability for supporting predictive toxicology and meta-analysis depends on availability of gene expression data for chemical treatment scenarios, the ability to locate and aggregate such information by chemical, and broad data coverage within chemical, genomics, and toxicological information domains. This capability also depends on common genomics standards, protocol description, and functional linkages of diverse public Internet data resources. We present a survey of public genomics resources from these vantage points and conclude that, despite progress in many areas, the current state of the majority of public microarray databases is inadequate for supporting these objectives, particularly with regard to chemical indexing. To begin to address these inadequacies, we focus chemical annotation efforts on experimental content contained in the two primary public genomic resources: ArrayExpress and Gene Expression Omnibus. Automated scripts and extensive manual review were employed to transform free-text experiment descriptions into a standardized, chemically indexed inventory of experiments in both resources.
These files, which include top-level summary annotations, allow for identification of current chemical-associated experimental content, as well as chemical-exposure–related (or “Treatment”) content of greatest potential value to toxicogenomics investigation. With these chemical-index files, it is possible for the first time to assess the breadth and overlap of chemical study space represented in these databases, and to begin to assess the sufficiency of data with shared protocols for chemical similarity inferences. Chemical indexing of public genomics databases is a first important step toward integrating chemical, toxicological and genomics data into predictive toxicology.

Key Words: microarray; chemical; toxicogenomics; toxicity; prediction.

Disclaimer: This manuscript was approved by the U.S. EPA’s National Center for Computational Toxicology for publication. However, the contents do not necessarily reflect the views and policies of the EPA and mention of trade names or commercial products does not constitute endorsement or recommendation for use. Each of the authors declares no competing interests pertaining to the present work.

1 To whom correspondence should be addressed at Mail Drop D343-03, 109 TW Alexander Dr., U.S. Environmental Protection Agency, Research Triangle Park, NC 27711. Fax: (919) 685-3263. E-mail: [email protected].

Published by Oxford University Press 2009.

Conventional toxicology investigates cellular and animal responses to chemical treatment through domain-specific bioassay studies (e.g., chronic, developmental), typically mapping a single chemical to a toxicological endpoint. Microarray technologies, in contrast, detect genome-wide perturbations resulting from a chemical treatment, and measure response variables that probe a large number of genes and gene pathways potentially underlying multiple toxicological endpoints.
A typical toxicogenomics experiment requires that linkages be established between these technologies, focusing on treatment-related effects of one or a few chemicals and attempting to relate gene expression changes to a toxicological endpoint (Gomase et al., 2008; Hamadeh et al., 2002; Hirabayashi and Inoue, 2002). In silico toxicogenomic meta-analysis methods combine data across existing toxicological and gene expression experiments to generate new hypotheses, and to confirm existing hypotheses, about the effect of a compound treatment. Such a capability depends upon the availability of gene expression data derived from chemical treatment scenarios, as well as anchoring toxicology data to support predictive inferences. The chemical nature of the problem requires a standardized, chemical-centric view of data at all levels. Hence, a publicly available toxicogenomics capability sufficiently robust for mechanistic inferences and building predictive models requires not only common data standards, protocols, and the ability to query and aggregate common data types across resources, but also broad data coverage within, and linkages across, chemical, genomics and toxicological information domains. These requirements have, to varying degrees, informed development of the major public microarray databases, and have been the central design principle of specialized toxicogenomic resources (Waters et al., 2008). In recent years, there have also been significant advances in promoting toxicology standards and data models (i.e., controlled vocabulary and hierarchical data organization), quantitative high-throughput screening, and chemically indexed bioassay data that, taken as a whole, have the potential to greatly enhance toxicogenomics capabilities in the public domain (Dix et al., 2007; Martin et al., 2009; Richard et al., 2008; Yang et al., 2006a, 2006b).
In the genomics field, the two largest public resources for deposition of microarray data, approved by the Microarray Gene Expression Data (MGED) Society (http://www.mged.org/), are the European Bioinformatics Institute’s (EBI) ArrayExpress (http://www.ebi.ac.uk/arrayexpress) and the National Center for Biotechnology Information’s (NCBI) Gene Expression Omnibus (GEO) (www.ncbi.nlm.nih.gov/geo). Publishing requirements for the deposition of raw or processed microarray data into these database repositories, coupled with MIAME (Minimum Information About a Microarray Experiment) standards for data reporting, are increasing the comparability, utility and breadth of these resources (Ball et al., 2004). Enhanced external programmatic access to the major public microarray data repositories also allows third parties to automatically extract and reformulate data to enhance informatics and data mining capabilities (Boyle, 2005; Ivliev et al., 2008; Zhu et al., 2008). Additional public efforts are aimed at standardizing the description of experimental protocols (Taylor et al., 2008), as well as improving toxicity data standards in relation to toxicogenomics experiments (Burgoon, 2007; Fostel, 2008; Fostel et al., 2005, 2007). Largely neglected in the genomics field, however, has been the standardization of chemical information associated with the experimental data when chemical treatment is a primary objective of the experiment. Such annotation is essential for systematically relating chemical property and effects information, irrespective of whether the study has an explicit toxicological focus, across the diverse data domains potentially contributing to toxicogenomics. Furthermore, the ability to query, relate, and aggregate information by chemical and across chemical space is essential to the goal of chemical screening and toxicity assessment (Dix et al., 2007; Richard et al., 2008; Yang et al., 2008).
In the remainder of this paper, we broadly survey the current state of public microarray resources from the above vantage points, focusing particularly on the two primary resources, ArrayExpress and GEO. Although these two resources are not explicitly designed to meet the needs of the toxicogenomics community, they currently serve as the two largest public microarray data repositories of potential toxicogenomic relevance and, as such, are potentially valuable sources of data for toxicogenomics study. Despite progress in many areas, we find the present state of public microarray repositories inadequate for supporting interoperability and linkages across diverse data domains in support of toxicogenomics. Particularly noteworthy is the lack of minimal chemical annotation and, as a result, the effective isolation of these resources and associated data from the growing inventories of chemically indexed bioassay information of potential relevance to toxicology (Richard et al., 2006, 2008). To begin to address these inadequacies, we propose and implement a set of standard genomic fields for indexing of experimental study records, aligned with current MIAME guidelines, that enables cross-referencing and comparison of total experimental content in GEO and ArrayExpress. In addition, we implement a set of established chemical standards for labeling experiments contained within ArrayExpress and GEO, in collaboration with the U.S. Environmental Protection Agency’s (EPA) Distributed Structure-Searchable Toxicity Database (DSSTox) project. We briefly describe the process of annotation and creation of public-distribution DSSTox chemical-index files for both GEO Series and ArrayExpress Repository. These files enable, for the first time, assessment of the chemical scope, diversity, and coverage of experimental content within GEO and ArrayExpress of potential use for toxicogenomics study.
METHODS

For the purpose of assessing the relevance of public microarray resources to toxicogenomics and predictive toxicology, we considered the current annotation of experimental content pertaining to chemical treatment scenarios, that is, cases in which study of gene expression changes induced by chemical treatment constituted the primary goal of the experiment. As a measure of interoperability between data resources, we examined the standardization of terminology and data accessibility, as well as the formatting of data, paying particular attention to specification of experimental protocols, such as animal/tissue/cell treatment, RNA extraction, microarray preparation, data import/export, and analysis. As a measure of chemical indexing, we examined the degree of standardization and annotation pertaining to chemical-associated experiments across public genomics resources and, particularly, whether the chemical information was formally indexed, that is, contained in a separate, searchable field.

For the purpose of chemically indexing experimental content in ArrayExpress and GEO, we broadly define a chemical-exposure (or “Treatment”) microarray experiment as a study in which the cells, tissues, or whole organisms were treated with a defined chemical, chemical mixture, or natural substance (including biologics and proteins), RNA was extracted, and gene expression changes resulting from this treatment were investigated with microarray technologies. Whether the chemical to which the system was exposed is a known toxicant, potential toxicant, natural substance, or therapeutic agent need not be distinguished because the measured outcome is the same, that is, treatment-related gene expression changes.
However, experiments in which chemical treatment was secondary to the primary purpose of the experiment (e.g., treatment with prophylactic antibiotics for maintaining tissue culture conditions) or where study of chemical-exposure–induced effects was not the primary purpose of the experiment (e.g., treatment with streptozocin to induce diabetes mellitus for investigating the effects of diabetes) required further annotation and review. These cases of chemical-experiment associations were labeled by us to indicate the role of the chemical as other than “Treatment.”

For initial inventory purposes, extraction of experimental description fields, and locating chemical-associated experiments within ArrayExpress and GEO, we used available web search options and programmatic access tools within each system, as well as extensive manual review (a workflow diagram is provided in Supplemental Fig. 1; additional details of the methodology employed here are publicly available, see Acknowledgments). For the present study, GEO Series provides the most complete inventory of current experiments within GEO and these are also most closely aligned with ArrayExpress Repository experiments. Hence, ArrayExpress Repository and GEO Series experiments were the focus of the present chemical-indexing efforts.

ArrayExpress. Due to its large size (over 6300 experiments at the time of data extraction), limited and unstructured chemical annotation, and dynamic content (updated regularly with new experiments), the review and annotation process for ArrayExpress involved several iterative steps for identification and characterization of chemical treatment experiments within the main database repository. Initially, a bulk download of all data housed in the repository from the main web site (http://www.ebi.ac.uk/arrayexpress) was undertaken with a wildcard query in the accession number query box (i.e., to retrieve all experiments).
The resulting records were individually reviewed and a preliminary index of chemical information and <Indications of a Chemical Exposure Record> was constructed. The latter field included any detail deemed as potentially useful for discerning whether a record pertained to a chemical treatment experiment, for example, designations in the ArrayExpress <Experimental_Type> field, such as “dose,” “treatment,” etc. This preliminary chemical index was used to identify chemical-associated and chemical “treatment” experiments, to infer the minimum information necessary to identify such records from within ArrayExpress, and to build and test an automated indexing capability using custom Perl scripts (http://www.perl.com). Through an iterative process, scripts were refined to achieve better success at detecting “true” chemical treatment experiments, verified by manual review according to our definition above. Perl scripts for keyword text searching and filtering were combined with manual curation methods to construct ArrayExpress Repository chemical-index files from website content downloaded on September 20, 2008.

Gene Expression Omnibus. GEO contains user-deposited dynamic content, and limited and unstructured chemical annotation. Hence, a manual method similar to that employed in the review of ArrayExpress was initially required. All data were downloaded from the GEO homepage in the GSE Series format. Each of the GEO Series was manually reviewed for chemical content and this information was used to construct an index of the chemical information and <Indications of a Chemical Exposure Record>. As in ArrayExpress, the <Indications of a Chemical Exposure Record> field contained details to aid in discerning whether a record pertained to a chemical treatment experiment. From this chemical-experiment index, the first chemical annotation of GEO was completed.
Similar to the annotation of ArrayExpress, this manually curated chemical index was used to test and refine automated curation approaches that were applied to subsequent versions of the GEO Series inventory. Several automated methods were developed using NCBI Entrez Programming Utilities (E-Utilities) (http://www.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html), an XML version of the U.S. National Library of Medicine’s (NLM) chemical Medical Subject Headings (MeSH) library (http://www.ncbi.nlm.nih.gov/sites/entrez?db=mesh), and a series of custom Perl scripts to parse through a complete XML version of the GEO Series database. The chemical index of GEO Series was completed using a series of Perl scripts that call on E-Utilities, combined with manual curation methods, and was based on content downloaded on September 20, 2008.

Chemical index. The main result of the above process was to produce a static, preliminary chemical index for all chemical-associated microarray experiments in ArrayExpress and GEO, in which the subset of chemical “treatment” experiments was identified. These preliminary index files took the form of a list of minimal chemical identifiers (most often chemical names only) directly extracted from the user-deposited information in these two resources. These chemical-experiment index files subsequently underwent a rigorous cleanup and chemical quality review, using source (submitter)–provided chemical information and contextual text descriptions to definitively identify the chemical substance and its relationship to the experiment (e.g., treatment, vehicle, reference).
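As a rough illustration of the keyword text searching and filtering described above, the following Python sketch flags free-text experiment descriptions that mention both a treatment-related keyword and a name from a chemical vocabulary. It is not the authors' Perl code. The keyword list, the small vocabulary (standing in for something like the MeSH chemical subset), the accession labels, and the record layout are all hypothetical, and any flagged record would still need the manual review of chemical role that the text describes.

```python
import re

# Hypothetical keywords suggesting a chemical-exposure experiment.
TREATMENT_KEYWORDS = ("treated", "treatment", "exposed", "exposure", "dose")

# Hypothetical stand-in for a chemical vocabulary such as the MeSH chemical subset.
CHEMICAL_VOCABULARY = ("phenobarbital", "cobalt chloride", "streptozocin")


def find_terms(text, terms):
    """Return the terms that occur in text as whole words, case-insensitively."""
    found = []
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(term)
    return found


def flag_record(accession, description):
    """Build a preliminary index entry for one experiment description."""
    chemicals = find_terms(description, CHEMICAL_VOCABULARY)
    indications = find_terms(description, TREATMENT_KEYWORDS)
    return {
        "accession": accession,
        "chemicals": chemicals,
        "indications": indications,
        # A candidate only; the chemical's role still needs manual review.
        "candidate_treatment": bool(chemicals and indications),
    }


# Made-up example records (accession labels are illustrative, not real IDs).
records = [
    ("EXP-1", "Rat liver after Phenobarbital treatment at three dose levels"),
    ("EXP-2", "Comparison of wild-type and knockout mouse brain"),
]
index = [flag_record(acc, desc) for acc, desc in records]
```

A filter of this kind only narrows the set of records to review. It cannot tell a true "Treatment" chemical from one used as a vehicle or to induce a disease state, which is why the role is left for curation.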
The generally poor quality and consistency of chemical information contained in ArrayExpress and GEO submitter-supplied description fields, the high frequency of abbreviations and spelling errors, and the lack of chemical identifiers such as Chemical Abstracts Service Registry Numbers (CASRN; http://www.cas.org/) or EBI’s Chemicals of Biological Interest (ChEBI; http://www.ebi.ac.uk/chebi) identifiers, all prevented greater application of automated text-mining and chemical name-to-structure conversion capabilities. In addition, the need to accurately discern the role of the chemical in the experiment (i.e., treatment, etc.) from the free-text description prevented use of more efficient automated methods. DSSTox Standard Chemical Fields were assigned to the chemical-index files according to established procedures (http://www.epa.gov/ncct/dsstox/ChemicalInfQAProcedures.html). These fields allow for standardized representation of both the test substance (“TestSubstance” fields) and the chemical structure (“STRUCTURE” fields) in relation to any chemical-associated experiment record. DSSTox Standard Chemical Fields include chemical name, CASRN (if available), and test substance description (e.g., single chemical compound, macromolecule, mixture or formulation, etc.). Where the test substance is not overly large (i.e., does not exceed 1800 amu) and can be reasonably represented by a single molecular structure, “STRUCTURE” fields are provided.
These include a public standard “molfile” of the chemical structure (a two-dimensional projection of the three-dimensional structure) assigned to the substance, several fields automatically derived from the “molfile” structure (i.e., molecular weight, formula, IUPAC name, SMILES, SMILES_Parent, InChI, InChIKey), chemical type (i.e., defined organic, inorganic, organometallic), and a field indicating the relationship of the STRUCTURE to the TestSubstance (i.e., tested chemical, representative isomer in mixture, active ingredient in a formulation, etc.) (for more information, see http://www.epa.gov/ncct/dsstox/MoreonStandardChemFields.html). Assessment of chemical overlap between GEO and ArrayExpress DSSTox chemical-index files was determined on the basis of DSSTox TestSubstance identifiers.

RESULTS

Over 42 public Internet resources housing microarray data of potential toxicogenomics relevance were initially identified from various categories (Microarray World list of databases, http://www.microarrayworld.com/DatabasePage.html). From this list, we identified eight resources containing chemical-exposure–related content, and divided these into two categories: primary and secondary genomics resources. Primary genomics resources consist of the three MIAME-supportive, MGED-approved gene expression repositories: NCBI’s GEO, EBI’s ArrayExpress, and the Center for Information Biology Gene Expression (CIBEX) database (see Table 1 for listing of Sources, URLs, and references). Secondary genomics resources consist of five additional public genomics resources of potential toxicogenomics relevance that contain data gathered from chemical-exposure experiments in one or more laboratories (see Table 1 for listing of Sources, URLs, and references). A selection of public cheminformatics resources potentially useful for supporting a public toxicogenomics capability is listed in Supplemental Table 1.
A brief description of survey results is given for each data resource below, followed by chemical-indexing results for the two major resources, ArrayExpress and GEO.

TABLE 1
Primary and Secondary Genomics Data Resources with Content of Potential Use for Toxicogenomics

Database | Source/URL | References | Public data deposition | Programmatic access
Primary genomic resources
ArrayExpress | European Bioinformatics Institute (EBI); www.ebi.ac.uk/microarray-as/ae/ | Ball et al., 2004; Brazma et al., 2006; Parkinson et al., 2007; Rustici et al., 2008 | Yes | Yes (XML)
GEO | National Center for Biotechnology Information (NCBI), National Institutes of Health; www.ncbi.nlm.nih.gov/geo | Barrett and Edgar, 2006; Barrett et al., 2007; Wheeler et al., 2008 | Yes | Yes (E-Utilities)
CIBEX | DNA Data Bank of Japan (DDBJ), National Institute of Genetics; http://cibex.nig.ac.jp/ | Tateno and Ikeo, 2004 | Yes | No
Secondary genomic resources
EDGE | McArdle Laboratory for Cancer Research, University of Wisconsin-Madison; http://edge.oncology.wisc.edu/edge3.php | Hayes et al., 2005 | No | No
CEBS | National Institute of Environmental Health Sciences (NIEHS); http://cebs.niehs.nih.gov/cebs-browser/ | Fostel et al., 2005; Waters et al., 2003, 2008 | Yes | No
PEPR | Center for Genetic Medicine Research; http://pepr.cnmcresearch.org/ | Chen et al., 2004 | No | No
dbZach | Department of Biochemistry & Molecular Biology, Michigan State University; http://dbzach.fst.msu.edu | Burgoon et al., 2006 | No | No
CTD | Mount Desert Island Biological Laboratory; http://ctd.mdibl.org | Davis et al., 2009 | No | No

ArrayExpress Repository

ArrayExpress is the largest user-depositor data repository and MIAME-supportive public archive of microarray data in Europe, consisting of two parts: the ArrayExpress Repository and the ArrayExpress Data Warehouse (Table 1). The ArrayExpress Repository currently exceeds 6900 experiments, and is indexed by Experiment, Array Design, and Protocol. Experiments can be queried by Keyword, Experimental Accession Number, Species, Experiment Type and Factors, Author, Laboratory, and Publication information (http://www.ebi.ac.uk/microarray-as/aer/entry). Repository data are cataloged, assessed for completeness, and assigned a MIAME score that represents the degree of MIAME compliance.
The ArrayExpress Data Warehouse is based on more limited processed data results from the ArrayExpress Repository, currently contains 740 Expression Profiles (website accessed on November 14, 2008), and allows users to browse curated datasets from both a gene- and/or experiment-centric view. ArrayExpress has incorporated significant experimental content from GEO, which can be located within ArrayExpress by GEO Accession Identifiers. At the time of survey, ArrayExpress was not chemically indexed, nor did it contain additional information about the chemical tested other than the infrequently provided CASRN or ChEBI number. Chemical information may be located in the user-supplied protocols and free-text experimental description, or can be searched with the advanced query tools from ArrayExpress, including a keyword or text search in the Description field in “Query for Experiments.” These can also be combined with specifications of <Experiment type> = “compound treatment” or “dose response,” but these latter annotations are optionally utilized and not consistently applied by depositors to all chemical treatment experiments in the database. Chemical information can also be embedded within the ArrayExpress Sample-Data Relationship File (http://tab2mage.sourceforge.net/docs/sdrf.html). In 2002, ArrayExpress introduced the Tox-MIAMExpress data entry method, optionally employed by data submitters to store toxicogenomics data in an effective manner (http://www.ebi.ac.uk/miamexpress/).
Tox-MIAMExpress was later discontinued; however, the ArrayExpress Accession Number Code, TOXM, designated to identify experiments for this purpose, is still available for use when requested by data submitters. Currently, the optional TOXM label is assigned to fewer than 25 experiments, but in these cases, typically more chemical identifier information, such as a CASRN and/or a ChEBI number, is provided by the submitter along with additional information recommended by the MIAME/Tox initiative (http://www.ebi.ac.uk/tox-miamexpress).

Gene Expression Omnibus

GEO is the largest user-depositor data repository and MIAME-supportive public archive of microarray data in the U.S. (Table 1), containing data from approximately 10,000 experiments at the time of this writing. In GEO, raw and/or processed data can be exported through the ftp website as well as through the main GEO Series website. User information, however, is entered using a free-text format that is subsequently curated. GEO allows for a wide range of informed queries with the Preview/Index window, where users can select data based on choices for each attribute of the experiment. The GEO repository has three key components: “Platform,” “Sample,” and “Series.” “Platform” provides a description of the array used in the experiment, as well as a data table defining the array template. “Sample” provides a description of the biological source and the experimental protocols, along with a data table containing hybridization measurements for each element of the corresponding platform. “Series” defines a set of related samples considered to be part of a study and describes the overall study aim and design. GEO has a complex, hierarchical structure that works with the NCBI E-Utilities, allowing one to query by submitter, organism, platform, sample type, sample titles, and release date.
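The E-Utilities access mentioned above can be sketched as a query URL for the ESearch utility. The Python sketch below only builds the URL and makes no network request. The `gds` database name and the `gse[Entry Type]` and `[Organism]` field tags follow NCBI conventions as best understood today and are assumptions relative to this paper, as are the helper name `build_geo_series_query` and the example search terms.

```python
from urllib.parse import urlencode

# Base address of the NCBI ESearch utility.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_geo_series_query(keyword, organism=None, retmax=20):
    """Return an ESearch URL for GEO Series records matching a keyword."""
    # Restrict results to Series (GSE) entries; the field tags are assumptions.
    term_parts = ["gse[Entry Type]", keyword]
    if organism:
        term_parts.append(f"{organism}[Organism]")
    params = {
        "db": "gds",  # GEO DataSets database (assumed name)
        "term": " AND ".join(term_parts),
        "retmax": retmax,
    }
    return ESEARCH_URL + "?" + urlencode(params)


url = build_geo_series_query("phenobarbital", organism="Rattus norvegicus")
```

Because chemical names are not held in a dedicated field, a free-text keyword such as the one above matches wherever the name happens to appear in a record, with whatever spelling the submitter used.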
Similar to ArrayExpress, GEO hosts a smaller warehouse-type addition named “GEO Datasets and Profiles” containing processed, curated datasets that can be explored from both a gene- and/or experiment-centric view. Also, similar to ArrayExpress, GEO is not chemically indexed nor does it consistently contain information about the chemical tested. Chemical names may be located in the submitter-deposited GEO Data Series fields (Title, Summary, Citation, or Samples) and are not consistently present in any single field. Chemical names are provided by the submitter, are rarely accompanied by CASRN or ChEBI identifiers, and do not undergo curation or review. Hence, as is the case with ArrayExpress, there is no easy or reliable way to identify a chemical-exposure–related experiment, there is no central listing of chemical content and, in both resources, we find that the chemical names embedded within user-deposited description fields are highly variable, prone to errors and misspellings, and frequently incorporate nonstandard abbreviations.

Center for Information Biology Gene Expression

The CIBEX database is a Japanese gene expression MIAME-supportive, MGED-approved user-depositor system (Table 1) that primarily serves experimenters from Asian countries. It is included for the sake of completeness, but currently does not contain significant chemical treatment content. However, the experimental protocol and detail standardization are noteworthy, with each record accompanied by a document containing full MIAME details. There is also a high level of curation and collaboration between CIBEX administrators and depositors that allows for missing information to be identified before publication, as well as for a high level of standardization and accuracy.
At the time of this writing, CIBEX contains 32 experiments, only one of which (CBX14) is a chemical-exposure experiment, clearly labeled in the <Experiment Design Type> field as “compound_treatment_design.” Despite the high degree of standardization of this resource, however, there is currently no formal chemical annotation field accompanying a chemical treatment experiment.

Environment, Drug, Gene Expression Database

The EDGE database (Table 1) is a closed (i.e., not open to public user-deposits of data), curated system designed for the comparison, analysis and distribution of toxicogenomics information in a relational format. EDGE is chemical treatment centric and chemically indexed, with a toxicological focus. All experiments were performed in the Bradfield Laboratory using a standardized protocol involving custom cDNA arrays of minimally redundant hepatic clones, chosen through chemical-exposure experiments with prototype hepatic toxicants: 2,3,7,8-tetrachlorodibenzo-p-dioxin (TCDD), cobalt chloride, and phenobarbital. The experimental conditions include 22 chemical treatments, 4 control treatments, and 1 environmental stressor (fasting) over 1 mutant (circadian wild-type control). All chemical treatments were chosen for the express purpose of investigating transcriptional profiles pertaining to hepatotoxicity in mice. Despite its small size and limited focus, EDGE incorporates a high level of standardization and comparability across species, array, experimental protocol, and experimental details, and demonstrates how a fully relational database built on such data can facilitate toxicogenomics investigation. However, EDGE is not a user-depositor system and currently lacks the tissue, species, and chemical diversity necessary for broader toxicogenomics exploration.

Chemical Effects in Biological Systems

CEBS is a public user-depositor data repository with an explicit toxicological and toxicogenomics focus (Table 1).
CEBS can accommodate study design, timeline, clinical chemistry, and histopathology findings, as well as microarray and proteomics data. Each experiment in CEBS pertains to a chemical/environmental exposure or a genetic alteration in reference to clinical or environmental studies. CEBS has a complementary functional component known as the Biomedical Investigation Database (BID) (https://dir-apps.niehs.nih.gov/arc/), which is a relational database used to load and curate study data prior to exporting to public CEBS. BID also aids in the capture and display of novel data, including PCR and toxicogenomic-relevant fields, as used in ArrayExpress’s TOXM designation. CEBS is currently indexed by study and subject characteristics, such as environmental, chemical, or genetic stressor and stressor protocol, and includes observations on rat, mouse, and C. elegans. CEBS is one of the few genomics resources profiled in this survey, and the only resource with significant toxicogenomics-relevant microarray content, that incorporates formal chemical name annotation of experiments. At the time of this writing, CEBS lists an inventory of 136 chemical names, or “chemical stressors,” associated with experimental content, along with a searchable CASRN field containing 121 entries. CEBS plans to incorporate additional chemical standards, including structure annotation, in collaboration with the EPA DSSTox project.

Public Expression Profiling Resources

Similar to EDGE, the PEPR database (Table 1) is a closed, curated system designed to serve as a public resource of gene expression profile data generated in the same laboratory, using the same chip type for three species, and subject to the same quality and procedural controls. PEPR is aimed at providing a standardized warehouse for the analysis of time-series data.
The high degree of standardization within PEPR grants users comparability across arrays without laboratory and array bias, much like EDGE. PEPR adheres to quality control and standard operating procedures and is indexed by Principal Investigator, Tissue type, Experiment, and Organism, but has very few chemical treatment–related experiments and lacks relational searching capabilities. However, the time-series query analysis tool (SGQT) enables the novel generation of graphs and spreadsheets showing the action of any transcript of interest over time. PEPR also differs from EDGE in the extensive data export options that include raw image files (.dat), processed image files (.cel) and interpretation files (.txt). PEPR also has external links to GEO, where PEPR data are mirrored through an automated export/import process. In PEPR, chemical information is stored in free-text fields such as the title, description, and array titles, similar to ArrayExpress and GEO. At the time of this writing, PEPR contains 72 experiments, of which 10 are determined to be chemical/environmental exposure experiments. Hence, PEPR currently covers very limited chemical space, but the SGQT tool for analysis of time-series microarray data, as well as the standardized chemical-exposure experiments, are of potential value for toxicogenomics studies.

dbZach

dbZach, a laboratory tool offered for local installation, is of interest as a modular MIAME-compliant, toxicogenomic-supportive relational database designed to facilitate data integration, analysis, and sharing in support of mechanistic toxicology and toxicogenomics studies (Table 1). dbZach consists of several subsystems for the standardization of all data elements of a toxicogenomics experiment as well as traditional toxicological experiments and, additionally, has built-in functionality for data import and export of both raw data and processed data.
Similar to EDGE, the dbZach project has created a sophisticated relational data environment for integrating and exploring many aspects of a toxicogenomics study. However, also similar to EDGE, this project is very narrowly focused in chemical space and primarily limited to estrogen and estrogenic chemicals.

Comparative Toxicogenomics Database

The CTD is worthy of mention for its toxicogenomics relevance, but is not a traditional genomics database (Table 1). Rather, it is a database of curated relationships between chemicals, genes, and diseases mined from journal articles. CTD provides text-mineable access to the toxicogenomic literature, but currently provides direct linkage to only one secondary genomics resource, that is, EDGE. Also worthy of note, CTD uses the chemical subset of the NLM MeSH vocabulary to provide formal chemical annotation of its content and to link to various chemically indexed toxicology resources (http://ctd.mdibl.org/resources.jsp?type=chem). The present CTD inventory of over 4400 chemical substances also has recently been deposited into the NCBI PubChem resource (http://pubchem.ncbi.nlm.nih.gov/; Supplemental Table 1) to offer structure-searchability and broader access to chemically indexed resources.

Table 2 compares the above inventory of genomics resources from the standpoint of being chemically indexed (i.e., chemical identifiers are required and entered in standard fields), MIAME-supportive, and standardized with respect to various experimental descriptions. Additional details on the comparison of the primary and secondary genomics resources identified in this study with respect to the types of gene expression data stored, toxicological focus, formats of data available for download (raw or processed), ability to query data, ability to import or export experimental data, and programmatic access are presented in Supplemental Table 2.
TABLE 2
Standardization and Indexing of Genomics Data Resources

Columns: Data resource (a); Indexed by chemical; MIAME-supportive; Standardized (b): Species, Array information, Experimental protocol, Experimental details; Allows relational searching (c).
Rows: ArrayExpress, GEO, CIBEX, EDGE, CEBS, PEPR, dbZach, CTD.
[Cell entries (+, -, NA) are not recoverable from the extracted text.]

a Refer to Table 1 for full names, sources, URLs, and references associated with these data resources; feature present (+) or absent (-).
b Standardized entries refer to internal content adhering to controlled vocabularies, and represented in defined and required fields; NA, not applicable.
c Relational searching refers to the ability to construct AND/OR-type queries across the content of defined fields.

Web-based queries and programmatic access were used in the present study to extract current experimental content from ArrayExpress Repository and GEO Series, and to identify corresponding experiment annotation fields (resulting from adherence to MIAME guidelines in the two systems) that could be mapped to common fields to enable comparisons across the two inventories. We implemented a set of 14 Standard Genomics Fields in Table 3 to serve this purpose and to confer read-across capability between the two inventories. All but two of these fields map to existing MIAME-compliant data fields, which vary only slightly in name in GEO and ArrayExpress (see expanded columns in Supplemental Table 3) and, thus, are straightforward to implement. One new field, "Experiment_URL," contains a static URL link to enable outside Internet access directly to the experiment accession summary page in either ArrayExpress or GEO. The last field, "Chemical_StudyType," has no corresponding field in either ArrayExpress or GEO, and was introduced by us to begin to address the currently missing chemical annotation layer for gene expression experiments in both resources.

TABLE 3
Standard Genomics Fields for Common Indexing of Experiments Contained in ArrayExpress Repository and GEO Series

Field name: Description
Experiment_Accession: A unique combination of informative prefix and number used to identify each dataset.
Experiment_AlternativeAccession: An alternate accession number. Example: GEO files in ArrayExpress have a GSE#### (GEO Series) secondary accession number so that users can find the same data in GEO.
Experiment_IdNumber: A unique identification number for each experiment within each database.
Experiment_Title: The title of the experiment.
Experiment_Description: A free-text, user-submitted description of the experiment or dataset.
Experiment_URL: URL links to the Source Experimental Download Page.
Experiment_PubMed_Information: A unique number that links users to each PubMed publication associated with each experiment or dataset.
Experiment_PublicationDate: Date indicating when the dataset was released to the public or published.
Species: Species as listed by the user.
Number_Samples: Number of samples used within a microarray experiment or dataset.
Experiment_ArrayAccession: An accession number for each array design or platform.
Experiment_ArrayType: Details about the platform used or details about data other than raw data that users have submitted.
Experiment_ArrayTitle: The user-submitted title of the Array/Platform used in the experiment.
Chemical_StudyType (a): A designation of the role of the identified chemical in the given experiment. Allowed entries are listed as Subsections to this field (e.g., Reference, Treatment, Vehicle, ...).
    Reference: Chemical used to mimic a biological or environmental situation.
    Treatment: The primary focus of experiment or study is to understand the transcriptomic effects of the chemical.
    Vehicle: Chemical used to aid the administration of the treatment to the organism, such as dimethyl sulfoxide.
    Combination_Treatment: Multiple chemicals used together for treatment purposes (see "Treatment" above).
    Media: Chemical used in maintenance of the tissue culture or sample conditions, such as phosphate buffered saline.
    Not_Enough_Information: Sufficient information is not present in the experimental description to determine the role of the chemical.

a Subsections to the Chemical_StudyType field have allowed entries: Reference, Treatment, etc., with linkage text "AND" used for combinations (e.g., TreatmentANDReference).

DSSTox chemical-index files for GEO and ArrayExpress created by the above methods are publicly available for download from the DSSTox website (http://epa.gov/ncct/dsstox/). In addition to the main DSSTox chemical-index files that include one record for each unique chemical (i.e., unique test substance) in the "Treatment" category in each of the two repositories, we have published Auxiliary files that include DSSTox Standard Chemical Fields, Standard Genomics Fields (14), and additional Source-specific experiment description fields (33 for ArrayExpress, 4 for GEO) for the full chemical-associated experiment inventories in the two resources (i.e., one record for each chemical-experiment pair). Detailed descriptions of the content of these files and their incorporation into the DSSTox Structure-Browser (http://www.epa.gov/dsstox_structurebrowser/) and PubChem, the results of which enable structure-based Internet linkages directly to ArrayExpress and GEO experiment summary pages, are provided elsewhere (Williams-Devane et al., 2009).

Table 4 provides a breakdown of the current chemical-associated experimental content within ArrayExpress Repository and GEO Series according to all Chemical_StudyType categories (all counts correspond to data extracted on September 20, 2008). Of the 6346 total ArrayExpress experimental descriptions downloaded, more than a third (2365) were identified as chemical-associated experiments by the procedures outlined in the Methods section, corresponding to 1011 unique chemical test substances. Similarly, of the 9957 GEO Series experimental descriptions downloaded, nearly a quarter (2381) were identified as chemical-associated experiments, corresponding to a total of 1064 unique chemical test substances (Table 4).

TABLE 4
Classification of Chemically Indexed Genomics Experiments in ArrayExpress and GEO by "Chemical_StudyType"

Database (a)                                      ArrayExpress Repository    GEO Series
Total no. of Experiments (b)                      6346                       9957
Total no. of Chemical-Experiment Records (c)      2365                       2381
Total no. of Unique Chemicals (d)                 1011                       1064
Chemical_StudyType Classification (e): breakdown of Total no. of Chemical-Experiment Records (Unique Chemicals) (f)
    Treatment (g)                                 1609 (810)                 1951 (838)
    Reference                                     266 (157)                  152 (60)
    Vehicle                                       138 (26)                   81 (48)
    Media                                         111 (68)                   14 (14)
    Combination Treatment (g)                     109 (91)                   72 (38)
    Multiple Classifications                      118 (83)                   111 (67)
    Other                                         14 (10)                    0 (0)

a All numbers relate to database content extracted on September 20, 2008.
b Total number of experiments contained in the public resource (also corresponds to the number of unique Accession IDs).
c Number of Chemical-Experiment pairs extracted from the Total no. of Experiments prior to determination of the Chemical_StudyType Classification, where some experiments in the Total no. of Experiments map to no chemicals, and some experiments involving multiple chemicals map to more than one Chemical-Experiment record.
d Total number of unique chemical test substances (i.e., no chemical test substance identity is duplicated) identified in the total group of Chemical-Experiment records, irrespective of Chemical_StudyType Classification.
e Definitions of Chemical_StudyType Classifications are provided in Table 3.
f Number of Chemical-Experiment Records corresponding to each Chemical_StudyType category (with corresponding number of unique chemicals in parentheses), where for the purposes of this table one record is assigned to one category and if the chemical is used for different purposes within one experiment (e.g., TreatmentANDReference), it is assigned to the "Multiple Classifications" category.
g Number of Chemical-Experiment Records (with corresponding number of unique chemicals in parentheses) out of the total group of Chemical-Experiment Records that are associated with the "Treatment" category according to the criteria for a chemical-exposure scenario set forth in this paper; any record labeled as "Treatment" or "CombinationTreatment" (alone or in combination with other Chemical_StudyType labels, e.g., TreatmentANDReference) is included in the final DSSTox chemical-index file.

Table 5 provides a breakdown of the "Treatment"-associated experimental content within the ArrayExpress Repository and GEO Series according to DSSTox chemical classification categories. Of the 1835 total "Treatment"-associated experiments in the ArrayExpress Repository, 1282 experiments (or 70% of the total) are associated with a "defined organic" chemical test substance (note that multiple experiments can map to the same chemical). GEO Series contains a similarly high percentage of "Treatment" experiments associated with a defined organic chemical test substance, that is, 1544/2134, or 72%. The above indicators give a rough sense of the size of the inventory of microarray experiments associated with defined organics in the public domain. Table 5 also provides indications of the size of the chemical space associated with these "Treatment" experiments.
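The "Treatment" counts used here rest on the inclusion rule stated in footnote g of Table 4, which can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the scripts used in the study: the function names are hypothetical, combined labels are assumed to be plain strings joined by the literal text "AND" (e.g., TreatmentANDReference), and the two spellings Combination_Treatment and CombinationTreatment are treated as the same label.

```python
# Labels that qualify a chemical-experiment record for the "Treatment" index.
# Assumption: both spellings of the combination label may occur.
TREATMENT_LABELS = {"Treatment", "Combination_Treatment", "CombinationTreatment"}


def split_study_type(label: str) -> list:
    """Split a Chemical_StudyType entry such as 'TreatmentANDReference'."""
    return [part for part in label.strip().split("AND") if part]


def is_treatment_record(label: str) -> bool:
    """True if any component of the label is a Treatment-type role."""
    return any(part in TREATMENT_LABELS for part in split_study_type(label))


def table4_category(label: str) -> str:
    """Assign one reporting category per record: a single label keeps its
    name, a combined label is counted under 'Multiple Classifications'."""
    parts = split_study_type(label)
    if not parts:
        raise ValueError("empty Chemical_StudyType label")
    return parts[0] if len(parts) == 1 else "Multiple Classifications"


for label in ("Treatment", "TreatmentANDReference", "Vehicle"):
    print(label, is_treatment_record(label), table4_category(label))
```

Keeping the inclusion test separate from the reporting category mirrors the two footnotes: a record such as TreatmentANDReference is tallied once under "Multiple Classifications" in Table 4, yet it still qualifies for the "Treatment" chemical-index file.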
Of the total number of unique chemical test substances associated with the "Treatment" category of experiments in ArrayExpress Repository, 628/887, or 71%, correspond to defined organics. Although these include some drugs, small peptides, and biologics with molecular weights ranging from 600 to 1700 amu, the majority (> 90%) are small molecular weight (< 500 amu) organic chemicals for which a chemical structure can be assigned and that tend to be of greatest interest for environmental toxicology and structure-activity relationship models and inferences. A similar percentage applies to GEO, that is, 751/1014, or 74%, of unique chemical test substances associated with "Treatment" experiments correspond to defined organics. Hence, both resources span a relatively large number of unique defined organic chemicals, which implies a broad chemical diversity associated with public microarray experiments. Within ArrayExpress, the chemical that maps to the largest number of chemical-experiments is "estradiol," occurring in 53 experiments, 44 of which are classified as "Treatment" experiments.

Comparison of GEO and ArrayExpress Experimental and Chemical Content

Application of DSSTox Standard Chemical Fields and the set of 14 Standard Genomics Fields enables direct comparison of GEO Series and ArrayExpress Repository experimental and chemical content. In addition, the DSSTox Auxiliary files for ArrayExpress include a number of easily extracted field characteristics affiliated with each experiment, including Array/Platform type, Species, and the MIAME Score and its five subcategories: Array or Platform information, Factor information, Raw Data information, Processed Data information, and Protocol information. The latter annotations are particularly valuable for assessing the sufficiency of the experimental data for reanalysis. The distribution of ArrayExpress "Treatment" chemical-experiments assigned to these various categories of experimental description is provided in Table 6.
The distribution of MIAME scores is particularly illuminating. A total MIAME Score of 5 indicates that all components of the MIAME compliance criteria have been included by the submitter. Only 18% (or 216) of the chemical treatment experiments in the ArrayExpress Repository have all five components of information, whereas 50% (or 595) have four components of information. Most noteworthy for this subset of "Treatment" experiments, however, raw data information is missing for 29% (or 347), processed data is missing for 11% (or 131), and protocol is missing for 21% (or 291) (Table 6). Given that these are essential experimental components for the reanalysis of gene expression data, these numbers limit the number of chemical treatment experiments within ArrayExpress that are potentially useful for broader toxicogenomics investigation.

TABLE 5
Classification of Chemically Indexed "Treatment" Genomics Experiments in ArrayExpress and GEO by DSSTox Chemical Classification

Database (a)                                                 ArrayExpress Repository    GEO Series
Total no. of Chemical-Experiment Records (b)                 2365                       2381
Total no. of "Treatment" Chemical-Experiment Records (c)     1835                       2134
Total no. of Unique Chemicals (d)                            887                        1014
DSSTox Chemical Classification (e): breakdown for "Treatment" Chemical-Experiment Records (Unique Chemicals) (f)
    No structure (g)                                         373 (179)                  346 (173)
    Defined organic                                          1282 (628)                 1544 (751)
    Inorganic                                                153 (60)                   210 (71)
    Organometallic                                           27 (20)                    34 (19)

a All numbers relate to database content extracted on September 20, 2008.
b See Table 4.
c Total number of Chemical-Experiment records assigned to any "Treatment" Chemical_StudyType (e.g., Treatment, CombinationTreatment, TreatmentANDReference, etc.) according to the criteria for a chemical-exposure scenario set forth in this paper.
d Total number of unique chemical test substances (i.e., no chemical test substance identity is duplicated) identified in the total group of "Treatment" Chemical-Experiment Records.
e Refers to DSSTox Standard Chemical Field Definition and allowed entries for STRUCTURE_ChemicalType (http://www.epa.gov/ncct/dsstox/CentralFieldDef.html#STRUCTURE_ChemicalType).
f Number of "Treatment" Chemical-Experiment Records corresponding to each Chemical Classification category (with corresponding number of unique chemical substances in parentheses), where each record maps to a single chemical classification and the list of unique chemicals for this "Treatment" subset of experiments constitutes the final DSSTox structure-index file.
g Number of "Treatment" Chemical-Experiment Records (with corresponding number of unique chemicals) where the chemical test substance is identified, but not assigned to a DSSTox chemical structure; for example, this can be an undefined mixture, polymer, or macromolecule.

The ArrayExpress Repository has experienced steep growth in the past few years, largely as a result of the integration of GEO experimental content (approximately 4500 experiments were added from January 2007 to January 2008). ArrayExpress files with E-GEOD-XXXX accession numbers mirror GEO Series entries and currently represent more than 50% of the chemical-exposure, or "Treatment," experiments in the ArrayExpress Repository (Fig. 1). Figure 1 also shows that the total number of chemical-experiment pairs (a pair being a 1:1 mapping of chemical to experiment) and total number of "Treatment" chemical-experiment pairs identified in the current study are comparable between ArrayExpress and GEO, with greater than 50% overlap of chemical-experiment pairs in all categories. Unlike ArrayExpress, GEO Series currently provides no MIAME scoring of content.
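As a minimal sketch of how such a score can be tallied (not the ArrayExpress implementation), the total can be treated as the sum of five binary subscores, one for each MIAME subcategory: Array, Factor, Raw Data, Processed Data, and Protocol. The class, attribute, and function names below are hypothetical. The reanalysis check encodes an assumption drawn from the text, namely that raw data, processed data, and protocol information are all needed. The total of 1181 ArrayExpress "Treatment" experiments is the sum of the array/platform counts in Table 6 (861 + 82 + 238).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MiameSubscores:
    """Presence (True) or absence (False) of each MIAME subcategory."""
    array: bool
    factor: bool
    raw_data: bool
    processed_data: bool
    protocol: bool

    def total(self) -> int:
        """Total score: number of subcategories present, from 0 to 5."""
        return sum((self.array, self.factor, self.raw_data,
                    self.processed_data, self.protocol))

    def supports_reanalysis(self) -> bool:
        """Assumed criterion: raw data, processed data, and protocol present."""
        return self.raw_data and self.processed_data and self.protocol


def percent(count: int, total: int) -> int:
    """Share of experiments, rounded to a whole percent."""
    return round(100 * count / total)


# An experiment with everything except raw data scores 4 but fails the
# assumed reanalysis criterion.
example = MiameSubscores(array=True, factor=True, raw_data=False,
                         processed_data=True, protocol=True)
print(example.total(), example.supports_reanalysis())

# Share of ArrayExpress "Treatment" experiments with a total score of 5.
print(percent(216, 1181))
```

The example illustrates why the total alone can mislead: a score of 4 looks nearly complete, yet the missing component may be the raw data that reanalysis depends on.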
However, because a significant portion of the "Treatment" experiments represented in GEO Series has been incorporated into the ArrayExpress Repository, it was possible to create a table summarizing these experimental factors for the subset of GEO chemical "Treatment" experiments contained within ArrayExpress (Table 6). Only 11% (or 81) of the GEO records in ArrayExpress are assigned a MIAME Score of 5; however, 56% (or 415) have a MIAME score of 4. A much greater percentage, 45% (or 335) of GEO records in ArrayExpress, lack corresponding Raw Data (compared with 29% for the ArrayExpress Repository as a whole), whereas 100% (or 745) have Processed Data (most likely a precondition for inclusion of GEO experiments in ArrayExpress). Figure 2 presents overlap of the unique chemical content pertaining to the "Treatment" chemical-experiment category. Assessment of chemical overlap between GEO and ArrayExpress DSSTox files was determined on the basis of DSSTox "TestSubstance" identifiers. The steroids estradiol and dexamethasone are associated with the largest numbers of microarray experiments in both cases, and the largest number of shared experiments as well. Other test substances most commonly associated with experiments in either GEO or ArrayExpress include ethanol, 2,3,7,8-TCDD, retinoic acid, and trichostatin, each of which is of broad toxicological interest.

Assessing Toxicological Relevance of GEO and ArrayExpress Chemical Content

DSSTox chemical structure annotation enables, for the first time, an examination of the chemical diversity and coverage of GEO Series and ArrayExpress Repository experiments.
We found significant numbers of experiments in both resources mapped to families of similar chemicals, as well as to a broad diversity of chemical structures, spanning a wide range of toxicologically relevant chemical functional hierarchies and classes (Supplemental Fig. 2).

TABLE 6
Characteristics of the ArrayExpress Repository pertaining to "Treatment" Experiments (based on Data Extracted on September 20, 2008)

Entries are Number (%) of "Treatment" experiments (a).

Characteristic                 Major characteristic value    ArrayExpress    GEO Series from ArrayExpress
Array/platform                 Affymetrix                    861 (73%)       691 (93%)
                               Agilent                       82 (7%)         54 (7%)
                               Other                         238 (20%)       0 (0%)
Species                        Homo sapiens                  377 (32%)       264 (35%)
                               Mus musculus                  317 (27%)       220 (30%)
                               Rattus                        173 (15%)       126 (17%)
                               Arabidopsis                   159 (13%)       76 (10%)
                               Saccharomyces cerevisiae      55 (5%)         15 (2%)
                               Other                         100 (8%)        44 (6%)
MIAMEScore_Total (b)           5                             216 (18%)       81 (11%)
                               4                             595 (50%)       415 (56%)
                               3                             309 (26%)       211 (28%)
                               2                             55 (5%)         38 (5%)
                               1                             6 (1%)          0 (0%)
MIAMEScore_Array (c)           0                             78 (7%)         0 (0%)
                               1                             1103 (93%)      745 (100%)
MIAMEScore_Factor (d)          0                             551 (47%)       421 (57%)
                               1                             630 (53%)       324 (43%)
MIAMEScore_RawData (e)         0                             347 (29%)       335 (45%)
                               1                             834 (71%)       410 (55%)
MIAMEScore_ProcessedData (f)   0                             131 (11%)       0 (0%)
                               1                             1050 (89%)      745 (100%)
MIAMEScore_Protocol (g)        0                             291 (21%)       195 (26%)
                               1                             890 (79%)       550 (74%)

a Note that the total number of "Treatment" experiments (or studies) will be less than the total number of "Treatment" chemical-experiment pairs in Table 5 due to inclusion of experiments/studies that have tested multiple chemicals (and/or used multiple platforms, etc.).
b The Total MIAME score ranges from 0 to 5 and is the sum of the five independent subcomponent scores, each of which takes on the value of either 0 or 1 (absent or present).
c Specific information about the design of the array or the platform used was submitted (1) or not submitted (0) with the experiment by the submitter. Included Array information is assigned an Array Accession number (see Experiment_ArrayAccession, Table 3) within ArrayExpress.
d A list of experimental factors was submitted (1) or not submitted (0) with the experiment by the submitter; factors might include information on the cell line or particular compounds and dose information used in the experiment.
e Raw data was submitted (1) or not submitted (0) with the experiment by the submitter.
f Processed data was submitted (1) or not submitted (0) with the experiment by the submitter.
g Specific information about the experimental protocols used in the experiment was submitted (1) or not submitted (0) with the experiment by the submitter. Included Protocol information is assigned a Protocol Accession number within ArrayExpress.

A further metric of toxicological relevance is provided by the overlap of unique "Treatment" chemical substances in GEO and ArrayExpress with the current published DSSTox inventory, which includes more than 10,000 unique chemical substances, and spans a variety of environmentally and toxicologically relevant chemical inventories and data sets from various sources, including EPA, the National Toxicology Program, and the U.S. Food and Drug Administration (Richard et al., 2008). At the time of this survey, more than 550 unique chemical substances in the DSSTox GEO and/or ArrayExpress files (GEOGSE and ARYEXP) corresponding to "Treatment" experiments are contained within one or more of the 11 previously published DSSTox Data Files (http://www.epa.gov/ncct/dsstox/DataFiles.html), and there are a total of 1294 overlapping instances (i.e., some chemicals occur in multiple DSSTox Data Files).
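The distinction between unique overlapping chemicals and overlapping instances can be made concrete with a small sketch. The inventory names and identifiers below are invented placeholders rather than DSSTox content, and the function name is hypothetical. The point is only that one chemical found in several Data Files counts once as a unique chemical but several times as an instance.

```python
from collections import Counter


def overlap_summary(index_chemicals, data_files):
    """Compare an index file's chemicals against several inventories.

    index_chemicals: set of test-substance identifiers
    data_files: dict mapping inventory name -> set of identifiers
    Returns (unique overlapping chemicals, overlapping instances,
             chemicals found in two or more inventories).
    """
    hits = Counter()
    for members in data_files.values():
        for chem in index_chemicals & members:
            hits[chem] += 1
    unique = len(hits)
    instances = sum(hits.values())
    multi = sum(1 for n in hits.values() if n >= 2)
    return unique, instances, multi


# Placeholder identifiers, for illustration only.
index_file = {"chem-A", "chem-B", "chem-C", "chem-D"}
inventories = {
    "file_1": {"chem-A", "chem-B"},
    "file_2": {"chem-A", "chem-X"},
    "file_3": {"chem-A", "chem-C"},
}
print(overlap_summary(index_file, inventories))
```

In this toy case three chemicals overlap, but "chem-A" appears in all three inventories, so the instance count is five.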
Of these overlapping instances, three chemical substances (Bisphenol A, di(2-ethylhexyl) phthalate, and dibutylphthalate) occur in eight DSSTox Data Files, and a total of 309 chemical substances occur in two or more DSSTox Data Files. These numbers indicate that significant numbers of GEO and ArrayExpress "Treatment" chemical-experiments correspond to chemicals of potential toxicological concern, for which additional in vitro or in vivo data may exist.

FIG. 1. Comparison of numbers of GEO Series and ArrayExpress Repository experiments, chemical-experiment pairs, and "Treatment" chemical-experiment pairs, also showing overlapping content between the two systems; refer to totals and legends in Tables 4 and 5 (based on data extracted September 20, 2008).

FIG. 2. Comparison of the total sets of unique chemicals pertaining to Treatment Chemical-Experiment pairs in ArrayExpress Repository and GEO Series from the DSSTox data files; shown in each section are the chemicals mapping to the largest number of "Treatment" Chemical-Experiments in each case, with the number of experiments shown in parentheses (GEO/ArrayExpress) (based on data extracted September 20, 2008).

DISCUSSION

The term "chemogenomics" has been proposed to more generally encompass the overlap of genomics technologies with treatment-related chemical effects on biological systems, including both toxicity-related and therapeutic effects (Fielden and Kolaja, 2006). Chemogenomics adds a top-most chemical layer to data organization, with broad chemical coverage of standardized-protocol experiments a key requirement for discerning activity patterns that can be confidently extrapolated across chemical space. This approach and its implementation are perhaps best exemplified by the Iconix DrugMatrix® database and applications (Ganter et al., 2005). The Iconix database consists of data generated for a single species (rat), treated by more than 600 compounds in seven tissue types, representing upwards of 3200 different drug-dose-time-tissue combinations. The database covers five different domains of data: microarray, clinical chemistry, hematology, organ weight, and histopathology, and was built using a common microarray platform and stringent experimental protocols and standards for data generation and processing.

Whereas the Iconix database represents an ideal, practically speaking, it is far removed from the reality of a public microarray resource, upon which most public toxicogenomics investigations must rely. In their role as primary repositories of data associated with the published scientific literature, public microarray data repositories such as GEO and ArrayExpress cannot limit their content to include only experiments adhering to strict common protocol standards and traditional model organisms. A public data resource can, however, strive for completeness and accuracy of experimental annotations and provide user access to raw data for reanalysis. Similarly, the accurate identification of a chemical in relation to an experiment, particularly where the primary purpose of the experiment is to discern effects of the chemical on a biological system, should be considered as primary experimental annotation and absolutely essential to experimental reproducibility. Whereas standardization and chemical indexing of microarray experiments at the time of data deposition and publication is the ideal, if minimally sufficient information (i.e., a valid chemical name, along with specification of the purpose of the chemical in relation to the experiment) is collected at the time of data deposition in required data fields, formal chemical indexing with structure annotation and quality review can be performed efficiently with the appropriate chemical expertise in collaboration with public efforts such as DSSTox and ChEBI (Supplemental Table 1).
As the present survey has shown, although a number of public microarray resources have the potential to support toxicogenomics investigations, these resources currently represent a patchwork of disconnected or loosely connected inventories and capabilities (Larsson and Sandberg, 2006), having different goals, degrees of standardization, public data accessibility, data mining ability, and utility for toxicogenomics investigation (Table 2; Supplemental Table 2). Primary genomics resources (Table 1) serve as official MGED-sanctioned repositories for public gene expression data associated with the scientific literature (Mattes et al., 2004; Salter, 2005), with GEO and ArrayExpress currently being, by far, the largest and most important. Both are MIAME-supportive databases, meaning that they accept all information about an experiment set forth by the MIAME guidelines; however, they do not actually require this information. In addition, there is currently insufficient standardization within GEO or ArrayExpress pertaining to protocol or experimental description to fully support exploration within and across these resources (Table 2). Secondary genomics resources contain genomic-related data but generally have more limited content and are designed for more specialized purposes and applications (Table 1). With its specific focus on toxicogenomics, attention to chemical indexing, and addition of the BID system, CEBS is worthy of special mention, having incorporated many elements of an ideal toxicogenomics resource. To support robust relational searching for toxicogenomics, CEBS has the added task of capturing and systematizing user-deposited data pertaining to a study. CEBS bridges the gap between an open-access, user-depositor system and a relational, curated database by instituting a high degree of standardization and data controls that extend beyond MIAME guidelines (Fostel et al., 2005).
CEBS is striving for much larger coverage of chemical space in relation to chemical treatment experiments. In collaboration with DSSTox, and building on current annotation efforts of GEO and ArrayExpress, CEBS will provide structure-searching capabilities and chemical linkages to external public resources, such as PubChem. In addition, CEBS will provide direct access to GEO and ArrayExpress "Treatment" chemical-experiment content, as well as automated secondary deposition of CEBS content to GEO. A most noteworthy deficiency of most secondary genomics resources and of the two primary genomics resources (ArrayExpress and GEO), highlighted in the present study, is the complete lack of incorporation of chemical annotation and standards that would allow aggregation of data for the same or similar chemicals, and linkage to growing lists of chemically indexed resources (Supplemental Table 2). Due to the lack of chemical-reporting standards, the process for identifying chemical treatment-related experiments in ArrayExpress Repository and GEO Series in this study was time-consuming and difficult to automate (Supplemental Fig. 1 and Supplemental Example 1). Present efforts serve to highlight deficiencies in microarray experiment data deposition requirements and standards with regard to chemistry and chemical treatment-related experiments that, if better addressed, could greatly facilitate chemical annotation and data integration efforts in the future. With formal chemical annotation, it becomes possible to assess the chemical coverage of public gene expression databases, to link data for common or similar chemicals across information domains, including toxicology, as well as to gather data from comparable experiments, possibly performed in different labs and species, that can begin to serve as the basis for meta-analysis or structure-activity hypotheses.
Furthermore, the proposed set of Standard Genomics Fields, most of which map to existing fields from both GEO and ArrayExpress, serves to bridge the two resources and facilitate comparisons and incorporation of their content into other resources in a standardized way.

CONCLUSION

It is hoped that the current exercise to create, publish, and link chemical-index files for GEO Series and ArrayExpress Repository has had two primary impacts: (1) to highlight deficiencies in the current chemical annotation and curation methods within ArrayExpress and GEO that particularly impact toxicogenomics applications of these resources; and (2) to show the way forward in terms of the potential benefits that can be derived by incorporating robust chemical annotation and linkages of chemical treatment-related content to these public resources. Recently improved coordination of the EBI ArrayExpress and ChEBI projects, whereby ChEBI provides link-outs from chemical structure to particular ArrayExpress experiments (currently only provided for a handful of experiments for which ArrayExpress data submitters provided ChEBI identifiers), is a significant step forward and should immediately benefit from incorporation of the DSSTox ArrayExpress chemical-experiment index file, as well as the addition of the corresponding DSSTox GEO index file. However, as is apparent from past failures, it is not sufficient to recommend that users add accurate chemical information at the time of data submission unless more stringent efforts to require this information are instituted. In addition, we strongly recommend adoption of the "Chemical_StudyType" categories, or something comparable, for each chemical-associated study or experiment deposited into GEO and ArrayExpress.
Finally, recognizing that GEO and ArrayExpress are not designed primarily as toxicogenomics resources, submitters of explicit toxicogenomic study data should be strongly encouraged to 370 WILLIAMS-DEVANE, WOLF, AND RICHARD initially deposit studies into CEBS as a way to ensure capture of sufficient toxicogenomics experimental description, utilizing the automated deposition capabilities of CEBS to secondarily deposit well annotated, chemically indexed data into GEO. Postscript: All of the initial published DSSTox chemical files and results reported here were based on data extracted from the ArrayExpress Repository and GEO Series on September 20, 2008. Subsequent updates of both ArrayExpress Repository and GEO Series chemical-index files, based on data extracted on January 20, 2009 and February 2, 2009, respectively, have been published on the DSSTox website and incorporated into PubChem as of March 2009; these updated files do not change the overall trends or conclusions of the present study. Davis, A. P., Murphy, C. G., Saraceni-Richards, C. A., Rosenstein, M. C., Wiegers, T. C., and Mattingly, C. J. (2009). Comparative toxicogenomics database: A knowledgebase and discovery tool for chemical-gene-disease networks. Nucleic Acids Res. 37(Database issue), D786–D792. SUPPLEMENTARY DATA Dix, D. J., Houck, K. A., Martin, M. T., Richard, A. M., Setzer, R. W., and Kavlock, R. J. (2007). The ToxCast program for prioritizing toxicity testing of environmental chemicals. Toxicol. Sci. 95, 5–12. Supplementary data are available online at http://toxsci. oxfordjournals.org/. FUNDING NCSU/EPA Cooperative Training Program in Environmental Sciences Research, Training Agreement (CT833235-01-0) with North Carolina State University supported C.R.W. ACKNOWLEDGMENTS We would like to thank Drs Jennifer Fostel (CEBS), Chihae Yang (FDA Center for Food Safety and Nutrition), David Dix (EPA), and William Ward (EPA) for helpful comments and suggestions in review of this manuscript. 
This work was carried out by C.R.W. as part of a graduate research project within the Bioinformatics Program at North Carolina State University; the thesis is publicly accessible at http://www.lib.ncsu.edu/theses/available/etd-12112008-214342/.