Download Toward a Public Toxicogenomics Capability for

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
TOXICOLOGICAL SCIENCES 109(2), 358–371 (2009)
doi:10.1093/toxsci/kfp061
Advance Access publication March 30, 2009
Toward a Public Toxicogenomics Capability for Supporting Predictive
Toxicology: Survey of Current Resources and Chemical Indexing of
Experiments in GEO and ArrayExpress
ClarLynda R. Williams-Devane,* Maritja A. Wolf,† and Ann M. Richard‡,1
*U.S. EPA/Office of Research and Development (ORD)/National Health & Environmental Effects Research Laboratory (NHEERL), Research Triangle Park, NC
27519; †Lockheed Martin (Contractor to U.S. EPA), Research Triangle Park, NC 27519; and ‡U.S. EPA/Office of Research and Development (ORD)/National
Center for Computational Toxicology (NCCT), Research Triangle Park, NC 27519
Received January 18, 2009; accepted March 23, 2009
A publicly available toxicogenomics capability for supporting
predictive toxicology and meta-analysis depends on availability of
gene expression data for chemical treatment scenarios, the ability
to locate and aggregate such information by chemical, and broad
data coverage within chemical, genomics, and toxicological
information domains. This capability also depends on common
genomics standards, protocol description, and functional linkages
of diverse public Internet data resources. We present a survey of
public genomics resources from these vantage points and conclude
that, despite progress in many areas, the current state of the
majority of public microarray databases is inadequate for supporting these objectives, particularly with regard to chemical indexing.
To begin to address these inadequacies, we focus chemical
annotation efforts on experimental content contained in the two
primary public genomic resources: ArrayExpress and Gene
Expression Omnibus. Automated scripts and extensive manual
review were employed to transform free-text experiment descriptions into a standardized, chemically indexed inventory of experiments in both resources. These files, which include top-level
summary annotations, allow for identification of current chemicalassociated experimental content, as well as chemical-exposure–
related (or ‘‘Treatment’’) content of greatest potential value to
toxicogenomics investigation. With these chemical-index files, it is
possible for the first time to assess the breadth and overlap of
chemical study space represented in these databases, and to begin
to assess the sufficiency of data with shared protocols for chemical
similarity inferences. Chemical indexing of public genomics
databases is a first important step toward integrating chemical,
toxicological and genomics data into predictive toxicology.
Key Words: microarray; chemical; toxicogenomics; toxicity;
prediction.
Disclaimer: This manuscript was approved by the U.S. EPA’s National
Center for Computational Toxicology for publication. However, the contents
do not necessarily reflect the views and policies of the EPA and mention of
trade names or commercial products does not constitute endorsement or
recommendation for use. Each of the authors declares no competing interests
pertaining to the present work.
1
To whom correspondence should be addressed at Mail Drop D343-03, 109
TW Alexander Dr., U.S. Environmental Protection Agency, Research Triangle
Park, NC 27711. Fax: (919) 685-3263. E-mail: [email protected].
Published by Oxford University Press 2009.
Conventional toxicology investigates cellular and animal
responses to chemical treatment through domain-specific bioassay studies (e.g., chronic, developmental), typically mapping
a single chemical to a toxicological endpoint. Microarray
technologies, in contrast, detect genome-wide perturbations
resulting from a chemical treatment, and measure response
variables that probe a large number of genes and gene pathways
potentially underlying multiple toxicological endpoints. A
typical toxicogenomics experiment requires that linkages be
established between these technologies, focusing on treatmentrelated effects of one or a few chemicals and attempting to relate
gene expression changes to a toxicological endpoint (Gomase
et al., 2008; Hamadeh et al., 2002; Hirabayashi and Inoue, 2002).
In silico toxicogenomic meta-analysis methods combine data
across existing toxicological and gene expression experiments to
generate new, and to confirm existing hypotheses of the effect of
a compound treatment. Such a capability depends upon the
availability of gene expression data derived from chemical
treatment scenarios, as well as anchoring toxicology data to
support predictive inferences.
The chemical nature of the problem requires a standardized,
chemical-centric view of data at all levels. Hence, a publicly
available toxicogenomics capability sufficiently robust for
mechanistic inferences and building predictive models requires
not only common data standards, protocols, and the ability to
query and aggregate common data types across resources, but
also broad data coverage within, and linkages across chemical,
genomics and toxicological information domains. These
requirements have, to varying degrees, informed development
of the major public microarray databases, and have been the
central design principle of specialized toxicogenomic resources
(Waters et al., 2008). In recent years, there have also been
significant advances in promoting toxicology standards and
data models (i.e., controlled vocabulary and hierarchical data
organization), quantitative high-throughput screening, and
chemically indexed bioassay data that, taken as a whole, have
359
CHEMICAL INDEXING OF TOXICOGENOMICS RESOURCES
the potential to greatly enhance toxicogenomics capabilities in
the public domain (Dix et al., 2007; Martin et al., 2009;
Richard et al., 2008; Yang et al., 2006a, 2006b).
In the genomics field, the two largest public resources for
deposition of microarray data, approved by the Microarray
Gene Expression Data (MGED) Society (http://www.mged.
org/), are the European Bioinformatics Institute’s (EBI)
ArrayExpress (http://www.ebi.ac.uk/arrayexpress) and the
National Center for Biotechnology Information’s (NCBI) Gene
Expression Omnibus (GEO) (www.ncbi.nlm.nih.gov/geo).
Publishing requirements for the deposition of raw or processed
microarray data into these database repositories, coupled with
MIAME (Minimum Information About a Microarray Experiment) standards for data reporting, are increasing the comparability, utility and breadth of these resources (Ball et al., 2004).
Enhanced external programmatic access to the major public
microarray data repositories also allows third parties to
automatically extract and reformulate data to enhance informatics and data mining capabilities (Boyle, 2005; Ivliev et al., 2008;
Zhu et al., 2008). Additional public efforts are aimed at
standardizing the description of experimental protocols (Taylor
et al., 2008), as well as improving toxicity data standards in
relation to toxicogenomics experiments (Burgoon, 2007; Fostel,
2008; Fostel et al. 2005, 2007). Largely neglected in the
genomics field, however, has been the standardization of
chemical information associated with the experimental data
when chemical treatment is a primary objective of the
experiment. Such annotation is essential for systematically
relating chemical property and effects information, irrespective
of whether the study has an explicit toxicological focus, across
the diverse data domains potentially contributing to toxicogenomics. Furthermore, the ability to query, relate, and aggregate
information by chemical and across chemical space is essential
to the goal of chemical screening and toxicity assessment
(Dix et al., 2007; Richard et al., 2008; Yang et al., 2008).
In the remainder of this paper, we broadly survey the current
state of public microarray resources from the above vantage
points, focusing particularly on the two primary resources,
ArrayExpress and GEO. Although the latter resources are not
explicitly designed to meet the needs of the toxicogenomics
community, they currently serve as the two largest public
microarray data repositories of potential toxicogenomic
relevance and, as such, are potentially valuable sources of
data for toxicogenomics study. Despite progress in many areas,
we find the present state of public microarray repositories
inadequate for supporting interoperability and linkages across
diverse data domains in support of toxicogenomics. Particularly noteworthy is the lack of minimal chemical annotation
and, as a result, the effective isolation of these resources and
associated data from the growing inventories of chemically
indexed bioassay information of potential relevance to
toxicology (Richard et al., 2006, 2008).
To begin to address these inadequacies, we propose and
implement a set of standard genomic fields for indexing of
experimental study records, aligned with current MIAME
guidelines, that enables cross-referencing and comparison of
total experimental content in GEO and ArrayExpress. In
addition, we implement a set of established chemical standards
for labeling experiments contained within ArrayExpress and
GEO, in collaboration with the U.S. Environmental Protection
Agency’s (EPA) Distributed Structure-Searchable Toxicity
Database (DSSTox) project. We briefly describe the process
of annotation and creation of public-distribution DSSTox
chemical-index files for both GEO Series and Array Express
Repository. These files enable, for the first time, assessment of
the chemical scope, diversity, and coverage of experimental
content within GEO and ArrayExpress of potential use for
toxicogenomics study.
METHODS
For the purpose of assessing the relevance of public microarray resources to
toxicogenomics and predictive toxicology, we considered the current
annotation of experimental content pertaining to chemical treatment scenarios,
that is, cases in which study of gene expression changes induced by chemical
treatment constituted the primary goal of the experiment. As a measure of
interoperability between data resources, we examined the standardization of
terminology and data accessibility, as well as the formatting of data, paying
particular attention to specification of experimental protocols, such as animal/
tissue/cell treatment, RNA extraction, microarray preparation, data import/
export, and analysis. As a measure of chemical indexing, we examined the
degree of standardization and annotation pertaining to chemical-associated
experiments across public genomics resources and, particularly, whether the
chemical information was formally indexed, that is, contained in a separate,
searchable field.
For the purpose of chemically indexing experimental content in ArrayExpress and GEO, a chemical-exposure (or ‘‘Treatment’’) microarray experiment
is broadly defined by us as a study in which the cells, tissues, or whole
organisms were treated with a defined chemical, chemical mixture, or natural
substance (including biologics and proteins), DNA was extracted, and gene
expression changes resulting from this treatment were investigated with
microarray technologies. Whether the chemical to which the system was
exposed is a known toxicant, potential toxicant, natural substance, or
therapeutic agent need not be distinguished because the measured outcome is
the same, that is, treatment-related gene expression changes. However,
experiments in which chemical treatment was secondary to the primary
purpose of the experiment (e.g., treatment with prophylactic antibiotics for
maintaining tissue culture conditions) or where study of chemical-exposure–
induced effects was not the primary purpose of the experiment (e.g., treatment
with streptozocin to induce Diabetes Mellitus for investigating the effects of
diabetes) required further annotation and review. These cases of chemicalexperiment associations were labeled by us to indicate the role of the chemical
as other than ‘‘Treatment.’’
For initial inventory purposes, extraction of experimental description fields,
and locating chemical-associated experiments within ArrayExpress and GEO,
we used available web search options and programmatic access tools within
each system, as well as extensive manual review (a workflow diagram is
provided in Supplemental Fig. 1; additional details of the methodology
employed here are publicly available—see Acknowledgments). For the present
study, GEO Series provides the most complete inventory of current
experiments within GEO and these are also most closely aligned with
ArrayExpress Repository experiments. Hence, ArrayExpress Repository and
GEO Series experiments were the focus of the present chemical-indexing
efforts.
360
WILLIAMS-DEVANE, WOLF, AND RICHARD
ArrayExpress. Due to its large size (over 6300 experiments at the time of
data extraction), limited and unstructured chemical annotation, and dynamic
content (updated regularly with new experiments), the review and annotation
process for ArrayExpress involved several iterative steps for identification and
characterization of chemical treatment experiments within the main database
repository. Initially, a bulk download of all data housed in the repository from
the main web site ((http://www.ebi.ac.uk/arrayexpress) was undertaken with
a wildcard query in the accession number query box (i.e., to retrieve all
experiments). The resulting records were individually reviewed and a preliminary index of chemical information and <Indications of a Chemical
Exposure Record> was constructed. The latter field included any detail deemed
as potentially useful for discerning whether a record pertained to a chemical
treatment experiment, for example, designations in the ArrayExpress
<Experimental_Type> field, such as ‘‘dose,’’ ‘‘treatment,’’ etc. This preliminary chemical index was used to identify chemical-associated and chemical
‘‘treatment’’ experiments, to infer the minimum information necessary to
identify such records from within ArrayExpress, and to build and test an
automated indexing capability using custom Perl scripts (http://www.perl.com).
Through an iterative process, scripts were refined to achieve better success at
detecting ‘‘true’’ chemical treatment experiments, verified by manual review
according to our definition above. Perl scripts for keyword text searching and
filtering were combined with manual curation methods to construct ArrayExpress Repository chemical-index files from website content downloaded on
September 20, 2008.
Gene Expression Omnibus. GEO contains user-deposited dynamic
content, and limited and unstructured chemical annotation. Hence, a manual
method similar to that employed in the review of ArrayExpress was initially
required. All data were downloaded from the GEO homepage in the GSE Series
format. Each of the GEO Series was manually reviewed for chemical content
and this information was used to construct an index of the chemical information
and <Indications of a Chemical Exposure Record>. As in ArrayExpress, the
<Indications of a Chemical Exposure Record> field contained details to aid in
discerning whether a record pertained to a chemical treatment experiment.
From this chemical-experiment index, the first chemical annotation of GEO
was completed. Similar to the annotation of ArrayExpress, this manually
curated chemical index was used to test and refine automated curation
approaches that were applied to subsequent versions of the GEO Series
inventory. Several automated methods were developed using NCBI Entrez
Programming Utilities (E-Utilities) (http://www.ncbi.nlm.nih.gov/entrez/query/
static/eutils_help.html), an XML version of the U.S. National Library of
Medicine’s (NLM) chemical Medical Subject Headings (MeSH) library (http://
www.ncbi.nlm.nih.gov/sites/entrez?db ¼ mesh), and a series of custom Perl
scripts to parse through a complete XML version of the GEO Series database.
The chemical index of GEO Series was completed using a series of Perl scripts
that call on E-Utilities, combined with manual curation methods, and was based
on content downloaded on September 20, 2008.
Chemical index. The main result of the above process was to produce
a static, preliminary chemical index for all chemical-associated microarray
experiments in ArrayExpress and GEO, in which the subset of chemical
‘‘treatment’’ experiments were identified. These preliminary index files took the
form of a list of minimal chemical identifiers (most often chemical names only)
directly extracted from the user-deposited information in these two resources.
These chemical-experiment index files subsequently underwent a rigorous
cleanup and chemical quality review, using source (submitter)–provided
chemical information and contextual text descriptions to definitively identify
the chemical substance and its relationship to the experiment (e.g., treatment,
vehicle, reference). The generally poor quality and consistency of chemical
information contained in ArrayExpress and GEO submitter-supplied description fields, the high frequency of abbreviations and spelling errors, and
the lack of chemical identifiers such as Chemical Abstracts Service Registry
Numbers (CASRN; http://www.cas.org/) or EBI’s Chemicals of Biological
Interest (ChEBI; http://www.ebi.ac.uk/chebi) identifiers, all prevented greater
application of automated text-mining and chemical name-to-structure conver-
sion capabilities. In addition, the need to accurately discern the role of the
chemical in the experiment (i.e., treatment, etc.) from the free-text description
prevented use of more efficient automated methods.
DSSTox Standard Chemical Fields were assigned to the chemical-index files
according to established procedures (http://www.epa.gov/ncct/dsstox/
ChemicalInfQAProcedures.html). These fields allow for standardized representation of both the test substance (‘‘TestSubstance’’ fields) and the chemical
structure (‘‘STRUCTURE’’ fields) in relation to any chemical-associated
experiment record. DSSTox Standard Chemical Fields include chemical name,
CASRN (if available), and test substance description (e.g., single chemical
compound, macromolecule, mixture or formulation, etc). Where the test
substance is not overly large (> 1800 amu) and can be reasonably represented
by a single molecular structure, ‘‘STRUCTURE’’ fields are provided. These
include a public standard, ‘‘molfile’’ of the chemical structure (a twodimensional projection of the three-dimensional structure) assigned to the
substance, several fields automatically derived from the ‘‘molfile’’ structure
(i.e., molecular weight, formula, IUPAC name, SMILES, SMILES_Parent,
InChI, InChIKey), chemical type (i.e., defined organic, inorganic, organometallic), and a field indicating the relationship of the STRUCTURE to the
TestSubstance (i.e., tested chemical, representative isomer in mixture, active
ingredient in a formulation, etc.) (for more information, see http://www.epa.
gov/ncct/dsstox/MoreonStandardChemFields.html). Assessment of chemical
overlap between GEO and ArrayExpress DSSTox chemical-index files was
determined on the basis of DSSTox TestSubstance identifiers.
RESULTS
Over 42 public Internet resources housing microarray data of
potential toxicogenomics relevance were initially identified
from various categories (Microarray World list of databases,
http://www.microarrayworld.com/DatabasePage.html). From
this list, we identified eight resources containing chemicalexposure–related content, and divided these into two categories: primary and secondary genomics resources. Primary
genomics resources consist of the three MIAME-supportive,
MGED-approved gene expression repositories: NCBI’s GEO,
EBI’s ArrayExpress, and the Center for Information Biology
Gene Expression (CIBEX) database (see Table 1 for listing of
Sources, URLs, and references). Secondary genomics resources consist of five additional public genomics resources of
potential toxicogenomics relevance that contain data gathered
from chemical-exposure experiments in one or more laboratories (see Table 1 for listing of Sources, URLs, and references).
A selection of public cheminformatics resources potentially
useful for supporting a public toxicogenomics capability are
listed in Supplemental Table 1. A brief description of survey
results are given for each data resource below, followed by
chemical-indexing results for the two major resources,
ArrayExpress and GEO.
ArrayExpress Repository
ArrayExpress is the largest user-depositor data repository
and MIAME-supportive public archive of microarray data in
Europe, consisting of two parts—ArrayExpress Repository and
the ArrayExpress Data Warehouse (Table 1). The ArrayExpress Repository currently exceeds 6900 experiments, and is
361
CHEMICAL INDEXING OF TOXICOGENOMICS RESOURCES
TABLE 1
Primary and Secondary Genomics Data Resources with Content of Potential Use for Toxicogenomics
Database
Primary genomic
resources
ArrayExpress
European Bioinformatics Institute
(EBI); www.ebi.ac.uk/
microarray-as/ae/
GEO
National Center for Biotechnology
Information (NCBI), National
Institutes of Health; www.ncbi.
nlm.nih.gov/geo
DNA Data Bank of Japan (DDBJ),
National Institute of Genetics;
http://cibex.nig.ac.jp/
McArdle Laboratory for
Cancer Research, University of
Wisconsin-Madison; http://
edge.oncology.wisc.edu/edge3.php
National Institute of Environmental
Health Sciences (NIEHS); http://
cebs.niehs.nih.gov/cebs-browser/
Center for Genetic Medicine
Research; http://pepr.
cnmcresearch.org/
Department of Biochemistry &
Molecular Biology, Michigan
State University; http://
dbzach.fst.msu.edu
Mount Desert Island Biological
Laboratory; http://ctd.mdibl.org
CIBEX
Secondary genomic
resources
Source/URL
EDGE
CEBS
PEPR
dbZach
CTD
indexed by Experiment Array Design and Protocol. Experiments can be queried by Keyword, Experimental Accession
Number, Species, Experiment Type and Factors, Author,
Laboratory, and Publication information (http://www.ebi.
ac.uk/microarray-as/aer/entry). Repository data are cataloged,
assessed for completeness, and assigned a MIAME score that
represents the degree of MIAME compliance. The ArrayExpress Data Warehouse is based on more limited processed data
results from the ArrayExpress Repository, currently contains
740 Expression Profiles (website accessed on November 14,
2008), and allows users to browse curated datasets from both
a gene- and/or experiment-centric view. ArrayExpress has
incorporated significant experimental content from GEO,
which can be located within ArrayExpress by GEO Accession
Identifiers.
At the time of survey, ArrayExpress was not chemically
indexed, nor did it contain additional information about the
chemical tested other than the infrequently provided CASRN
or ChEBI number. Chemical information may be located in the
user-supplied protocols and free-text experimental description,
or can be searched with the advanced query tools from
ArrayExpress, including a keyword or text search in the
Description field in ‘‘Query for Experiments.’’ These can also
be combined with specifications of <Experiment type> ¼
References
Public data deposition
Programmatic access
Ball et al., 2004;
Brazma et al., 2006;
Parkinson et al., 2007;
Rustici et al., 2008
Barrett and Edgar, 2006;
Barrett et al., 2007;
Wheeler et al., 2008
Yes
Yes (XML)
Yes
Yes (E-Utilities)
Tateno and Ikeo, 2004
Yes
No
Hayes et al., 2005
No
No
Fostel et al., 2005;
Waters et al., 2003; 2008
Yes
No
Chen et al., 2004
No
No
Burgoon et al., 2006
No
No
Davis et al., 2009
No
No
‘‘compound treatment’’ or ‘‘dose response,’’ but these latter
annotations are optionally utilized and not consistently applied
by depositors to all chemical treatment experiments in the
database. Chemical information can also be embedded within
the ArrayExpress Sample-Data Relationship File (http://
tab2mage.sourceforge.net/docs/sdrf.html).
In 2002, ArrayExpress introduced the Tox-MIAMExpress
data entry method, optionally employed by data submitters to
store toxicogenomics data in an effective manner (http://
www.ebi.ac.uk/miamexpress/). Tox-MIAMExpress was later
discontinued; however, the ArrayExpress Accession Number
Code, TOXM, designated to identify experiments for this
purpose, is still available for use when requested by data
submitters. Currently, the optional TOXM label is assigned to
fewer than 25 experiments, but in these cases, typically more
chemical identifier information, such as a CASRN and/or
a ChEBI number, is provided by the submitter along with
additional information recommended by the MIAME/Tox
initiative (http://www.ebi.ac.uk/tox-miamexpress).
Gene Expression Omnibus
GEO is the largest user-depositor data repository and
MIAME-supportive public archive of microarray data in the
U.S. (Table 1), containing data from approximately 10,000
362
WILLIAMS-DEVANE, WOLF, AND RICHARD
experiments at the time of this writing. In GEO, raw and/or
processed data can be exported through the ftp website as well
as through the main GEO Series website. User information,
however, is entered using a free-text format that is subsequently curated. GEO allows for a wide range of informed
queries with the Preview/Index window, where users can select
data based on choices for each attribute of the experiment. The
GEO repository has three key components: ‘‘Platform,’’
‘‘Sample,’’ and ‘‘Series.’’ ‘‘Platform’’ provides a description
of the array used in the experiment, as well as a data table
defining the array template. The data table contains hybridization measurements for each element of the corresponding
platform. ‘‘Sample’’ provides a description of the biological
source and the experimental protocols. ‘‘Series’’ defines a set of
related samples considered to be part of a study and describes
the overall study aim and design. GEO has a complex,
hierarchical structure that works with the NCBI E-Utilities,
allowing one to query by submitter, organism, platform,
sample type, sample titles, and release date. Similar to
ArrayExpress, GEO hosts a smaller warehouse-type addition
named ‘‘GEO Datasets and Profiles’’ containing processed,
curated datasets that can be explored from both a gene- and/or
experiment-centric view.
Also, similar to Array Express, GEO is not chemically
indexed nor does it consistently contain information about the
chemical tested. Chemical names may be located in the
submitter-deposited GEO Data Series fields—Title, Summary,
Citation, or Samples—and are not consistently present in any
single field. Chemical names are provided by the submitter, are
rarely accompanied by CASRN or ChEBI identifiers, and do
not undergo curation or review. Hence, as is the case with
ArrayExpress, there is no easy or reliable way to identify
a chemical-exposure–related experiment, there is no central
listing of chemical content and, in both resources, we find that
the chemical names embedded within user-deposited description fields are highly variable, prone to errors and misspellings,
and frequently incorporate nonstandard abbreviations.
Center for Information Biology Gene Expression
The CIBEX database is a Japanese gene expression
MIAME-supportive, MGED-approved user-depositor system
(Table 1) that primarily serves experimenters from Asian
countries. It is included for completeness sake, but currently
does not contain significant chemical treatment content.
However, the experimental protocol and detail standardization
are noteworthy, with each record accompanied by a document
containing full MIAME details. There is also a high level of
curation and collaboration between CIBEX administrators and
depositors that allows for missing information to be identified
before publication, as well as for a high level of standardization
and accuracy.
At the time of this writing, CIBEX contains 32 experiments,
only one of which is a chemical-exposure experiment, with
CBX14 clearly labeled in the <Experiment Design Type> field
as ‘‘compound_treatment_design.’’ Despite the high degree of
standardization of this resource, however, there is currently no
formal chemical annotation field accompanying a chemical
treatment experiment.
Environment, Drug, Gene Expression Database
The EDGE database (Table 1) is a closed (i.e., not open to
public user-deposits of data), curated system designed for the
comparison, analysis and distribution of toxicogenomics
information in a relational format. EDGE is chemical treatment
centric and chemically indexed, with a toxicological focus. All
experiments were performed in the Bradfield Laboratory using
a standardized protocol involving custom cDNA arrays of
minimally redundant hepatic clones, chosen through chemicalexposure experiments with prototype hepatic toxicants: 2,3,7,8,
tetrachlorodibenzo-p dioxin (TCDD), cobalt chloride, and
phenobarbital. The experimental conditions include 22 chemical treatments, 4 control treatments, and 1 environmental
stressor (fasting) over 1 mutant (circadian wild-type control).
All chemical treatments were chosen for the express purpose of
investigating transcriptional profiles pertaining to hepatotoxicity in mice.
Despite its small size and limited focus, EDGE incorporates
a high level of standardization and comparability across
species, array, experimental protocol, and experimental details,
and demonstrates how a fully relational database built on such
data can facilitate toxicogenomics investigation. However,
EDGE is not a user-depositor system and currently lacks the
tissue, species, and chemical diversity necessary for broader
toxicogenomics exploration.
Chemical Effects in Biological Systems
CEBS is a public user-depositor data repository with an
explicit toxicological and toxicogenomics focus (Table 1).
CEBS can accommodate study design, timeline, clinical
chemistry, and histopathology findings, as well as microarray
and proteomics data. Each experiment in CEBS pertains to
a chemical/environmental exposure or a genetic alteration in
reference to clinical or environmental studies. CEBS has
a complementary functional component known as the Biomedical Investigation Database (BID) (https://dir-apps.niehs.
nih.gov/arc/), which is a relational database used to load and
curate study data prior to exporting to public CEBS. BID also
aids in the capture and display of novel data, including PCR
and toxicogenomic-relevant fields, as used in ArrayExpress’s
TOXM designation.
CEBS is currently indexed by study and subject characteristics, such as environmental, chemical, or genetic stressor and
stressor protocol, and includes observations on rat, mouse, and
C. elegans. CEBS is one of the few genomics resource profiled
in this survey, and the only resource with significant
toxicogenomics-relevant microarray content, that incorporates
formal chemical name annotation of experiments. At the time
CHEMICAL INDEXING OF TOXICOGENOMICS RESOURCES
of this writing, CEBS lists an inventory of 136 chemical
names, or ‘‘chemical stressors,’’ associated with experimental
content, along with a searchable CASRN field containing 121
entries. CEBS plans to incorporate additional chemical standards, including structure annotation, in collaboration with the
EPA DSSTox project.
Public Expression Profiling Resources
Similar to EDGE, the PEPR database (Table 1) is a closed,
curated system designed to serve as a public resource of gene
expression profile data generated in the same laboratory, using
the same chip type for three species, and subject to the same
quality and procedural controls. PEPR is aimed at providing
a standardized warehouse for the analysis of time-series data.
The high degree of standardization within PEPR grants users
comparability across arrays without laboratory and array bias,
much like EDGE. PEPR adheres to quality control and
standard operating procedures and is indexed by Principle
Investigator, Tissue type, Experiment, and Organism, but has
a very few chemical treatment–related experiments and lacks
relational searching capabilities. However, the time-series
query analysis tool (SGQT) enables the novel generation of
graphs and spreadsheets showing the action of any transcript of
interest over time. PEPR also differs from EDGE in the
extensive data export options that include raw image files
(.dat), processed image files (.cel) and interpretation files (.txt).
PEPR also has external links to GEO, where PEPR data are
mirrored through an automated export/import process.
In PEPR, chemical information is stored in free-text fields
such as the title, description, and array titles, similar to
ArrayExpress and GEO. At the time of this writing, PEPR
contains 72 experiments, of which 10 are determined to be
chemical/environmental exposure experiments. Hence, PEPR
currently covers very limited chemical space, but the SGQT
tool for analysis of time-series microarray data, as well as the
standardized chemical-exposure experiments are of potential
value for toxicogenomics studies.
DbZach
dbZach, a laboratory tool offered for local installation, is of
interest as a modular MIAME-compliant, toxicogenomicsupportive relational database designed to facilitate data
integration, analysis, and sharing in support of mechanistic
toxicology and toxicogenomics studies (Table 1). dbZach
consists of several subsystems for the standardization of all
data elements of a toxicogenomics experiment as well as
traditional toxicological experiments and, additionally, has
built-in functionality for data import and export of both raw
data and processed data. Similar to EDGE, the dbZach project
has created a sophisticated relational data environment for
integrating and exploring many aspects of a toxicogenomics
study. However, also similar to EDGE, this project is very
narrowly focused in chemical space and primarily limited to
estrogen and estrogenic chemicals.
363
Comparative Toxicogenomics Database
The CTD is worthy of mention for its toxicogenomics
relevance, but is not a traditional genomics database (Table 1).
Rather, it is a database of curated relationships between
chemicals, genes, and diseases mined from journal articles.
CTD provides text-mineable access to the toxicogenomic
literature, but currently provides direct linkage to only one
secondary genomics resource, that is, EDGE.
Also worthy of note, CTD uses the chemical subset of the
NLM MESH vocabulary to provide formal chemical annotation of its content and to link to various chemically indexed
toxicology resources (http://ctd.mdibl.org/resources.jsp?type ¼
chem). The present CTD inventory of over 4400 chemical
substances also has recently been deposited into the NCBI
PubChem resource (http://pubchem.ncbi.nlm.nih.gov/; Supplemental Table 1) to offer structure-searchability and broader
access to chemically indexed resources.
Table 2 compares the above inventory of genomics resources
from the standpoint of being chemically indexed (i.e., chemical
identifiers are required and entered in standard fields),
MIAME-supportive, and standardized with respect to various
experimental descriptions. Additional details on the comparison of the primary and secondary genomics resources identified
in this study with respect to the types of gene expression data
stored, toxicological focus, formats of data available for
download (raw or processed), ability to query data, ability to
import or export experimental data, and programmatic access
are presented in Supplemental Table 2.
Web-based queries and programmatic access were used in
the present study to extract current experimental content from
ArrayExpress Repository and GEO Series, and to identify
corresponding experiment annotation fields (resulting from
adherence to MIAME guidelines in the two systems) that could
be mapped to common fields to enable comparisons across the
two inventories. We implemented a set of 14 Standard
Genomics Fields in Table 3 to serve this purpose and to
confer read-across capability between the two inventories. All
but two of these fields map to existing MIAME-compliant data
fields, which vary only slightly in name in GEO and
ArrayExpress (see expanded columns in Supplemental Table
3) and, thus, are straightforward to implement. One new field,
‘‘Experiment_URL,’’ contains a static URL link to enable
outside Internet access directly to the experiment accession
summary page in either ArrayExpress or GEO. The last field,
‘‘Chemical_StudyType,’’ has no corresponding field in either
ArrayExpress or GEO, and was introduced by us to begin to
address the currently missing chemical annotation layer for
gene expression experiments in both resources.
DSSTox chemical-index files for GEO and ArrayExpress
created by the above methods are publicly available for
download from the DSSTox website (http://epa.gov/ncct/
dsstox/). In addition to the main DSSTox chemical-index files
that include one record for each unique chemical (i.e., unique
364
WILLIAMS-DEVANE, WOLF, AND RICHARD
TABLE 2
Standardization and Indexing of Genomics Data Resources
Standardizedb
Data resourcea
Indexed by
chemical
MIAME-supportive
Species
Array
information
Experimental
protocol
Experimental
details
Allows relational
searchingc
ArrayExpress
GEO
CIBEX
EDGE
CEBS
PEPR
dbZach
CTD
þ
þ
þ
þ
þ
þ
þ
þ
þ
þ
þ
þ
þ
þ
þ
þ
þ
þ
þ
þ
þ
þ
þ
NA
þ
þ
þ
þ
þ
NA
þ
þ
þ
þ
þ
þ
þ
þ
þ
a
Refer to Table 1 for full names, sources, URLs, and references associated with these data resources; feature present (þ) or absent ().
Standardized entries refer to internal content adhering to controlled vocabularies, and represented in defined and required fields; NA, not applicable.
c
Relational searching refers to the ability to construct AND/OR-type queries across the content of defined fields.
b
test substance) in the ‘‘Treatment’’ category in each of the two
repositories, we have published Auxiliary files that include
DSSTox Standard Chemical Fields, Standard Genomics Fields
(14), and additional Source-specific experiment description
fields (33 for ArrayExpress, 4 for GEO) for the full chemicalassociated experiment inventories in the two resources (i.e.,
one record for each chemical-experiment pair). Detailed
descriptions of the content of these files and their incorporation
into the DSSTox Structure-Browser (http://www.epa.gov/
dsstox_structurebrowser/) and PubChem, the results of which
enable structure-based Internet linkages directly to ArrayExpress and GEO experiment summary pages, are provided
elsewhere (Williams-Devane et al., 2009).
Table 4 provides a breakdown of the current chemicalassociated experimental content within ArrayExpress Repository and GEO Series according to all Chemical_StudyType
TABLE 3
Standard Genomics Fields for Common Indexing of Experiments Contained in ArrayExpress Repository and GEO Series
Field name
Experiment_Accession
Experiment_AlternativeAccession
Experiment_IdNumber
Experiment_Title
Experiment_Description
Experiment_URL
Experiment_PubMed_Information
Experiment_PublicationDate
Species
Number_Samples
Experiment_ArrayAccession
Experiment_ArrayType
Experiment_ArrayTitle
Chemical_StudyType:a
Reference
Treatment
Vehicle
Combination_Treatment
Media
Not_Enough_Information
a
Description
A unique combination of informative prefix and number used to identify each dataset.
An alternate accession number. Example: GEO files in ArrayExpress have GSE#### (GEO Series) secondary Accession
number for users
to find the same data in GEO).
A unique identification number for each experiment within each database.
The title of the experiment.
A free-text, user-submitted description of the experiment or dataset.
URL links to the Source Experimental Download Page.
A unique number that links users to each PubMed publication associated with each experiment or dataset.
Date indicating when the dataset was released to the public or published.
Species as listed by the user.
Number of samples used within a microarray experiment or dataset.
An accession number for each array design or platform.
Details about the platform used or details about data other than raw data that users have submitted.
The user-submitted title of the Array/Platform used in the experiment.
A designation of the role of the identified chemical in the given experiment. Allowed entries are listed as Subsections to
this field (e.g., Reference, Treatment, Vehicle, . . .).
Chemical used to mimic a biological or environmental situation.
The primary focus of experiment or study is to understand the transcriptomic effects of the chemical.
Chemical used to aid the administration of the treatment to the organism, such as dimethyl sulfoxide.
Multiple chemicals used together for treatment purposes (see ‘‘Treatment’’ above).
Chemical used in maintenance of the tissue culture or sample conditions, such as phosphate buffered saline.
Sufficient information is not present in the experimental description to determine the role of the chemical.
Subsections to the Chemical_StudyType field have allowed entries: Reference, Treatment, etc., with linkage text ‘‘AND’’ used for combinations (e.g.,
TreatmentANDReference).
365
CHEMICAL INDEXING OF TOXICOGENOMICS RESOURCES
TABLE 4
Classification of Chemically Indexed Genomics Experiments in ArrayExpress and GEO by ‘‘Chemical_StudyType’’
Databasea
ArrayExpress Repository
GEO Series
Chemical_StudyType Classificatione
Breakdown of Total no. of Chemical-Experiment Records (Unique Chemicals)f
Total no. of
ChemicalCombination
Total no. of Experiment
Total no. of
Multiple
Treatmentg Classifications Other
Experimentsb Recordsc Unique Chemicalsd Treatmentg Reference Vehicle Media
6346
9957
2365
2381
1011
1064
1609 (810) 266 (157) 138 (26) 111 (68)
1951 (838) 152 (60)
81 (48) 14 (14)
109 (91)
72 (38)
118 (83)
111 (67)
14 (10)
0 (0)
a
All numbers relate to database content extracted on September 20, 2008.
Total number of experiments contained in the public resource (also corresponds to the number of unique Accession IDs).
c
Number of Chemical-Experiment pairs extracted from the Total no. of Experiments prior to determination of the Chemical_StudyType Classification, where
some experiments in the Total no. of Experiments map to no chemicals, and some experiments involving multiple chemicals map to more than one ChemicalExperiment record.
d
Total number of unique chemical test substances (i.e., no chemical test substance identity is duplicated) identified in the total group of Chemical-Experiment
records, irrespective of Chemical_StudyType Classification.
e
Definitions of Chemical_StudyType Classifications are provided in Table 3.
f
Number of Chemical-Experiment Records corresponding to each Chemical_StudyType category (with corresponding number of unique chemicals in
parentheses), where for the purposes of this table one record is assigned to one category and if the chemical is used for different purposes within one experiment
(e.g., TreatmentANDReference), it is assigned to the ‘‘Multiple Classifications’’ category.
g
Number of Chemical-Experiment Records (with corresponding number of unique chemicals in parentheses) out of the total group of Chemical-Experiment
Records that are associated with the ‘‘Treatment’’ category according to the criteria for a chemical-exposure scenario set forth in this paper; any record labeled as
‘‘Treatment’’ or ‘‘CombinationTreatment’’ (alone or in combination with other Chemical_StudyType labels, e.g., TreatmentANDReference), are included in the
final DSSTox chemical-index file.
b
categories (all counts correspond to data extracted on September
20, 2008). Of the 6346 total ArrayExpress experimental
descriptions downloaded, more than a third (2365) were
identified as chemical-associated experiments by the procedures
outlined in the Methods section, corresponding to 1011 unique
chemical test substances. Similarly, of the 9957 GEO Series
experimental descriptions downloaded, nearly a quarter (2381)
were identified as chemical-associated experiments, corresponding to a total of 1064 unique chemical test substances (Table 4).
Table 5 provides a breakdown of the ‘‘Treatment’’associated experimental content within the ArrayExpress
Repository and GEO Series according to DSSTox chemical
classification categories. Of the 1835 total ‘‘Treatment’’associated experiments in the ArrayExpress Repository, 1282
experiments (or 70% of the total) are associated with a ‘‘defined
organic’’ chemical test substance (note that multiple experiments can map to the same chemical). GEO Series contains
a similarly high percentage of ‘‘Treatment’’ experiments
associated with a defined organic chemical test substance, that
is, 1544/2134, or 72%. The above indicators give a rough sense
of the size of the inventory of microarray experiments
associated with defined organics in the public domain.
Table 5 also provides indications of the size of the chemical
space associated with these ‘‘Treatment’’ experiments. Of the
total number of unique chemical test substances associated
with the ‘‘Treatment’’ category of experiments in ArrayExpress
Repository, 628/887, or 71% correspond to defined organics.
Although these include some drugs, small peptides and
biologics with molecular weights ranging 600–1700 amu, the
majority (> 90%) are small molecular weight (< 500 amu)
organic chemicals for which a chemical structure can be
assigned and that tend to be of greatest interest for environmental toxicology and structure-activity relationship models and
inferences. A similar percentage applies to GEO, that is, 751/
1014, or 74% of unique chemical test substances associated with
‘‘Treatment’’ experiments correspond to defined organics.
Hence, both resources span a relatively large number of unique
defined organic chemicals, which implies a broad chemical
diversity associated with public microarray experiments. Within
ArrayExpress, the chemical that maps to the largest number of
chemical-experiments is ‘‘estradiol,’’ occurring in 53 experiments, 44 of which are classified as ‘‘Treatment’’ experiments.
Comparison of GEO and ArrayExpress Experimental and
Chemical Content
Application of DSSTox Standard Chemical Fields and the
set of 14 Standard Genomics Fields enable direct comparison
of GEO Series and ArrayExpress Repository experimental and
chemical content. In addition, the DSSTox Auxiliary files for
ArrayExpress include a number of easily extracted field
characteristics affiliated with each experiment, including
Array/Platform type, Species, and the MIAME Score and its
five subcategories: Array or Platform information, Factor
information, Raw Data information, Processed Data information, and Protocol information. The latter annotations are
particularly valuable for assessing the sufficiency of the
experimental data for reanalysis.
The distribution of ArrayExpress ‘‘Treatment’’ chemicalexperiments assigned to these various categories of experimental
description is provided in Table 6. The distribution of MIAME
366
WILLIAMS-DEVANE, WOLF, AND RICHARD
TABLE 5
Classification of Chemically Indexed ‘‘Treatment’’ Genomics Experiments in ArrayExpress and GEO by DSSTox Chemical
Classification
Databasea
ArrayExpress
repository
GEO series
DSSTox Chemical Classificatione
Breakdown for ‘‘Treatment’’ Chemical-Experiment
Records (Unique Chemicals)f
Total no. of
ChemicalExperiment
Recordsb
Total no. of
‘‘Treatment’’
ChemicalExperiment
Recordsc
Total no. of
Unique Chemicalsd
No structureg
Defined organic
Inorganic
Organometallic
2365
1835
887
373 (179)
1282 (628)
153 (60)
27 (20)
2381
2134
1014
346 (173)
1544 (751)
210 (71)
34 (19)
a
All numbers relate to database content extracted on September 20, 2008.
See Table 4.
c
Total number of Chemical-Experiment records assigned to any ‘‘Treatment’’ Chemical_StudyType (e.g., Treatment, CombinationTreatment,
Treatment&Reference, etc.) according to the criteria for a chemical-exposure scenario set forth in this paper.
d
Total number of unique chemical test substances (i.e., no chemical test substance identity is duplicated) identified in the total group of ‘‘Treatment’’ ChemicalExperiment Records.
e
Refers to DSSTox Standard Chemical Field Definition and allowed entries for STRUCTURE_ChemicalType (http://www.epa.gov/ncct/dsstox/
CentralFieldDef.html#STRUCTURE_ChemicalType).
f
Number of ‘‘Treatment’’ Chemical-Experiment Records corresponding to each Chemical Classification category (with corresponding number of unique
chemical substances in parentheses), where each record maps to a single chemical classification and the list of unique chemicals for this ‘‘Treatment’’ subset of
experiments constitutes the final DSSTox structure-index file.
g
Number of ‘‘Treatment’’ Chemical-Experiment Records (with corresponding # of unique chemicals) where the chemical test substance is identified, but not
assigned to a DSSTox chemical structure, for example, this can be an undefined mixture, polymer, or macromolecule.
b
scores is particularly illuminating. A total MIAME Score of
5 indicates that all components of the MIAME compliance
criteria have been included by the submitter. Only 18% (or 216)
of the chemical treatment experiments in the ArrayExpress
Repository have all five components of information, whereas
50% (or 596) have four components of information. Most
noteworthy for this subset of ‘‘Treatment’’ experiments, however, raw data information is missing for 29% (or 347), processed data is missing for 11% (or 131), and protocol is missing
for 21% (or 291) (Table 6). Given that these are essential
experimental components for the reanalysis of gene expression
data, these numbers limit the number of chemical treatment
experiments within ArrayExpress that are potentially useful for
broader toxicogenomics investigation.
The ArrayExpress Repository has experienced steep growth in
the past few years, largely as a result of the integration of GEO
experimental content (approximately 4500 experiments were
added from January 2007 to January 2008). ArrayExpress files
with E-GEOD-XXXX accession numbers mirror GEO Series
entries and currently represent more than 50% of the chemicalexposure, or ‘‘Treatment’’ experiments in the ArrayExpress
Repository (Fig. 1). Figure 1 also shows that the total number of
chemical-experiment pairs (a pair being a 1:1 mapping of
chemical to experiment) and total number of ‘‘Treatment’’
chemical-experiment pairs identified in the current study are
comparable between ArrayExpress and GEO, with greater than
50% overlap of chemical-experiment pairs in all categories.
Unlike ArrayExpress, GEO Series currently provides no
MIAME scoring of content. However, because a significant
portion of the ‘‘Treatment’’ experiments represented in GEO
Series has been incorporated into the ArrayExpress Repository,
it was possible to create a table summarizing these experimental
factors for the subset of GEO chemical ‘‘Treatment’’ experiments contained within ArrayExpress (Table 6). Only 11% (or
81) of the GEO records in ArrayExpress are assigned a MIAME
Score of 5; however, 56% (or 415) have a MIAME score of 4. A
much greater percentage, 45% (or 335) of GEO records in
ArrayExpress, have corresponding Raw Data, whereas 100% (or
745) have Processed Data (most likely a precondition for
inclusion of GEO experiments in ArrayExpress).
Figure 2 presents overlap of the unique chemical content
pertaining to the ‘‘Treatment’’ chemical-experiment category.
Assessment of chemical overlap between GEO and ArrayExpress
DSSTox files was determined on the basis of DSSTox
‘‘TestSubstance’’ identifiers. The steroids, estradiol and dexamethasone, are associated with the largest numbers of microarray
experiments in both cases, and the largest number of shared
experiments as well. Other test substances most commonly
associated with experiments in either GEO or ArrayExpress
include Ethanol, 2,3,7,8-TCDD, Retinoic Acid, and Trichostatin,
each of which is of broad toxicological interest.
Assessing Toxicological relevance of GEO and ArrayExpress
Chemical Content
DSSTox chemical structure annotation enables, for the first
time, an examination of the chemical diversity and coverage of
GEO Series and ArrayExpress Repository experiments. We
CHEMICAL INDEXING OF TOXICOGENOMICS RESOURCES
TABLE 6
Characteristics of the ArrayExpress Repository pertaining to
‘‘Treatment’’ Experiments (based on Data Extracted on
September 20, 2008)
Characteristics
Array/platform
Species
MIAMEScore_Totalb
MIAMEScore_Arrayc
MIAMEScore_Factord
MIAMEScore_RawDatae
MIAMEScore_
ProcessedDataf
MIAMEScore_Protocolg
Major
characteristic
value
Affymetrix
Agilent
Other
Homo sapiens
Mus musculus
Rattus
Arabidopsis
Saccharomyces
cerevisiae
Other
5
4
3
2
1
0
1
0
1
0
1
0
1
0
1
ArrayExpress
GEO series from
ArrayExpress
Number (%) of
‘‘treatment’’
experimentsa
Number (%) of
‘‘treatment’’
experimentsa
861 (73%)
82 (7%)
238 (20%)
377 (32%)
317 (27%)
173 (15%)
159 (13%)
55 (5%)
691(93%)
54 (7%)
0 (0%)
264 (35%)
220 (30%)
126 (17%)
76 (10%)
15 (2%)
100 (8%)
216 (18%)
595 (50%)
309 (26%)
55 (5%)
6 (1%)
78 (7%)
1103 (93%)
551 (47%)
630 (53%)
347 (29%)
834 (71%)
131 (11%)
1050 (89%)
291 (21%)
890 (79%)
44 (6%)
81(11%)
415 (56%)
211 (28%)
38 (5%)
0 (0%)
0 (0%)
745 (100%)
421 (57%)
324(43%)
335 (45%)
410 (55%)
0 (0%)
745 (100%)
195 (26%)
550 (74%)
a
Note that the total number of ‘‘Treatment’’ experiments (or studies) will be
less than the total number of ‘‘Treatment’’ chemical-experiment pairs in Table
5 due to inclusion of experiments/studies that have tested multiple chemicals
(and/or used multiple platforms, etc.).
b
The Total MIAME score ranges from 0 to 5 and is a sum of the independent
scores of the five subcomponent scores, each of which takes on the value of
either 0 or 1 (absent or present).
c
Specific information about the design of the array or the platform used was
submitted (1) or not submitted (0) with the experiment by the submitter.
Included Array information is assigned an Array Accession number (see
Experimental_Accession, Table 3) within ArrayExpress.
d
A list of experimental factors was submitted (1) or not submitted (0) with
the experiment by the submitter; factors might include information on the cell
line or particular compounds and dose information used in the experiment.
e
Raw data was submitted (1) or not submitted (0) with the experiment by the
submitter.
f
Processed data was submitted (1) or not submitted (0) with the experiment
by the submitter.
g
Specific information about the experimental protocols used in the
experiment was submitted (1) or not submitted (0) with the experiment by the
submitter. Included Protocol information is assigned a Protocol Accession
number within ArrayExpress.
found significant numbers of experiments in both resources
mapped to families of similar chemicals, as well as to a broad
diversity of chemical structures, spanning a wide range of
367
toxicologically relevant chemical functional hierarchies and
classes (Supplemental Fig. 2).
A further metric of toxicological relevance is provided by
the overlap of unique ‘‘Treatment’’ chemical substances in
GEO and ArrayExpress with the current published DSSTox
inventory, which includes more than 10,000 unique chemical
substances, and spans a variety of environmentally and
toxicologically relevant chemical inventories and data sets
from various sources, including EPA, the National Toxicology
Program, and the U.S. Food and Drug Association (Richard
et al., 2008). At the time of this survey, more than 550 unique
chemical substances in the DSSTox GEO and/or ArrayExpress
files (GEOGSE and ARYEXP) corresponding to ‘‘Treatment’’
experiments are contained within one or more of the 11
previously published DSSTox Data Files (http://www.epa.gov/
ncct/dsstox/DataFiles.html), and there are a total of 1294
overlapping instances (i.e., some chemicals occur in multiple
DSSTox Data Files). Of these overlapping instances, three
chemical substances (Bisphenol A, di(2-ethylhexyl) phthalate
and dibutylphthalate) occur in eight DSSTox Data Files, and
a total of 309 chemical substances occur in two or more
DSSTox Data Files. These numbers indicate that significant
numbers of GEO and Array Express ‘‘Treatment’’ chemicalexperiments correspond to chemicals of potential toxicological
concern, for which additional in vitro or in vitro data may exist.
DISCUSSION
The term ‘‘chemogenomics’’ has been proposed to more
generally encompass the overlap of genomics technologies
with treatment-related chemical effects on biological systems,
including both toxicity-related and therapeutic effects (Fielden
and Kolaja, 2006). Chemogenomics adds a top-most chemical
layer to data organization, with broad chemical coverage of
standardized-protocol experiments a key requirement for
discerning activity patterns that can be confidently extrapolated
across chemical space. This approach and its implementation
are perhaps best exemplified by the Iconix DrugMatrixR
database and applications (Ganter et al., 2005). The Iconix
database consists of data generated for a single species (rat),
treated by more than 600 compounds in seven tissue types,
representing upwards of 3200 different drug-dose-time-tissue
combinations. The database covers five different domains of
data: microarray, clinical chemistry, hematology, organ weight,
and histopathology, and was built using a common microarray
platform and stringent experimental protocols and standards for
data generation and processing.
Whereas the Iconix database represents an ideal, practically
speaking, it is far removed from the reality of a public
microarray resource, upon which most public toxicogenomics
investigations must rely. In their role as primary repositories of
data associated with the published scientific literature, public
microarray data repositories such as GEO and ArrayExpress
368
WILLIAMS-DEVANE, WOLF, AND RICHARD
FIG. 1. Comparison of numbers of GEO Series and ArrayExpress Repository experiments, chemical-experiment pairs, and ‘‘Treatment’’ chemical-experiment
pairs, also showing overlapping content between the two systems; refer to totals and legends in Tables 4 and 5 (based on data extracted September 20, 2008).
cannot limit their content to include only experiments adhering
to strict common protocol standards and traditional model
organisms. A public data resource can, however, strive for
completeness and accuracy of experimental annotations and to
provide user-access to raw data for reanalysis. Similarly, the
accurate identification of a chemical in relation to an
experiment, particularly where the primary purpose of the
experiment is to discern effects of the chemical on a biological
system, should be considered as primary experimental
annotation and absolutely essential to experimental reproducibility. Whereas standardization and chemical indexing of
microarray experiments at the time of data deposition and
publication is the ideal, if minimally sufficient information (i.e.,
a valid chemical name, along with specification of the purpose
FIG. 2. Comparison of the total sets of unique chemicals pertaining to Treatment Chemical-Experiment pairs in ArrayExpress Repository and GEO Series
from the DSSTox data files; shown in each section are the chemicals mapping to the largest number of ‘‘Treatment’’ Chemical-Experiments in each case, with the
number of experiments shown in parentheses (GEO/ArrayExpress) (based on data extracted September 20, 2008).
CHEMICAL INDEXING OF TOXICOGENOMICS RESOURCES
of the chemical in relation to the experiment) is collected at the
time of data deposition in required data fields, formal chemical
indexing with structure annotation and quality review can be
performed efficiently with the appropriate chemical expertise in
collaboration with public efforts such as DSSTox and ChEBI
(Supplemental Table 1).
As the present survey has shown, although a number of
public microarray resources have the potential to support
toxicogenomics investigations, these resources currently represent a patchwork of disconnected or loosely connected
inventories and capabilities (Larsson and Sandberg, 2006),
having different goals, degrees of standardization, public data
accessibility, data mining ability, and utility for toxicogenomics investigation (Table 2; Supplemental Table 2). Primary
genomics resources (Table 1) serve as official MGEDsanctioned repositories for public gene expression data
associated with the scientific literature (Mattes et al., 2004;
Salter, 2005) with GEO and ArrayExpress, by far, the largest
and most important resources, currently. They both are
MIAME-supportive databases, meaning that they accept all
information about an experiment set forth by the MIAME
guidelines; however, they do not actually require this information. In addition, there is insufficient standardization
currently within GEO or ArrayExpress pertaining to protocol
or experimental description to fully support exploration within
and across these resources (Table 2). Secondary genomics
resources contain genomic-related data but generally have
more limited content and are designed for more specialized
purposes and applications (Table 1).
With its specific focus on toxicogenomics, attention to
chemical indexing, and addition of the BID system, CEBS is
worthy of special mention, having incorporated many elements
of an ideal toxicogenomics resource. To support robust
relational searching for toxicogenomics, CEBS has the added
task of capturing and systematizing user-deposited data
pertaining to a study. CEBS bridges the gap between an
open-access, user-depositor system and a relational, curated
database by instituting a high degree of standardization and
data controls that extend beyond MIAME guidelines (Fostel
et al., 2005). CEBS is striving for much larger coverage of
chemical space in relation to chemical treatment experiments.
In collaboration with DSSTox, and building on current
annotation efforts of GEO and ArrayExpress, CEBS will
provide structure-searching capabilities and chemical linkages
to external public resources, such as PubChem. In addition,
CEBS will provide direct access to GEO and ArrayExpress
‘‘Treatment’’ chemical-experiment content, as well as automated secondary deposition of CEBS content to GEO.
A most noteworthy deficiency of most secondary genomics
resources and the two primary genomics resources—ArrayExpress and GEO—highlighted in the present study, is the
complete lack of incorporation of chemical annotation and
standards that would allow aggregation of data for the same or
similar chemicals, and linkage to growing lists of chemically
369
indexed resources (Supplemental Table 2). Due to the lack of
chemical-reporting standards, the process for identifying
chemical treatment-related experiments in ArrayExpress Repository and GEO Series in this study was time-consuming and
difficult to automate (Supplemental Fig. 1 and Supplemental
Example 1). Present efforts serve to highlight deficiencies in
microarray experiment data deposition requirements and
standards with regard to chemistry and chemical treatment–
related experiments that, if better addressed, could greatly
facilitate chemical annotation and data integration efforts in the
future. With formal chemical annotation, it becomes possible to
assess the chemical coverage of public gene expression
databases, to link data for common or similar chemicals across
information domains, including toxicology, as well as to gather
data from comparable experiments, possibly performed in
different labs and species, that can begin to serve as the basis
for meta-analysis or structure-activity hypotheses. Furthermore,
the proposed set of Standard Genomics Fields, most of which
map to existing fields from both GEO and ArrayExpress, serve
to bridge the two resources and facilitate comparisons and
incorporation of their content into other resources in a standardized way.
CONCLUSION
It is hoped that the current exercise to create, publish, and
link chemical-index files for GEO Series and ArrayExpress
Repository has had two primary impacts: (1) to highlight
deficiencies in the current chemical annotation and curation
methods within ArrayExpress and GEO that particularly impact
toxicogenomics applications of these resources; and (2) to
show the way forward in terms of the potential benefits that can
be derived by incorporating robust chemical annotation and
linkages of chemical treatment-related content to these public
resources. Recently improved coordination of the EBI
ArrayExpress and ChEBI projects, whereby ChEBI provides
link-outs from chemical structure to particular ArrayExpress
experiments (currently only provided for a handful of experiments for which ArrayExpress data submitters provided ChEBI
identifiers), is a significant step forward and should immediately benefit from incorporation of the DSSTox ArrayExpress
chemical-experiment index file, as well as the addition of the
corresponding DSSTox GEO index file. However, as is
apparent from past failures, it is not sufficient to recommend
that users add accurate chemical information at the time of data
submission unless more stringent efforts to require this
information are instituted. In addition, we strongly recommend
adoption of the ‘‘Chemical_StudyType’’ categories, or something comparable, for each chemical-associated study or
experiment deposited into GEO and ArrayExpress. Finally,
recognizing that GEO and ArrayExpress are not designed
primarily as toxicogenomics resources, submitters of explicit
toxicogenomic study data should be strongly encouraged to
370
WILLIAMS-DEVANE, WOLF, AND RICHARD
initially deposit studies into CEBS as a way to ensure capture
of sufficient toxicogenomics experimental description, utilizing
the automated deposition capabilities of CEBS to secondarily
deposit well annotated, chemically indexed data into GEO.
Postscript: All of the initial published DSSTox chemical
files and results reported here were based on data extracted
from the ArrayExpress Repository and GEO Series on
September 20, 2008. Subsequent updates of both ArrayExpress
Repository and GEO Series chemical-index files, based on data
extracted on January 20, 2009 and February 2, 2009,
respectively, have been published on the DSSTox website
and incorporated into PubChem as of March 2009; these
updated files do not change the overall trends or conclusions of
the present study.
Davis, A. P., Murphy, C. G., Saraceni-Richards, C. A., Rosenstein, M. C.,
Wiegers, T. C., and Mattingly, C. J. (2009). Comparative toxicogenomics
database: A knowledgebase and discovery tool for chemical-gene-disease
networks. Nucleic Acids Res. 37(Database issue), D786–D792.
SUPPLEMENTARY DATA
Dix, D. J., Houck, K. A., Martin, M. T., Richard, A. M., Setzer, R. W., and
Kavlock, R. J. (2007). The ToxCast program for prioritizing toxicity testing
of environmental chemicals. Toxicol. Sci. 95, 5–12.
Supplementary data are available online at http://toxsci.
oxfordjournals.org/.
FUNDING
NCSU/EPA Cooperative Training Program in Environmental Sciences Research, Training Agreement (CT833235-01-0)
with North Carolina State University supported C.R.W.
ACKNOWLEDGMENTS
We would like to thank Drs Jennifer Fostel (CEBS), Chihae
Yang (FDA Center for Food Safety and Nutrition), David Dix
(EPA), and William Ward (EPA) for helpful comments and
suggestions in review of this manuscript. This work was
carried out by C.R.W. as part of a graduate research project
within the Bioinformatics Program at North Carolina State
University; thesis is publicly accessible at http://www.lib.
ncsu.edu/theses/available/etd-12112008-214342/.
REFERENCES
Ball, C., Brazma, A., Causton, H., Chervitz, S., Edgar, R., Hingamp, P.,
Matese, J. C., Icahn, C., Parkinson, H., Quackenbush, J., et al. (2004).
Microarray Gene Expression Data (MGED) Society. Standards for microarray data: An open letter. Environ. Health Perspect. 112, A666–A667.
Barrett, T., Troup, D. B., Wilhite, S. E., Ledoux, P., Rudnev, D.,
Evangelista, C., Kim, I. F., Soboleva, A., Tomashevsky, M., and
Edgar, R. (2007). NCBI GEO: Mining tens of millions of expression
profiles—Database and tools update. Nucleic Acids Res. 35(Database issue),
D760–D765.
Barrett, T., and Edgar, R. (2006). Gene expression omnibus: Microarray data
storage, submission, retrieval, and analysis. Methods Enzymol. 411,
352–369.
Boyle, J. (2005). Gene-Expression Omnibus integration and clustering tools in
SeqExpress. Bioinformatics 21, 2550–2551.
Brazma, A., Kapushesky, M., Parkinson, H., Sarkans, U., and Shojatalab, M.
(2006). Data storage and analysis in ArrayExpress. Methods Enzymol. 411,
370–386.
Burgoon, L. D. (2007). Clearing the standards landscape: The semantics of
terminology and their impact on toxicogenomics. Toxicol. Sci. 99, 403–412.
Burgoon, L. D., Boutros, P. C., Dere, E., and Zacharewski, T. R. (2006).
dbZach: A MIAME-compliant toxicogenomic supportive relational database.
Toxicol. Sci. 90, 558–568.
Chen, J., Zhao, P., Massaro, D., Clerch, L. B., Almon, R. R., DuBois, D. C.,
Jusko, W. J., and Hoffman, E. P. (2004). The PEPR GeneChip data
warehouse, and implementation of a dynamic time series query tool (SGQT)
with graphical interface. Nucleic Acids Res. 32(Database issue), D578–D581.
Fielden, M. R., and Kolaja, K. L. (2006). The state-of-the-art in predictive
toxicogenomics. Curr. Opin. Drug Discov. Dev. 9, 84–91.
Fostel, J. M. (2008). Towards standards for data exchange and integration and
their impact on a public database such as CEBS (Chemical Effects in
Biological Systems). Toxicol. Appl. Pharmacol. 233, 54–62.
Fostel, J. M., Burgoon, L., Zwickl, C., Lord, P., Corton, J. C., Bushel, P. R.,
Cunningham, M., Fan, L., Edwards, S. W., Hester, S., et al. (2007). Toward
a checklist for exchange and interpretation of data from a toxicology study.
Toxicol. Sci. 99, 26–34.
Fostel, J., Choi, D., Zwickl, C., Morrison, N., Rashid, A., Hasan, A., Bao, W.,
Richard, A., Tong, W., Bushel, P., et al. (2005). Chemical Effects in
Biological Systems—Data dictionary (CEBS-DD): A compendium of terms
for the capture and integration of biological study design description,
conventional phenotypes, and ‘omics data. Toxicol. Sci. 88, 585–601.
Ganter, B., Tugendreich, S., Pearson, C. I., Ayanoglu, E., Baumhueter, S.,
Bostian, K. A., Brady, L., Browne, L. J., Calvin, J. T., Day, G. J., et al.
(2005). Development of a large-scale chemogenomics database to improve
drug candidate selection and to understand mechanisms of chemical toxicity
and action. J. Biotechnol. 119, 219–244.
Gomase, V. S., Tagore, S., and Kale, K. V. (2008). Microarray: An approach
for current drug targets. Curr. Drug Metab. 9, 221–231.
Hamadeh, H. K., Amin, R. P., Paules, R. S., and Afshari, C. A. (2002). An
overview of toxicogenomics. Curr. Issues Mol. Biol. 4, 45–56.
Hayes, K. R., Vollrath, A. L., Zastrow, G. M., McMillan, B. J., Craven, M.,
Jovanovich, S., Rank, D. R., Penn, S., Walisser, J. A., Reddy, J. K., et al.
(2005). EDGE: A centralized resource for the comparison, analysis, and
distribution of toxicogenomic information. Mol. Pharmacol. 67, 1360–1368.
Hirabayashi, Y., and Inoue, T. (2002). Toxicogenomics—A new paradigm of
toxicology and birth of reverse toxicology. Kokuritsu Iyakuhin Shokuhin
Eisei Kenkyusho Hokoku 120, 39–52.
Ivliev, A. E., ‘t Hoen, P. A., Villerius, M. P., den Dunnen, J. T., and Brandt, B. W.
(2008). Microarray retriever: A web-based tool for searching and large scale
retrieval of public microarray data. Nucleic Acids Res. 36, W327–W331.
Larsson, O., and Sandberg, R. (2006). Lack of correct data format and
comparability limits future integrative microarray research. Nat. Biotechnol.
24, 1322–1323.
Martin, M. T., Judson, R. S., Reif, D. M., and Dix, D. J. (2009). Profiling
chemicals based on chronic toxicity results from the U. S. EPA ToxRef
Database. Environ. Health Perspect. 117, 392–399.
Mattes, W. B., Pettit, S. D., Sansone, S. A., Bushel, P. R., and Waters, M. D.
(2004). Database development in toxicogenomics: Issues and efforts.
Environ. Health Perspect. 112, 495–505.
CHEMICAL INDEXING OF TOXICOGENOMICS RESOURCES
Parkinson, H., Kapushesky, M., Shojatalab, M., Abeygunawardena, N.,
Coulson, R., Farne, A., Holloway, E., Kolesnykov, N., Lilja, P., Lukk, M.,
et al. (2007). ArrayExpress—A public database of microarray experiments
and gene expression profiles. Nucleic Acids Res. 35(Database issue),
D747–D750.
Richard, A., Yang, C., and Judson, R. (2008). Toxicity data informatics:
Supporting a new paradigm for toxicity prediction. Toxicol. Mech. Methods
18, 103–118.
Richard, A. M., Gold, L. S., and Nicklaus, M. C. (2006). Chemical structure
indexing of toxicity data on the internet: Moving toward a flat world. Curr.
Opin. Drug Discov. Dev. 9, 314–325.
Rustici, G., Kapushesky, M., Kolesnikov, N., Parkinson, H., Sarkans, U., and
Brazma, A. (2008). Data storage and analysis in ArrayExpress and
expression profiler. Curr. Protoc. Bioinformatics. Chap. 7, Unit 7.13.
Salter, A. H. (2005). Large-scale databases in toxicogenomics. Pharmacogenomics 6, 749–754.
Tateno, Y., and Ikeo, K. (2004). International public gene expression database
(CIBEX) and data submission. Tanpakushitsu Kakusan Koso 49,
2678–2683.
Taylor, C. F., Field, D., Sansone, S. A., Aerts, J., Apweiler, R., Ashburner, M.,
Ball, C. A., Binz, P. A., Bogue, M., Booth, T., et al. (2008). Promoting
coherent minimum reporting guidelines for biological and biomedical
investigations: The MIBBI project. Nat. Biotechnol. 26, 889–896.
Waters, M., Boorman, G., Bushel, P., Cunningham, M., Irwin, R., Merrick, A.,
Olden, K., Paules, R., Selkirk, J., Stasiewicz, S., et al. (2003). Systems
toxicology and the chemical effects in biological systems (CEBS) knowledge
base. Environ. Health Perspect. Toxicogenomics 111, 15–28.
371
Waters, M., Stasiewicz, S., Merrick, B. A., Tomer, K., Bushel, P., Paules, R.,
Stegman, N., Nehls, G., Yost, K. J., Johnson, C. H., et al. (2008).
CEBS—Chemical effects in biological systems: A public data repository
integrating study design and toxicity data with microarray and proteomics
data. Nucleic Acids Res. 36(Database issue), D892–D900.
Wheeler, D. L., Barrett, T., Benson, D. A., Bryant, S. H., Canese, K.,
Chetvernin, V., Church, D. M., Dicuccio, M., Edgar, R., Federhen, S., et al.
(2008). Database resources of the National Center for Biotechnology
Information. Nucleic Acids Res. 36(Database issue), D13–D21.
Williams-Devane, C. R., Wolf, M. A., and Richard, A. M. (2009). DSSTox
Chemical-Index files for exposure-related experiments in ArrayExpress and
Gene Expression Omnibus: Enabling toxico-chemogenomics data linkages.
Bioinformatics 25, 692–694.
Yang, C., Benz, R. D., and Cheeseman, M. A. (2006a). Landscape of current
toxicity databases and database standards. Curr. Opin. Drug Discov. Dev. 9,
124–133.
Yang, C., Hasselgren, C. H., Boyer, S., Arvidson, K., Aveston, S., Diekes, P.,
Benigni, R., Benz, R. D., Contrera, J., Kruhlak, N. L., et al. (2008).
Understanding genetic toxicity through data mining: The process of building
knowledge by integrating multiple genetic toxicity databases. Toxicol. Mech.
Methods 18, 277–295.
Yang, C., Richard, A. M., and Cross, K. P. (2006b). The art of data mining the
minefields of toxicity databases to link chemistry to biology. Curr. Comput.
Aided Drug Design 2, 135–150.
Zhu, Y., Zhu, Y., and Xu, W. (2008). EzArray: A web-based highly automated
Affymetrix expression array data management and analysis system. BMC
Bioinformatics 9, 46.