Download Toward Deciphering the Knowledge Encrypted in Large Datasets

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Protein moonlighting wikipedia , lookup

Nuclear magnetic resonance spectroscopy of proteins wikipedia , lookup

List of types of proteins wikipedia , lookup

Proteomics wikipedia , lookup

Protein mass spectrometry wikipedia , lookup

Transcript
Editorial
Toward Deciphering the Knowledge Encrypted
in Large Datasets
Alma L. Burlingame
In the discussions that took place during formulation of the
charter of this journal, it was clear that the research community had embarked on a challenging quest to identify and
describe the totality of cellular protein expression in unprecedented detail. In addition, interest in exploring global studies
of quantitative changes in levels of expression or posttranslational status would require development of new methodologies, but such a focus would provide even deeper insight
into factors that affect physiology and phenotype and normal
or aberrant function. It was presumed that particular aspects
of such knowledge would be revealed quickly driven by powerful mass spectrometric technologies that exploit the growing scale of information provided by the encyclopedia of
genomic information.
Many recent forays are being guided by strategies based on
so-called “discovery science” gathered by advanced technologies that possess unprecedented power to scrutinize complex systems in precise molecular detail. Use of such revolutionary tools are producing global measurements of message
levels and ensembles of protein translation products. Thus,
very large amounts of rather precise new information on gene
expression and its modulation are accumulating in a growing
number of laboratories. The challenges focus on deciphering
the multi-functional, network, and even phenotypic contexts
embedded in these data.
A sobering characteristic of large-scale measurements of
gene expression is the present inability of any single laboratory to understand or fully comprehend information gathered
on such comprehensive scale. Hence, much of the information content implicit in such large datasets remains cryptic
and must await future mining endeavors to “reveal” the full
biological meaning. One strategy that might accelerate the
process of understanding would be to stimulate the participation of previously unidentified interested parties. This could
be accomplished by making these datasets accessible per se
together with the results deduced from the group that carried
out the original experiments. In so doing, one should bear in
mind that such data sets will become rapidly “out-dated” by
the ongoing expansion and improvement of sequence databases and the analytical tools available for their interpretation.
Thus, in order to fully realize the potential of publicly sharing
© 2003 by The American Society for Biochemistry and Molecular Biology, Inc.
This paper is available on line at http://www.mcponline.org
a growing base of large scale datasets, such datasets must
not only be amenable to inquiry with newer analytical tools as
they become available, but that some form of statistical, probability-based, or otherwise objective and scientifically validated determination of “confidence” levels be applied to the
results. This is vital to allow for the drawing of meaningful
biological information from the data. Perhaps more importantly, this will also allow for the comparison of related datasets produced by other research groups as they become
available. This will facilitate deriving yet more meaningful biological information not apparent from the individual datasets.
Thus, by provision of accessibility to such suitably annotated
data, the discovery of additional meaning and insight not
recognized by the original scientific work may proceed.
It is our intention: to articulate the nature of problems associated with the quality of such datasets and suitability of
current tools to deal with them; to facilitate the further mining
of such datasets; and to foster critical discussion of these
topics and issues within the community. Molecular & Cellular
Proteomics is committed to providing: 1) community access
to current research examples of these complex research
problems, 2) a forum for articulation and adjudication of technical and scientific issues surrounding large-scale biological
research, and 3) a vehicle for formulation of standards through
community participation, critique, and consensus. To this
end, the publication in this issue of one such dataset, along
with the original raw data used to generate it (1, 2), represents
a first step toward realizing our intention.
REFERENCES
1. von Haller, P. D., Yi, E., Donohoe, S., Vaughn, K., Keller, A., Nesvizhskii,
A. I., Eng, J., Li, X., Goodlett, D. R., Aebersold, R., and Watts, J. D. (2003)
The application of new software tools to quantitative protein profiling via
ICAT and tandem mass spectrometry: I. Statistically annotated datasets
for peptide sequences and proteins identified via the application of ICAT
and tandem mass spectrometry to proteins co-purifying with T cell lipid
rafts. Mol. Cell. Proteomics 2, 426 – 427
2. von Haller, P. D., Yi, E., Donohoe, S., Vaughn, K., Keller, A., Nesvizhskii,
A. I., Eng, J., Li, X., Goodlett, D. R., Aebersold, R., and Watts, J. D. (2003)
The application of new software tools to quantitative protein profiling via
ICAT and tandem mass spectrometry: II. Evaluation of tandem mass
spectrometry methodologies for large-scale protein analysis, and the
application of statistical tools for data analysis and interpretation. Mol.
Cell. Proteomics 2, 428 – 442
Molecular & Cellular Proteomics 2.7
425