FORUM
…and associations. However, as these warehouses grow rapidly in size, data storage becomes increasingly risky in that the data will be of variable or unverified quality, leading to false leads and associations.

This has led to the ‘curated’ database, where data is ‘quality assured’ by curators or editors before incorporation (e.g. the Bioknowledge library presented by William Payne, Proteome, Beverly, MA, USA). The advantages of this type of system are rapid access to well-linked datasets and search algorithms, resulting in reduced time spent in library research. However, this too might have the disadvantage of masking subtle features or associations in the primary data, because it is dependent on the human judgement of the editor to decide what is sufficiently important to be incorporated. There might also be many datasets wherein there is not a strong enough linkage between data items (as there is between members of the same gene family or receptor type) for this to be a reality.
Knowledge access

In order to capitalize on systems used for the integration of distributed database systems and knowledge bases, these systems, regardless of their implementation, should support several key functions:

• allow data and knowledge to be effectively combined for research and management functions;
• be able to support decision making (e.g. in effective target identification);
• ensure maximum usage of existing personnel skills and experience (i.e. the ‘human knowledge base’); and
• aid cross-discipline communication and data exchange.
Several systems were presented using a variety of established technologies, such as CORBA (Tim Clark, Millennium Pharmaceuticals, Cambridge, MA, USA), and proprietary integration systems such as the Sequence Retrieval System (SRS; Reinhard Schneider, LION Bioscience AG, Heidelberg, Germany). In principle, both of these systems enable linkage between disparate datasets and access to these sets via several established and novel visualization tools. The inclusion of access to familiar tools (e.g. BLAST or other database-search tools) will be very important in ensuring that researchers readily take up the new system, rather than viewing it as an additional burden they have to master.
Changes in culture

From the presentations made at this meeting, it became clear that there is a considerable culture change under way in the manner in which data is managed in large companies and research bodies. This change encompasses three levels: (1) strategic planning, (2) bioinformatics implementation and (3) the actual research process. At the strategic level, the aim is to allow bioinformatics specialists to implement systems that can easily grow and evolve with the research base. At the bioinformatics level, it involves the implementation of new visualization and analysis tools, while remembering that data integration is not the end point – the end point is the use to which you want to put the data. At the research level, data submission to a central knowledge base must become an easy and regular task. In addition, data submission and mining should be strongly coupled with a culture of the exchange of ideas and feedback with bioinformatics groups, forming a positive feedback loop for future developments.
Mark Strivens
Medical Research Council Mouse Genome Centre, Harwell, Oxon, UK OX11 0RD.
(E-mail: [email protected])

Meeting report

Bioinformatics meets data mining: time to dance?

I have just been reminded of my first high-school dance. Sexes equally divided and fidgeting on opposite sides of the dance floor, each with expectant looks on their faces. Each wanting to couple and tango, but not quite knowing how to make the first move. On each side of the room there are varying levels of maturity, but no-one wants to miss out on something, whatever that something might be. A similar feeling was present at a recent two-day conference on data mining in bioinformatics*. Finally, one brazen soul, usually a male in my day but thankfully now likely to be of either sex, makes a move forward. Slowly everybody joins in and a good time is had by all. The two sides of the room at the European Bioinformatics Institute were the biologists and bioinformaticians on one side and the computer scientists on the other. The ‘brazen soul’ driving us to the dance floor was the promise of being able to analyse the vast amount of gene-expression data being collected worldwide using microarray technology. There will definitely be more dances in the future now that we have got to know our opposite numbers a little better. Part of the growing-up process is learning about your partner, and that was the major focus of the meeting; but let us start with the brazen soul.

*The conference on Data Mining in Bioinformatics was held at the European Bioinformatics Institute, Hinxton, UK, 10–12 November 1999.
Microarray technology as the brazen soul
As Paul Spellman (Stanford University, CA, USA) so aptly showed, we are at a new frontier in biology, and this is wonderfully captured in Fig. 1, which shows the complete expression pattern of yeast, consisting of over 6000 genes, covering all aspects of the cell cycle, sporulation and nutritional variation, as well as stress responses to heat shock and oxidative stress (the latter two being remarkably similar). The dataset contains approximately 2.5 million independent, noisy and in some way correlated observations; a perfect partner for someone interested in data mining.

Figure 1. A gene-expression map of Saccharomyces cerevisiae (Refs 1–3); courtesy of Paul Spellman. Abbreviations: starv, starvation; redox, oxidative stress; MMS, exposure to methylmethane sulfonate; osm, exposure to osmotic shock; spo, sporulation; germ, germination; carb, various carbon sources.
Databases and data mining

Participants on one side of the room were given several lessons in the principles of data mining, which are themselves still evolving within the field of computer science. The opening address by Heikki Mannila (University of Helsinki, Helsinki, Finland) pointed out that data mining is much more than the application of standard database queries. Certainly, the underlying data model is critical to the mining exercise, in terms of both access efficiency and associated data, but there is first the issue of local versus global analysis. A local analysis might cover only one particular protein family within a large corpus, assuming that the data associated with that family can be identified; the methodology (for example, clustering) might not be applied as it would be to a global study of the complete corpus. Data mining in both instances is a process that analyses data to generate descriptive and/or predictive models that can be used to understand patterns or relationships in the data. Associated with this modeling must be a good statistical analysis and, of course, visualization of the findings; all of these were covered at the conference.
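
To make the idea of a descriptive model concrete, the sketch below clusters a genes-by-conditions matrix with plain k-means, the sort of global clustering discussed above. It is a minimal illustration only: the matrix is random stand-in data, and the number of genes, conditions, clusters and iterations are invented rather than taken from any dataset presented at the meeting.

```python
# Minimal sketch of a descriptive model: k-means clustering of a hypothetical
# genes-by-conditions expression matrix. All values are random placeholders.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6000, 80))  # invented shape: 6000 genes x 80 hybridizations

def kmeans(X, k=10, iters=25):
    """Group gene profiles around k centroid profiles (plain Euclidean k-means)."""
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Squared Euclidean distance from every gene profile to every centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members):  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return labels

labels = kmeans(X)
print(np.bincount(labels))  # cluster sizes summarize ('describe') the data
```

In practice, expression studies of this period often favoured correlation-based similarity and hierarchical clustering over Euclidean k-means; the choice here is purely for brevity.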
Algorithms for pattern matching include decision trees, Naïve-Bayes association, K-nearest neighbor and neural networks. These are already familiar to many bioinformaticians, but generally in the context of motif discovery, tandem-repeat finding and so on, associated with DNA sequences rather than with diffuse clusters of array-expression or other data.

Recognized problems in data mining included the lack of use, thus far, of Bayesian approaches, the often temporal nature of the data being studied, how to interpret outliers and the ability to use background knowledge to strengthen the predictive outcome. However, it should be noted that Ron Taylor (National Cancer Institute, National Institutes of Health, Bethesda, MD, USA) described a Bayesian similarity measure applied to different experiments in a large array database.
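
As a concrete footnote to the list of pattern-matching algorithms above, here is a minimal K-nearest-neighbor classifier. It is a generic textbook sketch, not the Bayesian similarity measure that Taylor described; the profiles, the three hypothetical functional classes and the parameter k are all invented for illustration.

```python
# Minimal K-nearest-neighbor sketch: label an unannotated expression profile
# by majority vote among its k most similar annotated profiles. Synthetic data.
import numpy as np

rng = np.random.default_rng(1)
profiles = rng.normal(size=(500, 80))   # hypothetical annotated gene profiles
classes = rng.integers(0, 3, size=500)  # three invented functional classes
query = rng.normal(size=80)             # a profile awaiting annotation

def knn_predict(query, profiles, classes, k=5):
    """Return the majority class among the k profiles nearest to the query."""
    dists = np.linalg.norm(profiles - query, axis=1)  # Euclidean distances
    nearest = classes[np.argsort(dists)[:k]]
    return np.bincount(nearest).argmax()

print(knn_predict(query, profiles, classes))  # predicted class index (0-2)
```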
Given the importance of the underlying database to data mining, several posters and presentations described new integrative approaches for accessing data from different biological sources, with emphasis on linking gene identification to available functional information. Rolf Apweiler (EMBL–EBI, Hinxton, UK) described InterPro, an integrated documentation resource for protein families, domains and functional sites. Importantly, there was a follow-on conference for those interested in standards for the storage of microarray data. As someone involved in applying improved and consistent annotation to 27 years of legacy macromolecular structure data in the Protein Data Bank, I believe it would be useful to get the storage of array data right from the beginning. Already, there might be a divide between what the European Bioinformatics Institute is spearheading and the recent indications from the US National Center for Biotechnology Information.
Finally, there were talks and posters on approaches to the visualization of massive datasets. David Gilbert (City University, London, UK) presented particularly exciting work, beyond the use of unrooted trees, for the display of pairwise comparison data; he and his colleagues use two new three-dimensional (3D) clustering algorithms for the visualization of a 3D space.
The conference (http://industry.ebi.ac.uk/datamining99) highlighted for me that, at the moment, we are stumbling rather than dancing, but I have no doubt that there will be more dances, and new biology will be forthcoming as we learn the steps. As the Kinks say, ‘Come dancing’.

Philip E. Bourne
San Diego Supercomputer Center and Department of Pharmacology, University of California, San Diego, CA 92093, USA.
(E-mail: [email protected])

References
1 DeRisi, J.L. et al. (1997) Exploring the metabolic and genetic control of gene expression on a genomic scale. Science 278, 680–686
2 Chu, S. et al. (1998) The transcriptional program of sporulation in budding yeast. Science 282, 699–705
3 Spellman, P.T. et al. (1998) Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell 9, 3273–3297

Meeting report

Evolution, not revolution
Over the past few years, the field of pharmacogenomics has emerged, bringing together novel techniques that might transform the way in which we go about treating and preventing disease, from the discovery of new drugs to the tailoring of medical therapy to the individual. Pharmacogenomics was the subject of a recent conference*, the programme of which was divided into three sections: (1) examining the latest developments in the pharmacogenomics revolution, (2) understanding the implications of the SNP Consortium and (3) examining the long-term regulatory impact of pharmacogenomics.

*The 2nd Annual Pharmacogenomics Event was held in London, UK, 18–20 January 2000.
Latest developments

Klaus Lindpainter (Roche Genetics, Basel, Switzerland) gave an excellent introductory talk on transforming the future of pharmaceutical R&D. He began by focusing on the current position of the pharmaceutical industry: for many diseases, drugs are either non-existent or ineffective; there are substantial inter-individual differences in drug efficacy; and, worryingly, the incidence of adverse events is significant. It is informative to consider these areas and, in particular, to ask why people respond to drugs in an individual-dependent manner. Günther Heinrich (Epidauros Biotechnologie AG, Bernried, Germany) stressed that the underlying reason for the current problems of drug development and therapy is the genetic diversity of Homo sapiens.
Individual tailoring of health care

Klaus Lindpainter (Roche Genetics) discussed the goal of pharmacogenomics: the tailoring of medicine to the individual, noting that this system would be probabilistic rather than deterministic. Particularly for complex diseases, many gene variants will be involved, in addition to environmental factors. One point in particular prompted the title of this report: more genetic testing will not provide a paradigm shift; there will be no quantum leap, but incremental progress will hopefully be made. The next few years will see pharmacogenomics undergoing evolution, rather than a revolution.

Importantly, issues regarding the public were discussed, and it is essential that these are addressed. They include the widespread concern that genetic information must be used appropriately for the benefit of humankind, and questions of confidentiality and ownership, particularly the need for a legal framework to protect individuals and to enable the legitimate and beneficial use of genetic information.
George Poste (SmithKline Beecham, UK) also discussed the evolution of rational health care, that is, the design of increasingly rational therapeutics focused on the genetic background of the patient (recognizing the effect of individual variation on the response to therapy), and preventive treatment involving pre-symptomatic and pre-dispositional counselling (noting that these are probabilistic, not absolute, risks). He emphasized the importance of regulatory issues, the inadequate scale of genetic counselling and the issue of discrimination in insurance and employment. He forecast that individual information will be contained on ‘smart cards’, and that a convergence between medicine and computing is urgently required.
Gualberto Ruaño (Genaissance Pharmaceuticals, CT, USA) presented the role of Genaissance Pharmaceuticals in connecting genetics to the response of individuals to clinical therapy. A closer look at the development of clinical trials is required; the role of genetic variability in the success or failure of drugs progressing from Phase II to Phase III of clinical trials might previously have been underestimated. Herbert Schuster (Infogen, Berlin-Buch, Germany) discussed the next ‘loop’ after the sequencing of the entire human genome: the identification of potentially clinically relevant novel drug targets and (ultimately) the development of drugs that are efficient, specific, universal, tolerable and free of charge.
The SNP Consortium

This fundamental project was presented by Arthur Holden (the SNP Consortium, USA). Briefly, the SNP Consortium is a non-profit organization funded by the contributions of its members. Its aim is to identify and map human single-nucleotide polymorphisms (SNPs) and to place this database in the public domain. The key objectives include the creation of the highest-quality SNP map available, the identification of a minimum of 300 000 SNPs, the mapping of at least 170 000 SNPs and the maximization of public accessibility. It is an ambitious project involving pharmaceutical companies, academic centres and charities.
Amalgamation of specific skills

An interesting session was held on the importance of alliances and collaborations in securing the integration of pharmacogenomics into drug-development programmes. In his talk, Michael Murphy (Pharmacogenomics Services, La Jolla, CA, USA) presented a case study that clearly demonstrated the result of pharmaceutical companies liaising with pharmacogenomics specialists. Claire Allan (Glaxo Wellcome, UK) continued on this theme, discussing