and associations. However, as these warehouses grow rapidly in size, data storage becomes increasingly risky in that the data will be of variable or unverified quality, leading to false leads and associations. This has led to the 'curated' database, where data is 'quality assured' by curators or editors before incorporation (e.g. the Bioknowledge library presented by William Payne, Proteome, Beverly, MA, USA). The advantages of this type of system are rapid access to well-linked datasets and search algorithms, resulting in less time spent in library research. However, this too might have the disadvantage of masking subtle features or associations in the primary data, because it depends on the human judgement of the editor to decide what is sufficiently important to be incorporated. There might also be many datasets in which the linkage between data items is not strong enough (as it is between members of the same gene family or receptor type) for this to be a reality.

Knowledge access
In order to capitalize on systems used for the integration of distributed database systems and knowledge bases, these systems, regardless of their implementation, should support several key functions:
• allow data and knowledge to be effectively combined for research and management functions;
• support decision making (e.g. in effective target identification);
• ensure maximum usage of existing personnel skills and experience (i.e. the 'human knowledge base'); and
• aid cross-discipline communication and data exchange.
Several systems were presented using a variety of established technologies, such as CORBA (Tim Clark, Millennium Pharmaceuticals, Cambridge, MA, USA), and proprietary integration systems such as the Sequence Retrieval System (SRS; Reinhard Schneider, LION Bioscience AG, Heidelberg, Germany). In principle, both these systems enable linkage between disparate datasets and access to these sets via several established and novel visualization tools, as the sketch below illustrates. The inclusion of access to familiar tools (e.g. BLAST or other database-search tools) will be very important in ensuring that researchers readily take up a new system rather than viewing it as an additional burden they have to master.
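As a purely illustrative sketch of this pattern, and not a description of any system presented at the meeting, the following Python fragment shows the general shape of such an integration layer: one query dispatched to several registered sources behind a common search interface, with provenance retained so that hits can be cross-linked. Every class, source and record name in it is hypothetical.

```python
# Hypothetical sketch of a federated-retrieval layer; not the API of any
# real system (CORBA-based or SRS) mentioned above.
from typing import Callable, Dict, List


class FederatedQuery:
    """Dispatch one query to several registered data sources and merge the hits."""

    def __init__(self) -> None:
        self.sources: Dict[str, Callable[[str], List[dict]]] = {}

    def register(self, name: str, search_fn: Callable[[str], List[dict]]) -> None:
        # Each source exposes the same search interface, hiding its own schema.
        self.sources[name] = search_fn

    def search(self, term: str) -> List[dict]:
        hits: List[dict] = []
        for name, fn in self.sources.items():
            for record in fn(term):
                record["source"] = name  # keep provenance for cross-linking
                hits.append(record)
        return hits


# Toy usage with two stand-in sources:
fq = FederatedQuery()
fq.register("sequence_db", lambda t: [{"id": "P12345", "match": t}])
fq.register("literature_db", lambda t: [{"pmid": "10000000", "match": t}])
print(fq.search("kinase"))
```

The design point is simply that each source hides its own schema behind a shared interface, which is roughly what CORBA wrappers and SRS indexing aim to achieve in their different ways.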
Changes in culture
From the presentations made at this meeting, it became clear that a considerable culture change is under way in the manner in which data is managed in large companies and research bodies. This change encompasses three levels: (1) strategic planning, (2) bioinformatics implementation and (3) the actual research process. At the strategic level, bioinformatics specialists must be allowed to implement systems that can easily grow and evolve with the research base. At the bioinformatics level, new visualization and analysis tools must be implemented, while remembering that data integration is not the end point – the end point is the use to which you want to put the data. At the research level, data submission to a central knowledge base must become an easy and regular task. In addition, data submission and mining should be strongly coupled with a culture of exchanging ideas and feedback with bioinformatics groups, forming a positive feedback loop for future developments.

Mark Strivens
Medical Research Council Mouse Genome Centre, Harwell, Oxon, UK OX11 0RD.
(E-mail: [email protected])

Meeting report
Bioinformatics meets data mining: time to dance?

I have just been reminded of my first high-school dance. Sexes equally divided and fidgeting on opposite sides of the dance floor, each with expectant looks on their faces. Each wanting to couple and tango, but not quite knowing how to make the first move. On each side of the room there are varying levels of maturity, but no-one wants to miss out on something, whatever that something might be. A similar feeling was present at a recent two-day conference on data mining in bioinformatics*. Finally, one brazen soul, usually a male in my day but thankfully now likely to be of either sex, makes a move forward. Slowly everybody joins in and a good time is had by all. The two sides of the room at the European Bioinformatics Institute were the biologists and bioinformaticians on one side and the computer scientists on the other. The 'brazen soul' driving us to the dance floor was the promise of being able to analyse the vast amount of gene-expression data being collected worldwide using microarray technology. There will definitely be more dances in the future now that we have got to know our opposite numbers a little better. Part of the growing-up process is learning about your partner, and that was the major focus of the meeting; but let us start with the brazen soul.

*The conference on Data Mining in Bioinformatics was held at the European Bioinformatics Institute, Hinxton, UK, 10–12 November 1999.

Microarray technology as the brazen soul
As Paul Spellman (Stanford University, CA, USA) so aptly showed, we are at a new frontier in biology, and this is wonderfully captured in Fig. 1. Figure 1 shows the complete expression pattern of yeast, consisting of over 6000 genes, covering all aspects of the cell cycle, sporulation and nutritional variation, as well as stress responses to heat shock and oxidative stress (the latter two being remarkably similar). The dataset contains approximately 2.5 million independent, noisy and, in some ways, correlated observations; a perfect partner for someone interested in data mining.

Figure 1. A gene-expression map of Saccharomyces cerevisiae (Refs 1–3); courtesy of Paul Spellman. Abbreviations: starv, starvation; redox, oxidative stress; MMS, exposure to methylmethane sulfonate; osm, exposure to osmotic shock; spo, sporulation; germ, germination; carb, various carbon sources.

Databases and data mining
Participants on one side of the room were given several lessons in the principles of data mining, which are themselves still evolving within the field of computer science. The opening address by Heikki Mannila (University of Helsinki, Helsinki, Finland) pointed out that data mining is much more than the application of standard database queries. Certainly, the underlying data model is critical to the mining exercise, in terms of both access efficiency and associated data, but there is first the issue of local versus global analysis. A local analysis might cover only one particular protein family within a large corpus, assuming that the data associated with that family can be identified; a methodology such as clustering might not then be applied as it would be to a global study of the complete corpus. In both instances, data mining is a process that analyses data to generate descriptive and/or predictive models that can be used to understand patterns or relationships in the data.
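To make the idea of a descriptive model concrete, here is a minimal sketch, on synthetic data, of the kind of analysis routinely applied to such expression matrices: hierarchical clustering of gene-expression profiles by correlation distance, in the spirit of the yeast analyses cited in Fig. 1. The library calls (NumPy/SciPy) are a modern convenience for illustration, not tools presented at the meeting.

```python
# Illustrative only: hierarchical clustering of a toy expression matrix.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
expression = rng.normal(size=(200, 18))  # 200 genes x 18 time points (synthetic)

# Correlation distance: genes with similar expression profiles cluster together,
# regardless of absolute expression level.
distances = pdist(expression, metric="correlation")
tree = linkage(distances, method="average")

# Cut the tree into a fixed number of clusters and report their sizes.
labels = fcluster(tree, t=6, criterion="maxclust")
for k in sorted(set(labels)):
    print(f"cluster {k}: {np.sum(labels == k)} genes")
```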
Associated with this modelling must be good statistical analysis and, of course, visualization of the findings; all of these were covered at the conference. Algorithms for pattern matching include decision trees, naïve Bayes, k-nearest neighbours and neural networks, which are already familiar to many bioinformaticians, but generally in the context of motif discovery, tandem-repeat finding and so on, associated with DNA sequences rather than diffuse clusters of array-expression or other data.
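As a small worked example of one of these pattern-matching methods, the sketch below applies k-nearest neighbours to expression profiles rather than sequence motifs. The two-class setup and the use of scikit-learn are illustrative assumptions, not anything reported at the conference.

```python
# Illustrative only: k-nearest-neighbour classification of synthetic profiles.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
# Two toy classes of genes with shifted expression profiles (100 genes, 18 points).
profiles = np.vstack(
    [rng.normal(0.0, 1.0, (50, 18)), rng.normal(1.0, 1.0, (50, 18))]
)
labels = np.array([0] * 50 + [1] * 50)

X_train, X_test, y_train, y_test = train_test_split(
    profiles, labels, test_size=0.3, random_state=0
)
# The class of an unseen gene is decided by a vote of its 5 nearest profiles.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print(f"held-out accuracy: {knn.score(X_test, y_test):.2f}")
```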
Recognized problems in data mining included the lack of use, thus far, of Bayesian approaches, the often temporal nature of the data being studied, how to interpret outliers and how to use background knowledge to strengthen the predictive outcome. It should be noted, however, that Ron Taylor (National Cancer Institute, National Institutes of Health, Bethesda, MD, USA) described a Bayesian similarity measure applied to different experiments in a large array database.

Given the importance of the underlying database to data mining, several posters and presentations described new integrative approaches for accessing data from different biological sources, with emphasis on linking gene identification to available functional information. Rolf Apweiler (EMBL–EBI, Hinxton, UK) described InterPro, an integrated documentation resource for protein families, domains and functional sites. Importantly, there was a follow-on conference for those interested in standards for the storage of microarray data. As someone involved in applying improved and consistent annotation to 27 years of legacy macromolecular-structure data in the Protein Data Bank, I believe it would be useful to get the storage of array data right from the beginning. Already, there might be a divide between what the European Bioinformatics Institute is spearheading and the recent indications from the US National Center for Biotechnology Information.

Finally, there were talks and posters on approaches to the visualization of massive datasets. David Gilbert (City University, London, UK) presented particularly exciting work, beyond the use of unrooted trees, for the display of pairwise comparison data; he and his colleagues use two new three-dimensional (3D) clustering algorithms for visualization in a 3D space.

The conference (http://industry.ebi.ac.uk/datamining99) highlighted for me that, at the moment, we are stumbling rather than dancing, but I have no doubt that there will be more dances, and new biology will be forthcoming as we learn the steps. As the Kinks say, 'Come dancing'.

Philip E. Bourne
San Diego Supercomputer Center and Department of Pharmacology, University of California, San Diego, CA 92093, USA.
(E-mail: [email protected])

References
1 DeRisi, J.L. et al. (1997) Exploring the metabolic and genetic control of gene expression on a genomic scale. Science 278, 680–686
2 Chu, S. et al. (1998) The transcriptional program of sporulation in budding yeast. Science 282, 699–705
3 Spellman, P.T. et al. (1998) Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell 9, 3273–3297

Meeting report
Evolution, not revolution

Over the past few years, the field of pharmacogenomics has emerged, bringing together novel techniques that might transform the way in which we go about treating and preventing disease, from the discovery of new drugs to the tailoring of medical therapy to the individual. Pharmacogenomics was the subject of a recent conference*, the programme of which was divided into three sections: (1) examining the latest developments in the pharmacogenomics revolution, (2) understanding the implications of the SNP Consortium and (3) examining the long-term regulatory impact of pharmacogenomics.

*The 2nd Annual Pharmacogenomics Event was held in London, UK, 18–20 January 2000.

Latest developments
Klaus Lindpainter (Roche Genetics, Basel, Switzerland) gave an excellent introductory talk on transforming the future of pharmaceutical R&D. He began by focusing on the current position of the pharmaceutical industry: for many diseases there are either no drugs or only ineffective ones, there are substantial inter-individual differences in drug efficacy and, worryingly, the incidence of adverse events is significant. It is informative to consider these areas and, in particular, to ask why people respond to drugs in an individual-dependent manner. Günther Heinrich (Epidauros Biotechnologie AG, Bernried, Germany) stressed that the underlying reason for the current problems of drug development and therapy is the genetic diversity of Homo sapiens.

Individual tailoring of health care
Klaus Lindpainter (Roche Genetics) discussed the goal of pharmacogenomics: the tailoring of medicine to the individual, noting that this system would be probabilistic rather than deterministic. Particularly for complex diseases, many gene variants will be involved, in addition to environmental factors. One point in particular prompted the title of this report: more genetic testing will not provide a paradigm shift, and there will be no quantum leap, but incremental progress will hopefully be made; the next few years will see pharmacogenomics undergoing evolution rather than revolution. Importantly, the issues regarding the public were discussed, and it is essential that these are focused on. They include the widespread concern that genetic information must be used appropriately for the benefit of humankind, and questions of confidentiality and ownership, particularly with regard to the necessity of a legal framework to protect individuals and to enable the legitimate and beneficial use of genetic information.

George Poste (SmithKline Beecham, UK) also discussed the evolution of rational health care: the design of increasingly rational therapeutics focused on the genetic background of the patient (recognizing the effect of individual variation on the response to therapy), and preventive treatment involving presymptomatic and pre-dispositional counselling (noting that these are probabilistic, not absolute, risks), with emphasis on the importance of regulatory issues, the inadequate scale of genetic counselling and the issue of discrimination in insurance and employment. He forecast that individual information will be contained on 'smart cards', and that a convergence between medicine and computing is urgently required.

Gualberto Ruaño (Genaissance Pharmaceuticals, CT, USA) presented the role of Genaissance Pharmaceuticals in connecting genetics to the response of individuals to clinical therapy. A closer look at the development of clinical trials is required; the role of genetic variability in the success or failure of drugs to progress from Phase II to Phase III of clinical trials might previously have been underestimated.
Herbert Schuster (Infogen, Berlin-Buch, Germany) discussed the next 'loop' after the sequencing of the entire human genome: the identification of potentially clinically relevant novel drug targets and, ultimately, the development of drugs that are efficient, specific, universal, tolerable and free of charge.

The SNP Consortium
This fundamental project was presented by Arthur Holden (the SNP Consortium, USA). Briefly, the SNP Consortium is a non-profit organization funded by the contributions of its members. Its aim is to identify and map human single-nucleotide polymorphisms (SNPs) and to place this database in the public domain. The key objectives include the creation of the highest-quality SNP map available, the identification of a minimum of 300 000 SNPs, the mapping of at least 170 000 SNPs and the maximization of public accessibility. It is an ambitious project involving pharmaceutical companies, academic centres and charities.

Amalgamation of specific skills
An interesting session was held on the importance of alliances and collaborations in securing the integration of pharmacogenomics into drug-development programmes. In his talk, Michael Murphy (Pharmacogenomics Services, La Jolla, CA, USA) presented a case study that clearly demonstrated the result of pharmaceutical companies liaising with pharmacogenomics specialists. Claire Allan (Glaxo Wellcome, UK) continued on this theme, discussing