Download Supervised and unsupervised data mining techniques for

CDD3-6_Advertorial.qxd 05/06/2003 13:47 Page 2 Supervised and unsupervised data mining techniques for the life sciences Susie Stephens* & Pablo Tamayo#* and #Whitehead Institute/MIT, USA *Oracle Advances in technology and automation in life sciences research have caused a significant increase in data volumes. Scientists need to adopt new in silico techniques to extract maximal knowledge and make better use of data. Data Mining is an approach that can help uncover new insights in a variety of scientific and commercial domains. Oracle is best known for having a database that can manage large quantities of data, store a variety of data types including chemical structures, XML and images, and for being able to access distributed data whether it is in relational or flat file format. However, scientists are less aware of the range of analytical capabilities built into the database, which include statistics, online analytical processing, sequence matching and alignment, text mining, data mining and pattern recognition. Oracle is providing analytics inside the database to allow users to build tightly integrated analytical pipelines, which also enables the data to remain in a secure environment, and eliminates the performance overhead of moving large data sets. This article will focus on results that Oracle Data Mining Technology has obtained in the life sciences using supervised (Naïve Bayes, Adaptive Bayes Networks and Attribute Importance) and unsupervised learning techniques (Association Rules, Hierarchical k-means Clustering and O-Cluster). With appropriate data preparation and strong algorithms, data mining can produce relevant analysis results and provide novel insights of high scientific value. In unsupervised learning, or clustering, the goal of the analyses is to uncover trends, correlations, or patterns, and no assumptions are made about the structure of the data. In this context, data mining algorithms are used to find clusters based, on multiple scenarios, such as how close a set of biological samples are to each other using a correlation, distance or similarity function. For example, if data is collected about various genes that are expressed in tumor samples, unsupervised learning algorithms can cluster the samples into groups based on the similarity of their aggregate expression profiles. This technique has the advantage of uncovering unanticipated relationships or known phenotypes of functional groups. ALL p AML Sample Figure 1. Glutathione S-transferase expression in ALL/AML leukemia. 34 www.currentdrugdiscovery.com June 2003 ORACLE_AD.qxd 03/06/2003 12:18 Page 1 For FREE information Write in CD36_35 CDD3-6_Advertorial.qxd 05/06/2003 13:47 Page 4 One challenge of supervised learning is the identification of Once the statistical significance of a cluster has been deterthe best features to input into the classifier. Oracle Data Mining mined, the results frequently become the starting point for supports feature selection through its Attribute Importance further validation and research. Clustering is very useful but its facility. In the above example, Attribute Importance can be used results have to be interpreted to be able to assign them biological importance. We will now survey some applications of to select the subset of genes that would most likely be useful in discriminating the types of cancer. This process is usually conOracle Data Mining in a variety of molecular pattern recognition problems. sidered to be a pre-processing step in classification, because most supervised algorithms work optimally with smaller data The Hierarchical k-means Clustering algorithm was used to sets. In addition, the classification algorithms can internally identify hematopoietic differentiation in HL-60 cells by analyzselect the variables or ing gene expression profiles of higher predictive values. 6000 genes in a time series Oracle Data Mining solution enables experiment1. This experiment Data Mining tools help researchers to discover new insights in their scientists to extract trends, was chosen because differenpatterns and knowledge tiation is predominantly condata in an integrated, powerful, secure and from the many rapidly trolled at the transcription scalable environment... expanding and diverse data level, and blocks in the sets. The comprehensive developmental program likely range of embedded algorithms in the Oracle Data Mining soluunderlie the pathogenesis of leukemia. Using this approach, it tion enables researchers to discover new insights in their data in was possible to identify genes that were upregulated, downregan integrated, powerful, secure and scalable environment proulated, and intermediately induced. vided by the Oracle RDBMS. When the capabilities of Oracle In supervised learning or class prediction, knowledge of a Data Mining are combined with the ability of the RDBMS to particular domain is used to help make distinctions of interest. access, pre-process, retrieve and analyze data, as well as share In life sciences, analyses tend to involve selecting the features results, they create a very powerful platform for data analysis most correlated with a phenotypic distinction. The features are and knowledge management in life sciences. then used as the input to a classification algorithm that uses known sample labeling to build a model, so that future unknown samples can be classified. For example, a model could be built to identify which sub-type of cancer a patient has based upon a subset of expressed genes that distinguish the cancer types of interest. Supervised learning classifiers can be Susie Stephens very accurate in molecular classification, especially if a large Oracle Corp number of high quality samples are used to train the model. 10 Van de Graaff Drive In this context, Adaptive Bayes Networks were used to idenBurlington tify new classes of leukemia by analyzing gene expression data 2 MA 01803 from DNA microarrays with 7129 genes . Trained pathologists USA are able to distinguish between lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML), but there is no general approach for assigning tumors to known classes. The model was built by training it with 38 samples, and the test set contained 35 samples. The algorithm was able to distinguish between AML and ALL (Figure 1). In a similar fashion Support Email: [email protected] Vector Machines have been used to accurately classify primary 3 URL: http://otn.oracle.com/industries/life_sciences tumors into 14 distinct types . 1Tamayo P et al (1999) Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation. Proceeds of the National Academy of Science 96:2907-2912. 2Golub T et al (1999) Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286:531-537. 3Ramaswamy S et al (2001) Multi-class cancer diagnosis using tumor gene expression signatures. Proceeds of the National Academy of Science 98:15149-15154. 36 www.currentdrugdiscovery.com June 2003

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Supervised and unsupervised data mining techniques for