Supervised and unsupervised data mining techniques for the life sciences
Susie Stephens* & Pablo Tamayo#
*Oracle and #Whitehead Institute/MIT, USA
Advances in technology and automation in life sciences research have caused a significant increase
in data volumes. Scientists need to adopt new in silico techniques to extract maximal knowledge and
make better use of data. Data Mining is an approach that can help uncover new insights in a variety
of scientific and commercial domains.
Oracle is best known for having a database that can manage
large quantities of data, store a variety of data types including
chemical structures, XML and images, and for being able to
access distributed data whether it is in relational or flat file
format. However, scientists are less aware of the range of analytical capabilities built into the database, which include statistics, online analytical processing, sequence matching and
alignment, text mining, data mining and pattern recognition.
Oracle is providing analytics inside the database to allow users
to build tightly integrated analytical pipelines, which also
enables the data to remain in a secure environment, and eliminates the performance overhead of moving large data sets.
This article will focus on results that Oracle Data Mining
Technology has obtained in the life sciences using supervised
(Naïve Bayes, Adaptive Bayes Networks and Attribute
Importance) and unsupervised learning techniques (Association
Rules, Hierarchical k-means Clustering and O-Cluster). With
appropriate data preparation and strong algorithms, data mining
can produce relevant analysis results and provide novel insights
of high scientific value.
In unsupervised learning, or clustering, the goal of the analysis is to
uncover trends, correlations, or patterns, and no assumptions are made
about the structure of the data. In this context, data mining algorithms
are used to find clusters based on multiple scenarios, such as how close
a set of biological samples are to each other under a correlation,
distance or similarity function. For example, if data are collected about
various genes that are expressed in tumor samples, unsupervised learning
algorithms can cluster the samples into groups based on the similarity of
their aggregate expression profiles.
This technique has the advantage of uncovering unanticipated relationships or known phenotypes of functional groups.
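The idea of grouping samples by the similarity of their expression profiles can be illustrated with a minimal sketch. It is not Oracle Data Mining's Hierarchical k-means or O-Cluster implementation. It assumes a generic average-linkage agglomerative clustering with distance defined as 1 minus the Pearson correlation, and the sample names and expression values are invented for illustration.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles.

    Assumes neither profile is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def cluster_samples(profiles, k):
    """Merge the two closest clusters until only k clusters remain.

    Distance between samples is 1 - Pearson correlation; distance between
    clusters is the average over all cross-cluster sample pairs.
    """
    names = list(profiles)
    dist = {
        frozenset((a, b)): 1.0 - pearson(profiles[a], profiles[b])
        for a, b in combinations(names, 2)
    }
    clusters = [[name] for name in names]

    def linkage(c1, c2):
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > k:
        # combinations() yields i < j, so deleting index j leaves i valid.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]


# Hypothetical expression values for five genes in four tumor samples.
profiles = {
    "tumor_A": [9.1, 8.7, 1.2, 0.9, 5.0],
    "tumor_B": [8.8, 9.0, 1.0, 1.3, 5.2],
    "tumor_C": [1.1, 0.8, 9.3, 8.9, 4.9],
    "tumor_D": [0.9, 1.4, 8.8, 9.2, 5.1],
}
groups = cluster_samples(profiles, k=2)
print(groups)
```

In this toy data, samples A and B share one expression pattern and C and D share the opposite one, so the two groups fall out of the correlation structure alone, without any class labels.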
Figure 1. Glutathione S-transferase expression in ALL/AML leukemia. (Plot labels: ALL, AML, Sample.)
34
www.currentdrugdiscovery.com
June 2003
Once the statistical significance of a cluster has been determined, the results frequently become the starting point for further validation and research. Clustering is very useful, but its results have to be interpreted before biological importance can be assigned to them. We will now survey some applications of Oracle Data Mining in a variety of molecular pattern recognition problems.
The Hierarchical k-means Clustering algorithm was used to identify hematopoietic differentiation in HL-60 cells by analyzing gene expression profiles of 6000 genes in a time series experiment [1]. This experiment was chosen because differentiation is predominantly controlled at the transcription level, and blocks in the developmental program likely underlie the pathogenesis of leukemia. Using this approach, it was possible to identify genes that were upregulated, downregulated, and intermediately induced.
In supervised learning, or class prediction, knowledge of a particular domain is used to help make distinctions of interest. In life sciences, analyses tend to involve selecting the features most correlated with a phenotypic distinction. The features are then used as the input to a classification algorithm that uses known sample labeling to build a model, so that future unknown samples can be classified. For example, a model could be built to identify which sub-type of cancer a patient has based upon a subset of expressed genes that distinguish the cancer types of interest. Supervised learning classifiers can be very accurate in molecular classification, especially if a large number of high quality samples are used to train the model.
In this context, Adaptive Bayes Networks were used to identify new classes of leukemia by analyzing gene expression data from DNA microarrays with 7129 genes [2]. Trained pathologists are able to distinguish between acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML), but there is no general approach for assigning tumors to known classes. The model was built by training it with 38 samples, and the test set contained 35 samples. The algorithm was able to distinguish between AML and ALL (Figure 1). In a similar fashion, Support Vector Machines have been used to accurately classify primary tumors into 14 distinct types [3].
One challenge of supervised learning is the identification of the best features to input into the classifier. Oracle Data Mining supports feature selection through its Attribute Importance facility. In the above example, Attribute Importance can be used to select the subset of genes that would most likely be useful in discriminating the types of cancer. This process is usually considered to be a pre-processing step in classification, because most supervised algorithms work optimally with smaller data sets. In addition, the classification algorithms can internally select the variables of higher predictive value.
Data Mining tools help scientists to extract trends, patterns and knowledge from the many rapidly expanding and diverse data sets. The comprehensive range of embedded algorithms in the Oracle Data Mining solution enables researchers to discover new insights in their data in an integrated, powerful, secure and scalable environment provided by the Oracle RDBMS. When the capabilities of Oracle Data Mining are combined with the ability of the RDBMS to access, pre-process, retrieve and analyze data, as well as share results, they create a very powerful platform for data analysis and knowledge management in life sciences.
Susie Stephens
Oracle Corp
10 Van de Graaff Drive
Burlington
MA 01803
USA
Email: [email protected]
URL: http://otn.oracle.com/industries/life_sciences
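The supervised workflow described in this article, in which the genes most correlated with a class distinction are selected first and a classifier is then trained on labeled samples, can be sketched as follows. This is an illustration under stated assumptions, not Oracle's Attribute Importance or Adaptive Bayes Network implementation: the gene ranking uses a simple signal-to-noise score as a stand-in, the classifier is a generic Gaussian Naive Bayes, and the expression values are invented.

```python
from math import log, pi
from statistics import mean, pvariance


def rank_genes(samples, labels, class_a, class_b):
    """Rank gene indices by |mean_a - mean_b| / (sd_a + sd_b), best first.

    A stand-in feature-selection score, not Oracle's Attribute Importance.
    """
    scores = []
    for g in range(len(samples[0])):
        a = [s[g] for s, y in zip(samples, labels) if y == class_a]
        b = [s[g] for s, y in zip(samples, labels) if y == class_b]
        noise = pvariance(a) ** 0.5 + pvariance(b) ** 0.5 + 1e-9
        scores.append((abs(mean(a) - mean(b)) / noise, g))
    return [g for _, g in sorted(scores, reverse=True)]


def train_naive_bayes(samples, labels, genes):
    """Fit class priors and per-gene Gaussian parameters on selected genes."""
    model = {}
    for cls in set(labels):
        rows = [[s[g] for g in genes] for s, y in zip(samples, labels) if y == cls]
        cols = list(zip(*rows))
        model[cls] = {
            "prior": len(rows) / len(samples),
            "mean": [mean(c) for c in cols],
            # Small floor keeps the variance strictly positive.
            "var": [pvariance(c) + 1e-3 for c in cols],
        }
    return model


def predict(model, sample, genes):
    """Return the class label with the highest log posterior."""
    x = [sample[g] for g in genes]

    def log_posterior(params):
        log_lik = sum(
            -0.5 * log(2 * pi * v) - (xi - m) ** 2 / (2 * v)
            for xi, m, v in zip(x, params["mean"], params["var"])
        )
        return log(params["prior"]) + log_lik

    return max(model, key=lambda cls: log_posterior(model[cls]))


# Hypothetical training data: four genes measured in six labeled samples.
train = [
    [8.0, 5.1, 1.0, 4.9],
    [7.5, 4.8, 1.4, 5.2],
    [8.3, 5.0, 0.8, 5.0],
    [2.0, 5.2, 6.9, 5.1],
    [2.4, 4.9, 7.3, 4.8],
    [1.8, 5.0, 7.1, 5.0],
]
labels = ["ALL", "ALL", "ALL", "AML", "AML", "AML"]

top_genes = rank_genes(train, labels, "ALL", "AML")[:2]
model = train_naive_bayes(train, labels, top_genes)
print(top_genes, predict(model, [7.9, 5.0, 1.1, 5.0], top_genes))
```

Selecting genes before training mirrors the pre-processing role the article assigns to feature selection: the classifier sees only the two genes that separate the classes in this toy data, rather than all four.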
1. Tamayo P et al (1999) Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation. Proceedings of the National Academy of Sciences 96:2907-2912.
2. Golub T et al (1999) Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286:531-537.
3. Ramaswamy S et al (2001) Multi-class cancer diagnosis using tumor gene expression signatures. Proceedings of the National Academy of Sciences 98:15149-15154.