UNIVERSITY OF SOUTH AUSTRALIA
Assignment Cover Sheet – Internal
An Assignment cover sheet needs to be included with each assignment. Please complete all details clearly.
If you are submitting the assignment on paper, please staple this sheet to the front of each assignment. If you are
submitting the assignment online, please ensure this cover sheet is included at the start of your document. (This is
preferable to a separate attachment.)
Please check your Course Information Booklet or contact your School Office for assignment submission locations.
Name: Li Yi
Student ID: 100050333
Email: [email protected]
Course code and title: CIS Research Methods INFT 4017
School: CIS
Program code: LHCP
Course Coordinator: Dr. Ivan Lee
Tutor: Dr. Ivan Lee
Day, Time, Location of Tutorial/Practical: 9:10am~11:00am Wednesday, GP1-09, MLK
Due date: 15th June, 2008
Assignment number: Assignment 2a
Assignment topic as stated in Course Information Booklet:
Assignment 2a
Further Information: (e.g. state if extension was granted and attach evidence of approval, Revised Submission Date)
None
I declare that the work contained in this assignment is my own, except where acknowledgement of sources is made.
I authorise the University to test any work submitted by me, using text comparison software, for instances of plagiarism. I
understand this will involve the University or its contractor copying my work and storing it on a database to be used in
future to test work submitted by others.
I understand that I can obtain further information on this matter at
Note: The attachment of this statement on any electronically submitted assignments will be deemed to have the same
authority as a signed statement.
Date: 15th June, 2008
Signed: Li Yi
Date received from student
Recorded:
Assessment/grade
Assessed by:
Dispatched (if applicable):
School of Computer and Information Science
University of South Australia
Building rule-based models in Microarray data
Research Proposal
By
Student Yi LI
Student ID 100050333
Program LHCP
Supervisor Associate Professor Jiuyong LI
Date 15th June 2008
Disclaimer
I declare all of the following to be my own work, unless otherwise referenced, as defined by the
University of South Australia’s policy on plagiarism.
Yi LI
Date: 15th June 2008
Table of Contents
1.0 Introduction
    1.1 Background
        1.1.1 Microarray Rationale
        1.1.2 Data Representation
    1.2 Research Motivation
        1.2.1 Current Data Preprocessing in Microarray Data
        1.2.2 Research Question and Hypothesis
2.0 Literature Review
    2.1 Literature Review in Machine Learning
        2.1.1 Decision Tree Induction
        2.1.2 Ensemble Learning
        2.1.3 Bagging
        2.1.4 Boosting
        2.1.5 Random Forests
        2.1.6 Optimal Rule based Classifier
    2.2 Literature Review in Statistical Fields
        2.2.1 Feature Selection Methods
        2.2.2 Discretization Methods
3.0 Research Design
    3.1 Hardware and Software
    3.2 Research Methodology
        3.2.1 Problem Definition
        3.2.2 Research Steps
    3.3 Expected Outcomes
    3.4 Contributions
    3.5 Limitations
4.0 Timetable
5.0 Summary
6.0 Reference List
7.0 Bibliography
8.0 Appendices
    Appendix A
    Appendix B
1.0 Introduction
With the great achievements of genome sequencing technology, scientists now have the ability to
identify most genes. However, without effective analysis techniques, current genome sequencing
merely transfers DNA into an electronic format. Scientists hope to understand the functionalities of
genes, especially how they relate to certain diseases.
The development of DNA microarray technology makes it possible for scientists to take snapshots of
gene expression in a single experiment. Microarray technology (Duggan, Bittner, Chen, Meltzer &
Trent 1999; Cheung, Morley, Aguilar, Massimi, Kucherlapati & Childs 1999) is also referred to as
DNA microarrays, DNA arrays, DNA chips, and gene chips (Causton, Quackenbush et al. 2003).
Microarrays effectively transform a living cell from a black box into a transparent box. They
allow scientists to identify the genes that are expressed in different cell types, and thus to study how
expression levels change in different developmental stages or disease states (Causton, Quackenbush et
al. 2003).
1.1 Background
Microarray technology has been widely and successfully used to produce gene expression data that can
reveal important information about how genes work and how they relate to certain diseases.
However, transforming these data into knowledge is not easy: advanced statistical tools
and computational technologies are needed to analyse such huge amounts of complex data.
1.1.1 Microarray Rationale
This brings up another term often used in bioinformatics: gene expression data. The genetic information of
cellular organisms is stored in a long sequence of four different deoxyribonucleotides (A, G, C and
T). These nucleotides make up the DNA molecules that compose the genome of an organism (McLachlan,
Do et al. 2004). The genome contains segments of DNA that encode genes. The information
which uniquely identifies a living cell is transcribed from DNA into messenger RNA (mRNA), and
then translated to build proteins. The whole process is called gene expression (Causton, Quackenbush
et al. 2003). Figure 1.1 shows how DNA is used as a template to form proteins. A
diagram with more detailed information is provided in Appendix A.
According to Causton et al (2003, p.3), since “the relationship between mRNAs and the genes that
encode them can readily be identified, based on the relationship between their sequences”, this
property is exploited in microarray experiments and is considered the rationale of microarray
technology.
Figure 1.1: The information transfer between DNA, mRNA and protein (Causton, Quackenbush et al.
2003). [Diagram: segments of DNA (‘genes’) are transcribed into mRNA, whose abundance is
detected using microarrays; mRNA is translated into protein, whose folding, structure and function
underlie the regulation of gene expression, cell structure, replication, repair and metabolism.]
1.1.2 Data Representation
A microarray is typically a glass or polymer slide onto which DNA molecules are attached at fixed
locations called spots or features (Causton, Quackenbush et al. 2003). An array contains tens of
thousands of spots, and each spot may in turn contain millions of copies of a gene sample. In practice,
the spots are printed onto the microarrays by a robot or ink jet.
As Causton et al (2003, p.40) point out, in glass slide DNA microarray experiments, mRNA
from the cells and tissues of interest is used to generate first-strand complementary DNA (cDNA) labeled with
spectrally distinguishable fluorescent dyes such as Cy3 and Cy5. Regardless of approach, the arrays
are then scanned, generating images for the query and control samples, typically as 16-bit TIFF images.
These images are subsequently analysed to identify the spots and to measure the fluorescence
intensities of each feature (Causton, Quackenbush et al. 2003). Commercial software, such as that
supplied for Affymetrix GeneChip™ arrays, provides effective tools for assistance. With the identification and intensity
measurement done, scientists usually conduct a series of normalization steps, such as linear regression
and mean log centring, to put the results into a similar format so that effective comparisons can be made.
The whole process of a microarray experiment is usually divided into two stages: transformation of
the raw data into a gene expression matrix, and analysis of the gene expression matrix. The term
gene expression matrix appears in both stages.
Gene Expression Matrix. A gene expression matrix is a matrix of rows and columns. According to
Causton et al (2003, p.71), rows in a gene expression matrix represent genes, columns represent
experimental conditions or samples, and the value at each position in the matrix characterises the
expression level of the particular gene under the particular experimental condition.
The expression levels of genes are represented either as absolute levels (in abstract units) or as relative
values transformed into logarithms (Causton, Quackenbush et al. 2003). In the latter case, the original
gene expression values are usually converted into ratios or log ratios, with the information about
absolute values lost (Causton, Quackenbush et al. 2003). For example, the ratios 400/200, 40/20,
and 4/2 all return the same result. Most gene expression data analysis algorithms assume that gene
expression levels are represented as one numerical value per expression level.
Vector Space. Gene expression values can also be represented as vectors in a
multidimensional space. This representation makes it possible to identify and measure similarity and
distance between genes effectively. Many feature selection algorithms require values in vector format.
1.2 Research Motivation
After representing gene expression data as either a gene expression matrix or vectors in a vector space,
many computational and statistical data analysis techniques can be applied to gain insight
into the potential relationships hidden in the huge amount of gene expression data.
1.2.1 Current Data Preprocessing in Microarray Data
Transforming gene expression data into vector space format allows scientists to use data analysis
methods and to visualise different data transformations in the respective vector space. Gathering,
organising, and preparing data for statistical analysis is referred to collectively as preprocessing
(Berrar, Dubitzky et al. 2003). Due to the nature of biological gene structure, microarray data presents
two difficulties for scientists: high dimensionality and a small number of samples.
Currently, researchers use feature selection methods to reduce the dimensionality of gene expression data.
These methods include the Pearson correlation coefficient (PCC), Chi-square, and mutual
information. Feature selection methods are reviewed in the next section.
As mentioned above, gene expression data is usually represented as a series of continuous
numbers. This creates difficulty for scientists for two reasons. Firstly, some data mining methods,
e.g. association rule mining, require data in a discretized form (Han et al, 2006). Secondly, categories
are closer to a knowledge-level representation than continuous values. A proper data discretization
method is therefore needed in microarray data analysis to obtain understandable results.
A number of discretization methods have been applied to microarray data since the late 1990s, for
example the “mid-ranged” method, the “max minus x%” method, and the “x% of highest value”
method. Like other discretization methods, these three rely on statistical parameters
such as the mean, median, and standard deviation. Gene expression values are usually categorized as: i)
under-expressed; ii) balanced; and iii) over-expressed. These three methods are reviewed
in the next section.
The discretization methods currently in use are effective to some extent, and past researchers have used
them to categorize gene expression data for further studies. However, with the development of the
technology and the increasing need for better medical care, the analysis of microarray data requires
better insight. The limitations of current discretization methods have become a bottleneck in the
development of microarray data analysis.
These limitations are various. In the “max minus x%” method, if the maximum
value in the data set turns out to be noise, the whole discretization will be wrong, because it is based
on a wrongly defined maximum value.
The main limitation of these methods is that different methods assign some values to different
categories. In Appendix B, the data points near the solid line risk being categorized
inconsistently by different methods. If such inconsistently categorized data points are used to build
models, they will cause errors in future predictions. An arbitrary discretization cut point may
randomly categorize points whose values lie around the boundaries.
On the other hand, data points far away from the boundaries are clearly categorized. If only the
clearly categorized data points are used for building models, and the data points that fall
between two distinct categories are excluded, the models should be more accurate.
Thus, an optimal discretization method is needed to identify uncertain values so that they can be
discarded when building models.
1.2.2 Research Question and Hypothesis
Based on the discussion above, the following question arises.
Research question: How can a rule-based model be built from microarray data with proper data
discretization?
Proposed hypothesis: A model built by discarding uncertain discretized values in microarray
data will return better accuracy.
In the next section, the machine learning methods used to build classification models are reviewed,
along with the statistical methods used for feature selection and for similarity comparison among
gene vectors.
2.0 Literature Review
Microarray study is a young but rapidly maturing field, and it has attracted attention from many
disciplines, including biology, medicine, mathematics, statistics, machine learning and computer science.
The potential complexities of microarray data are spawning a rich statistical and computational
literature. Researchers from different areas have devoted themselves to extracting
information from raw gene expression data.
This section goes through the methods involved in the proposed honours research. Each
method is presented by abstracting the paper published by its original developer(s):
Ross Quinlan (1986a) for decision tree induction, Quinlan (1986b) for decision tree simplification, Leo
Breiman (1996) for bootstrap aggregating, Breiman (2001) for random forests, Freund, Y. and R. E.
Schapire (1997) for the boosting algorithm, J. Li (2006a) for the optimal rule based classifier, and Li (2006b)
for robust rule based prediction.
2.1 Literature Review in Machine Learning
Transforming gene expression data into knowledge is not an easy task; it is largely a
computational and statistical one. As a broad subfield of artificial intelligence, machine learning speeds
up the learning and processing of microarray data. Machine learning, generally speaking, is the
design and development of algorithms and techniques that allow computers to learn (Simon 1981).
Several machine learning algorithms have been developed and are widely used.
2.1.1 Decision Tree Induction
Decision tree induction is a widely used machine learning algorithm for classification. It was first
developed by J. R. Quinlan of the University of Sydney, Australia, who published a series of papers on
decision tree algorithms from the early 1980s to 2002. Detailed information about his publications
on decision tree induction and machine learning is listed in the annotated bibliography.
Figure 2.1: The TDIDT family (J.R. Quinlan, 1986)
Quinlan first developed the ID3 algorithm for decision tree construction. With the development
of techniques and the increasing need for high-performance classification algorithms, decision tree
construction algorithms have been upgraded to handle more complex problems. The
most widely used algorithms are ID3, ASSISTANT and C4.5. The Top-Down Induction of Decision
Trees (TDIDT) family (shown in Figure 2.1) contains decision tree construction algorithms
developed since 1963.
According to Quinlan (1986a), a decision tree is constructed in a top-down, recursive, divide-and-conquer
manner. It classifies examples (instances) by sorting them down the tree from the root to
some leaf node, which provides the classification of the instance. Each node in the tree specifies an
attribute, and each branch descending from that node corresponds to one of the possible values of this
attribute (Quinlan 1986b). An instance is classified by starting at the root of the tree, evaluating
the attribute specified by that node, and then moving down the tree branch corresponding to the value of
the attribute in that instance.
In addition, the attributes in the tree are theoretically categorical data; if attribute values are
continuous, they should be discretized in advance (Han and Kamber 2006). When construction
finishes, no instance is left out and no remaining attribute requires further partitioning (Han
and Kamber 2006).
There are many measures that can be used to select the best splitting attribute for each node.
Examples of impurity measures include (Quinlan 1986b):
c 1
Entropy (t) = -
 p(i | t ) log
i 0
Yi LI
2
p (i | t )
-7-
c 1
Gini (t) = 1-
 [ p(i | t )]
2
i 0
Classification error (t) = 1- max [ p (i | t )]
i
Where c is the number of classes, and 0log 2 0 = 0 in entropy calculations.
There are other related measures, such as information gain and the gain ratio. Impurity
measures are used to score each attribute, and the attribute giving the greatest impurity reduction
(for example, the highest information gain) is selected as the best splitter to build a child node.
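To make these measures concrete, here is a minimal Python sketch; the class distribution in the example is invented for illustration:

```python
import math

def entropy(p):
    """Entropy of a class probability distribution p."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)  # 0*log2(0) = 0

def gini(p):
    """Gini index of a class probability distribution p."""
    return 1 - sum(pi ** 2 for pi in p)

def classification_error(p):
    """Classification error of a class probability distribution p."""
    return 1 - max(p)

# A node holding 80% positive and 20% negative instances:
print(entropy([0.8, 0.2]))               # ~0.722
print(gini([0.8, 0.2]))                  # 0.32
print(classification_error([0.8, 0.2]))  # 0.2
```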
Decision tree construction algorithms have natural concurrency: once a node is generated, all of its
children in the decision tree can be generated concurrently (Srivastava, Han et al. 1999). A decision
tree can therefore be constructed quickly and does not consume many computational resources.
Decision tree is a non-parametric approach for building classification models (Tan, Steinbach et al.
2005). In other words, it does not require any prior assumptions regarding the type of probability
distributions satisfied by the class and other attributes (Tan, Steinbach et al. 2005).
From a technical point of view, the computational resources needed to build a decision tree are not
expensive, which makes it possible to quickly build a decision tree for a large data set such as gene
expression data.
The output of J48 in Weka is tree-structured: branches stand for conditions, internal nodes stand
for attributes, and leaves are class labels. It is interpretable, and a decision tree can be re-presented as
a set of if-then rules to improve human readability (Quinlan, 1986b).
According to Tan, Steinbach et al (2005, p.169), decision tree algorithms are quite robust to the
presence of noise; pruning methods address the overfitting problem by removing those parts of the
tree that were constructed from noisy values.
Decision tree algorithms have been extensively tested over the years. However, they still have
limitations, for various reasons.
Firstly, the shape of the tree is highly irregular and is determined only at runtime (Srivastava, Han et
al. 1999). Moreover, the size of the decision tree is not necessarily associated with the size of the data
set. For example, a microarray data set usually contains thousands of attributes, yet a cancer
diagnosis may be determined by fewer than ten of them. Under the J48 scheme, Weka weighs the
entropy of each attribute to select the best splitter; alternatively, a data pre-processing filter can be
applied so that Weka performs feature selection before it starts building the model. Whichever
method is used, the constructed tree is small, because only a handful of critical attributes draw
Weka's attention.
In comparison, if a data set contains only 20 attributes, most of which are critical in determining the
splitters, the decision tree built for it will be large (even after pruning). The reason
is that each node is decided by the impurity measures (e.g. entropy or the Gini index): the more critical
an attribute is, the higher the probability that it is selected as a splitter.
Secondly, the amount of work associated with each node also varies, and is data dependent
(Srivastava, Han et al. 1999). Hence any static allocation scheme is likely to suffer from major load
imbalance.
2.1.2 Ensemble Learning
“Ensemble learning is a machine learning paradigm where multiple learners are trained to solve the
same problem” (Z. H. Zhou 2008). Unlike ordinary single-learner machine learning approaches,
which try to obtain one hypothesis from the training data, ensemble learners try to build a set of
hypotheses and combine them to obtain better prediction results. To achieve high prediction
performance, many researchers have begun to combine several classifiers for ensemble learning. Leo
Breiman of the Statistics Department, University of California, Berkeley, proposed bagging, one of
the first widely used ensemble methods, in 1996.
The idea of ensemble learning is shown in Figure 2.2. Base learners are usually generated from
training data by a base learning algorithm, which can be a decision tree, a neural network or another
kind of machine learning algorithm. Basically, an ensemble is constructed in three steps. Firstly,
a number of data sets are created based on the original data set. Secondly, a number of base classifiers
are produced, either in parallel or sequentially; in the latter case the generation
of a base classifier influences the generation of subsequent learners (Z. H. Zhou 2008). Thirdly,
the base classifiers are combined for use.
Figure 2.2: Diagram of ensembles (Tan, Steinbach et al. 2005). [Diagram: Step 1 creates multiple
data sets D1, ..., Dt from the original training data D; Step 2 builds multiple classifiers C1, ..., Ct,
one per data set; Step 3 combines the classifiers into C*.]
The error rate of an ensemble is an accumulation over all of its weak learners. According to Breiman
(1996), an ensemble learner works if the weak learners are independent of each other and their error
rates are below 0.5. For example, if an ensemble has 25 independent base learners, each with error
rate \epsilon = 0.35, then the ensemble errs only when a majority (13 or more) of its base learners err,
giving an ensemble error rate of:

e = \sum_{i=13}^{25} \binom{25}{i} \epsilon^{i} (1 - \epsilon)^{25 - i} \approx 0.06

This example is cited from Tan et al (2005, p.274). On one hand, it illustrates how the idea of
ensemble learning gets its boost. On the other hand, it shows that with base error rates higher than
0.5, the ensemble accumulates error and performs worse than a single learner.
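The calculation can be reproduced with a short Python sketch; the second call, with a base error rate above 0.5, illustrates the ensemble performing worse than a single learner:

```python
from math import comb

def ensemble_error(n=25, eps=0.35):
    """Probability that a majority of the n independent base learners
    (each with error rate eps) are wrong simultaneously."""
    k = n // 2 + 1  # 13 of 25
    return sum(comb(n, i) * eps**i * (1 - eps)**(n - i)
               for i in range(k, n + 1))

print(round(ensemble_error(), 3))         # ~0.06
print(round(ensemble_error(eps=0.6), 3))  # > 0.6, worse than one learner
```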
2.1.3 Bagging
Bagging, also called bootstrap aggregating, bootstrap sampling, is first developed by Leo Breiman in
1996. Breiman (1996) points that that bagging is well-known as a method for estimating standard
errors, bias, and constructing confidence intervals for parameters.
Bagging is usually obtained by sub-sampling the training data set with replacement, where the size of
a sample is as the same of the training data set (Breiman, 1996). It does not add or replace to the
original data set. It just grabs data from the original data set, and then put it into the sub data set which
is generated using the data from original data set. In the process of generating sub data sets, some data
may appear more than once, while some may not appear at all. According to Breiman (1996), each
sample has probability 1 1 / nn of being selected, and the probability that one data appears at least
once is 0.632. Averaging and majority voting are used to combine all classifiers, and the class with
most-voted is predicted (Breiman, 1996).
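A minimal sketch of bootstrap sampling and majority voting, assuming each base classifier is represented as a callable; this is an illustration of the idea, not Breiman's reference implementation:

```python
import random

def bootstrap_sample(data):
    """Draw a bootstrap sample: same size as the data, with replacement,
    so some records repeat and others are left out."""
    return [random.choice(data) for _ in range(len(data))]

def bagging_predict(classifiers, x):
    """Combine the base classifiers by majority vote."""
    votes = [clf(x) for clf in classifiers]
    return max(set(votes), key=votes.count)
```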
2.1.4 Boosting
Boosting is another widely used ensemble learning algorithm. It is firstly developed by Freund Y. and
R. E. Schapire from AT&T Labs in 1996. Boosting is an iterative procedure to adaptively change
distribution of training data by focusing more on previously misclassified records (Freund and
Schapire 1997). The main idea of boosting algorithm is to increase the weights of wrongly predicted
records, so that they have more chance to be selected in the subsequent round. It is aim at boosting
weak learners to strong learners by improving their accuracy rate in each round. Among the various
boosting algorithms applications, AdaBoost is the most used algorithm.
The algorithm can be summarized as follows. Firstly, all N records are assigned equal weights
of 1/N. Unlike in bagging, weights in boosting may change at the end of each round. Then,
from the training data set D_j, the algorithm generates a base learner h_j : X \to Y. Next, it uses the
training examples to test h_j. Records that are classified correctly have their weights decreased,
while records that are classified wrongly have their weights increased, according to:

w_i^{(j+1)} = \frac{w_i^{(j)}}{Z_j} \times \begin{cases} \exp(-\alpha_j) & \text{if } C_j(x_i) = y_i \\ \exp(\alpha_j) & \text{if } C_j(x_i) \neq y_i \end{cases}

where j indexes the round and Z_j is the normalization factor (Freund & Schapire 1997). Thus,
an updated weight distribution D_{j+1} is generated, and the next learner is trained on the data set
with the updated weights. This process is repeated T times, each repetition being called a round.
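A minimal sketch of one round of this weight update; the weights, correctness flags and alpha value are invented for illustration:

```python
import math

def update_weights(weights, correct, alpha):
    """One boosting round: decrease the weights of correctly classified
    records, increase those of misclassified ones, then normalise by Z."""
    new = [w * math.exp(-alpha if ok else alpha)
           for w, ok in zip(weights, correct)]
    z = sum(new)  # normalisation factor Z_j
    return [w / z for w in new]

# Four records, the last one misclassified, alpha = 0.5:
print(update_weights([0.25] * 4, [True, True, True, False], 0.5))
```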
2.1.5 Random Forests
Random Forests is a powerful approach to data exploration, data analysis and predictive
modelling. It was first developed by Leo Breiman, a father of CART (Classification and Regression
Trees), at the University of California, Berkeley, in 2001.
It has its roots in CART, learning ensembles, and committees of experts. Random forests are a
combination of tree predictors in which each tree depends on the values of a random vector sampled
independently and with the same distribution for all trees in the forest (Breiman, 2001). The error rate
of the forest converges to a limit as the number of trees becomes large. Each tree in the forest is grown
by random binary splits, independently of the other trees. The accuracy of a forest depends on the
strength of the individual trees in the forest and the correlation between them (Breiman, 2001).
Random Forests is robust with respect to noise.
2.1.6 Optimal Rule based Classifier
The Optimal Rule based Classifier (ORC) was first developed by J. Li at the University of Southern
Queensland, Australia, in 2006. It was motivated by the need to cope with the large number of
useless rules generated by normal association rule algorithms. ORC is an efficient alternative to
association rule discovery, especially when the minimum support is low (Li 2006a).
Li (2006a) defines a rule whose support and interestingness are both not less than given thresholds as a
strong implication. Li (2006a) then defines general and specific relationships as follows:
Given two rules P \to c and Q \to c, where P \subset Q, the latter is more specific than
the former and the former is more general than the latter.
The basic idea of ORC is that the removal of a specific rule from a rule set does not reduce the total
coverage of the rule set. The rule set with all uninteresting or specific rules removed is an
optimal rule set.
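As a rough illustration of this pruning idea (leaving out the interestingness test that the full algorithm also applies), the sketch below drops any rule whose antecedent is a proper superset of a more general rule with the same consequent; the rules shown are hypothetical:

```python
def prune_specific_rules(rules):
    """Each rule is (antecedent frozenset, consequent). A rule is dropped
    if a more general rule (proper-subset antecedent, same consequent)
    exists, since removing it does not reduce the rule set's coverage."""
    return [(p, c) for p, c in rules
            if not any(q < p and c2 == c for q, c2 in rules)]

rules = [(frozenset({"g1"}), "tumour"),
         (frozenset({"g1", "g2"}), "tumour"),  # more specific: dropped
         (frozenset({"g3"}), "normal")]
print(prune_specific_rules(rules))
```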
Li (2006a) extends these definitions by presenting an algorithm for optimal rule discovery (ORD)
with optimality pruning. The main contribution of the Optimal Rule based
Classifier is that it is significantly more efficient than association rule discovery, independent of data
structure and implementation (Li, 2006a). In addition, it generates accurate rules that all satisfy
certain constraints, even when the minimum support is low.
2.2 Literature Review in Statistical Fields
As many researchers and scientists use statistical methods for data preprocessing in microarray
studies, this section briefly reviews the literature on data preprocessing methods, especially
those used to discretize continuous-valued attributes.
2.2.1 Feature Selection Methods
Although the research experiments are not ultimately aimed at comparing feature selection methods,
the experiments cannot skip them, because feature selection provides effective dimension reduction in gene
expression data and speeds up the data analysis process. This section briefly reviews the feature
selection literature.
Pearson Correlation Coefficient
According to Causton (2003, p.85), the Pearson correlation coefficient (PCC) is a widely used approach
for exploring the similarity of genes and reducing the dimensionality of gene expression data. PCC first
calculates the mean value of each gene expression profile, then mean-centres each profile, and finally
computes the PCC as the cosine of the angle between the mean-centred profiles.
Given two expression profiles A and B with three samples each, they can be represented in a
three-dimensional space as A = (a_1, a_2, a_3) and B = (b_1, b_2, b_3).

Step 1: calculate the means.

\bar{a} = (a_1 + a_2 + a_3)/3 \quad \text{and} \quad \bar{b} = (b_1 + b_2 + b_3)/3

Step 2: mean-centre the profiles.

A' = (a_1 - \bar{a},\ a_2 - \bar{a},\ a_3 - \bar{a}) \quad \text{and} \quad B' = (b_1 - \bar{b},\ b_2 - \bar{b},\ b_3 - \bar{b})

Step 3: compute the PCC.

PCC = \frac{A' \cdot B'}{\lVert A' \rVert\, \lVert B' \rVert}, \quad \text{where } A' \cdot B' = \sum_{i=1}^{n} (a_i - \bar{a})(b_i - \bar{b})

In n-dimensional space, the covariance is calculated as:

Cov(A, B) = \frac{A' \cdot B'}{n - 1}
2.2.2 Discretization methods
There are many papers published in the past few years that give overviews to discretization
techniques used in gene expression data. According to Becquet et al (2002), all quantitative values in
gene expression data have given rise to one Boolean value, that is, true (1) or false (0). Becquet et al
(2002) proposed three different discretization procedures, which are “the max minus x%”, “midranged”, and “x% of highest value” approaches. Assume a given expression data is denoted as d.
Max minus x%
The “max minus x%” method consists of identifying the maximum expression value (MaxValue). It
assigns the value 1 when an expression value is greater than MaxValue \times (100 - x)/100, i.e.
within x% of the maximum; otherwise the value is assigned 0. The default value of x is 25.

v = \begin{cases} 1 & \text{if } d > MaxValue \times (100 - 25)/100 \\ 0 & \text{if } d \le MaxValue \times (100 - 25)/100 \end{cases}
Mid-Ranged
Becquet et al (2002) also analyse the effect of a “mid-ranged” cut-off approach. It involves
identifying the maximum (MaxValue) and minimum (MinValue) values in the expression data. The
mid-range value is set equidistant from these two numbers, that is, at their arithmetic mean (Becquet
et al, 2002). All expression values below or equal to the mid-range value are set to 0, and all values
strictly above it are set to 1.

v = \begin{cases} 1 & \text{if } d > (MaxValue + MinValue)/2 \\ 0 & \text{if } d \le (MaxValue + MinValue)/2 \end{cases}
X% of highest value
The “x% of highest value” approach involves identifying the highest x% of values in the whole data
set, with x = 5 by default. Expression values in that interval are assigned the value 1 and the rest are
set to 0. In the following formula, \delta denotes the lowest value among the highest x% of the data set.

v = \begin{cases} 1 & \text{if } d \ge \delta \\ 0 & \text{if } d < \delta \end{cases}
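A minimal sketch of the three cut-off rules, following the reconstructions above; the sample values are invented:

```python
def max_minus_x(data, x=25):
    """'Max minus x%': 1 if a value lies within x% of the maximum."""
    threshold = max(data) * (100 - x) / 100
    return [1 if d > threshold else 0 for d in data]

def mid_ranged(data):
    """'Mid-ranged': 1 if a value is strictly above the arithmetic mean
    of the maximum and minimum values."""
    mid = (max(data) + min(data)) / 2
    return [1 if d > mid else 0 for d in data]

def x_highest(data, x=5):
    """'x% of highest value': 1 for values among the top x% of the data."""
    k = max(1, round(len(data) * x / 100))
    delta = sorted(data, reverse=True)[k - 1]  # lowest of the top x%
    return [1 if d >= delta else 0 for d in data]

values = [10, 55, 80, 95, 100]
print(max_minus_x(values))  # [0, 0, 1, 1, 1]  (threshold 75)
print(mid_ranged(values))   # [0, 0, 1, 1, 1]  (mid-range 55)
print(x_highest(values))    # [0, 0, 0, 0, 1]
```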
3.0 Research Design
This section describes the research methodology that will guide the whole process of the
experiments, including data collection, pre-processing, conducting experiments, and re-testing.
3.1 Hardware and Software
This research does not require any special equipment. All methods will be implemented and run on a
PC running Linux (Red Hat 8.0) or Windows.
The hardware and software required for conducting the research experiments are:
- A Linux server with an Intel Core 2 Duo 3 GHz CPU, 4 GB of memory, and a 500 GB hard disk
plus a 500 GB external hard disk
- Ubuntu version 8.04
- OpenSSH Secure Shell client, including a secure file transfer client
- The R programming language, or
- Matlab
For network bandwidth, the current fibre optic connection at the UniSA Mawson Lakes
campus provides sufficient bandwidth for file transfer and uploading.
3.2 Research Methodology
A positivist, quantitative methodology has been selected to achieve the major outcomes of this research
thesis. The methodology focuses on verifying the proposed hypothesis, namely that “a model built by
discarding uncertain discretized values in microarray data will return better accuracy”. The
hypothesis aims to answer the research question “How can a rule-based model be built from microarray
data with proper data discretization?”
3.2.1 Problem Definition
Reviewing past papers on discretization in microarray data identifies the existing shortcomings
and limitations of discretizing continuous values in large-scale microarray data. Since
the performance and accuracy of existing discretization methods are still debated, an optimal or new
discretization method designed for microarray data is needed to obtain understandable results.
Current discretization methods may assign some data points to either of two categories. The proposed
research methodology is to combine the existing methods in order to identify the data points whose
categorization is mixed. The proposed method then overcomes the problem by clearly denoting
over-expressed and under-expressed values and setting the uncertain points aside, as sketched below.
If possible, the proposed method will be simplified and extended.
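A minimal sketch of the proposed idea, under the assumption that two of the cut-off rules from Section 2.2.2 are combined (they are restated here so the sketch is self-contained); points on which the methods disagree are marked as uncertain and would be discarded before model building:

```python
def max_minus_x(data, x=25):
    t = max(data) * (100 - x) / 100
    return [1 if d > t else 0 for d in data]

def mid_ranged(data):
    m = (max(data) + min(data)) / 2
    return [1 if d > m else 0 for d in data]

def consensus_discretize(data):
    """Keep only the points on which the two cut-offs agree; boundary
    points where they disagree become None (uncertain, to be discarded)."""
    return [a if a == b else None
            for a, b in zip(max_minus_x(data), mid_ranged(data))]

values = [10, 55, 60, 80, 100]
print(consensus_discretize(values))  # [0, 0, None, 1, 1]
```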
The purpose of this project is to compare differences among discretization methods; the bias caused
by different classification models and feature selection methods is therefore outside the scope of this
project.
3.2.2 Research Steps
This honours project consists of the following steps:
1. Collect 4~5 microarray data sets.
Most microarray data sets are not openly available to the public, and some are for research purposes
only. The quality of the data sets will affect the experimental results, so the microarray data sets
should be collected from authorised universities or research institutions. For example, the School of
Medicine at Stanford University has done remarkable work in lung cancer and liver tumour studies.
2. Data preprocessing.
This is the most important and time-consuming step in this project. In the data preprocessing step, the
following work needs to be done:
- Data integration and format normalization – data sets that describe the same domain of
information but are obtained from different sources may vary in format and attribute names.
First, the various data sets should be integrated according to unified attribute names. This
involves transforming attribute names from probe names to a unified list of gene names.
- Removal of replicated values – duplicated values should be removed to reduce data redundancy.
The removal of duplicated data does not reduce the coverage of the data.
- Handling of missing values – missing values in a data set should be handled, either by removing
them or by other methods, such as replacing them with the mean of the other values or with the
most frequently occurring value. Missing values will affect the results of the experiments, since
some data mining techniques are not robust in the presence of missing values.
Apart from those,
- Discretization – this is what needs to be figured out: which discretization method is the most
suitable for microarray data analysis?
- Feature selection – i.e. attribute subset selection. It aims to select a minimum set of features
such that the probability distribution of the different classes given the values of those features is
as close as possible to the original distribution (Han et al, 2006).
3. Select classification models to build models.
As this project does not aim to compare the performance of classification models, it only requires
selecting the most commonly used classification algorithms, such as C4.5 and ORC (Li, 2006a).
4. Obtain and analyse the experimental results.
This step has not been reached yet. According to the methodology, once all the experimental results
are available, they will be ranked for each discretization method, in order to determine which
discretization method achieves the best accuracy.
5. Conclude the project.
This step mainly involves thesis writing and finalizing the experiments.
3.3 Expected Outcomes
As mentioned in the previous section, the expected outcomes of the project include:
- A proposed discretization method, including algorithms, code, and explanations
- A hypothesis verified (or refuted) on the basis of the research
- A written thesis
3.4 Contributions
The proposed discretization method will not only be a useful and powerful approach for discretizing
microarray data, but will also allow more data mining methods, e.g. association rules and tree-based
models, to be used on microarray data.
In addition, the proposed discretization method will enhance the accuracy of the models that use it,
thereby improving the performance of classification and the prediction of some diseases.
Last but not least, the improved performance of classification models may contribute to
medical and cancer diagnosis.
3.5 Limitations
Every newly proposed algorithm requires tens of thousands of tests and evaluations. Due to
restrictions of time and resources, this project may have the following limitations.
The data sets collected may not be representative of the whole population. Microarray slides are
obtained under certain experimental conditions, and one microarray sample is obtained from one
particular condition; a particular microarray slide therefore cannot stand for the whole population of
gene expression results. If an experiment draws a conclusion based only on that particular microarray
slide, the result may be due to random chance, since the sample is not big enough. In this project,
only a few microarray data sets are available, and a result drawn from an experiment with a small
number of samples is not a convincing result or hypothesis.
The variation and bias caused by choosing different classification algorithms and feature selection
methods will affect the results to some degree; however, this cannot be overcome, due to the resource
limitations.
4.0 Timetable
The time plan for the remainder of the research project is summarized as follows:

#  Task name          Duration  Start date   Expected finish date  Comments
1  Research proposal  2 wks     1 June 08    15 June 08            Done
2  Data collection    0.5 wks   1 July 08    5 July 08
3  Pre-processing     2.5 wks   8 July 08    25 July 08            Tasks 3~5 may be done recursively
4  Data analysing     4 wks     26 July 08   25 Aug 08
5  Running            6 wks     27 Aug 08    8 Oct 08
6  Re-testing         1.5 wks   9 Oct 08     16 Oct 08
7  Thesis writing     2 wks     17 Oct 08    25 Oct 08             Done in parallel with other tasks

- Late June is revision time for the exams of Study Period 2, 2008.
- The date arrangement from July to October 2008 does not count public holidays or weekends.
- The assumed workload is 6 hours/day.
- All files and data will be backed up regularly to prevent loss from human mistakes, equipment
failure, or other unpredictable circumstances, and to prevent project delay caused by information loss.
- Weekly meetings will be held with the supervisor or research colleagues.
- This timetable is subject to adjustment.
5.0 Summary
The aim of this project is to build rule-based models from microarray data. Microarray technology has
made it possible for biologists to effectively extract similar patterns of genes and to cluster genes with
similar functionalities and structures. Scientists hope to identify how genes are expressed in different
cell types, and thus to find relationships between cells' developmental stages and disease states.
However, transforming this large amount of data into knowledge is not an easy task.
With the help of data mining technologies, scientists can find relationships and patterns in
gene expression data, especially those which are signatures of certain diseases. In addition, they
can classify genes into classes using classification models, and make predictions for new instances.
In analysing microarray data, data preprocessing is very important. Gene expression data has high
dimensionality and a small number of samples, so scientists usually use feature selection methods to
reduce the dimensionality.
Gene expression data is usually represented as continuous numbers. Some data mining methods, e.g.
association rule mining, require the data in discretized format. Moreover, categorized values are
more understandable than continuous values. For these two reasons, data discretization is required in
microarray data analysis.
Currently used discretization methods have many limitations. For example, different methods assign
some data points near the boundary to different categories. If scientists use all data points to build
models, those values will increase the error rate of the models.
A hypothesis is therefore proposed: a model built by discarding uncertain discretized values in
microarray data will return better accuracy.
A series of experiments will be conducted to verify this hypothesis. 4~5 microarray data sets will be
collected, and several feature selection methods will be implemented, to conduct experiments that will
test whether the hypothesis holds.
Reference List
Babu, M. M. (2004a). "Introduction to microarray data analysis." In Computational Genomics: Theory
and Application, ed. R. P. Grant. Norwich, Horizon Bioscience.
Babu, M. M. (2004b). "An introduction to microarray data analysis." MRC LMB page, visited 15 June
2008, <http://www.mrc-lmb.cam.ac.uk/genomes/madanm/microarray/>.
Breiman, L. (1996). "Bagging Predictors." Machine Learning 24(2): 123-140.
Breiman, L. (2001). "Random Forests." Machine Learning 45(1): 5-32.
Becquet, C., S. Blachon, et al. (2002). "Strong-association-rule mining for large-scale gene-expression data analysis: a case study on human SAGE data." Genome Biology 3(12): research0067.1-0067.16.
Berrar, D. P., W. Dubitzky, et al. (2003). A practical approach to microarray data analysis. Boston ;
London, Kluwer Academic.
Cheung, V. G., Morley, M., Aguilar, F., Massimi, A., Kucherlapati, R., & Childs, G. (1999). "Making
and reading microarrays." Nature Genetics 21, 15-19.
Causton, H. C., J. Quackenbush, et al. (2003). Microarray Gene Expression Data Analysis: A
Beginner's Guide. Malden, MA, Blackwell Publishing.
Freund, Y. and R. E. Schapire (1997). "A Decision-Theoretic Generalization of On-Line Learning and
an Application to Boosting." Journal of Computer and System Sciences 55(1): 119-139.
Han, J. and M. Kamber (2006). Data Mining: Concepts and Techniques, Morgan Kaufmann.
Li, J. (2006a), ‘On optimal rule discovery’, IEEE Transactions on Knowledge and Data Engineering
18 (4), 460-471.
Li, J. (2006b), ‘Robust rule based prediction’, IEEE Transactions on Knowledge and Data
Engineering 18 (8), 1043-1054.
McLachlan, G. J., K. A. Do, et al. (2004). Analyzing microarray gene expression data. Hoboken, N.J.,
Wiley-Interscience.
Madeira, S. C. and A. L. Oliveira An Evaluation of Discretization Methods for Non-Supervised
Analysis of Time-Series Gene Expression Data, Technical Report 42, INESC-ID, December 2005.
Pensa, R. G., C. Leschi, et al. (2004). "Assessment of discretization techniques for relevant pattern
discovery from gene expression data." Proceedings ACM BIOKDD: 24–30.
Quinlan, J. R. (1986a). Simplifying Decision Trees, Massachusetts Institute of Technology.
Quinlan, J. R. (1986b). "Induction of Decision Trees." Mach. Learn. 1(1): 81-106.
Simon, H. (1981). The Sciences of the Artificial. Cambridge, MA, MIT Press.
Srivastava, A., E. H. Han, et al. (1999). "Parallel Formulations of Decision-Tree
Classification Algorithms." Data Mining and Knowledge Discovery 3(3): 237-261.
Tan, P. N., M. Steinbach, et al. (2005). Introduction to Data Mining, Addison-Wesley
Longman Publishing Co., Inc. Boston, MA, USA.
Zhou, Z. H. (2008). "Ensemble learning." Encyclopedia of Database Systems.
Bibliography
1. Boulesteix, A. L., G. Tutz, et al. (2003). A CART-based approach to discover emerging
patterns in microarray data, Oxford Univ Press. 19: 2465-2472.
The authors, researchers at the University of Munich, Germany, have been working on cancer
diagnosis. Cancer diagnosis uses gene expression profiles, which require supervised learning and gene
selection methods. After trying many suggested approaches, the authors find that the method of
emerging patterns (EPs) has the particular advantage of explicitly modelling interactions among
genes, which may improve classification accuracy. They introduce a CART-based approach to
discover EPs in microarray data. This tree-based method is computationally fast and intuitive, and also
assigns statistical relevance to the identified patterns. The authors assess the performance of their
pattern search algorithm and classification procedure on simulated data and on gene expression data
from colon and leukemia cancer experiments. They conclude that the new approach provides a
versatile and computationally fast tool for elucidating local gene interactions and classification.
2. Kluger, Y., R. Basri, et al. (2003). "Spectral Biclustering of Microarray Data:
Coclustering Genes and Conditions." Genome Research 13(4): 703.
The authors, researchers at Yale University, USA, observe that the classification problems in DNA
expression are linked, and that researchers are usually more interested in finding "marker genes" that
are differentially expressed in particular sets of conditions. The authors have developed a method that
simultaneously clusters genes and conditions, finding distinctive "checkerboard" patterns in
matrices of gene expression data. The method introduced in this paper, spectral biclustering, is based
on the observation that checkerboard structures in expression data can be found in eigenvectors
corresponding to characteristic expression patterns across genes or conditions. The authors first
apply the singular value decomposition (SVD), coupled with integrated data normalization techniques.
Spectral biclustering is then applied to publicly available cancer expression data sets, in order
to examine the degree to which the approach is able to identify checkerboard (marker data)
structures.
3. McShane, L. M., M. D. Radmacher, et al. (2002). Methods for assessing reproducibility
of clustering patterns observed in analyses of microarray data, Oxford Univ Press. 18:
1462-1469.
The authors, researchers at the Biometric Research Branch of the National Cancer Institute, USA, note
that cDNA microarray technology has made it possible to simultaneously interrogate thousands of genes
in a biological specimen. According to their experiments, however, clustering
algorithms always detect clusters, even in random data, and it is easy to misinterpret the results
without some objective measure of the reproducibility of the clusters. To address this, the
authors present a series of statistical methods for testing for overall clustering of gene expression
profiles, and define interpretable measures of cluster-specific reproducibility that facilitate
understanding of the cluster structure. The authors then apply these methods to elucidate structure in
cDNA microarray gene expression profiles obtained from melanoma tumors and prostate specimens.
4. Tuzhilin, A. and G. Adomavicius (2002). "Handling very large numbers of association
rules in the analysis of microarray data." Proceedings of the eighth ACM SIGKDD
international conference on Knowledge discovery and data mining: 396-404.
In this paper, the authors propose using association rule discovery methods to determine associations
among the expression levels of different genes. One of the main problems in such discovery is
scalability: microarray data sets are usually very large, so their analysis may generate a number of
associations often measured in millions, and processing them may take a long time, measured in
weeks at least. The authors present a method to enable biologists to evaluate these very large numbers
of discovered association rules in data mining processes. This is achieved by providing several rule
evaluation operators, e.g. rule grouping, filtering, and browsing, that allow biologists to validate
multiple individual gene regulation patterns at the same time.
5. Wilson, D. L., M. J. Buckley, et al. (2003). New normalization methods for cDNA
microarray data, Oxford Univ Press. 19: 1325-1332.
In this paper, the authors present two new normalization methods for cDNA microarrays. After image
analysis has been applied, some sort of normalization must be applied to the microarrays before
proceeding to detect differentially expressed genes. According to their study, normalization removes
biases towards one or other of the fluorescent dyes used to label each mRNA sample, allowing
proper evaluation of differential gene expression. The outcome of their study is an extension of
non-linear normalization techniques: they first introduce a normalization method that deals with
smooth spatial trends in intensity across microarrays, and then deal with normalization of a new type
of cDNA microarray experiment.
6. Li, J., X. Tang, et al. (2008). "A novel approach to feature extraction from classification
models based on information gene pairs." Pattern Recognition 41(6): 1975-1984.
The authors, who conduct a series of experiments on DNA microarrays with data mining technologies,
have found that one of the major challenges of analysing microarray data is how to extract and select
efficient features for accurate cancer classification. To improve accuracy, the authors introduce a new
feature extraction and selection method based on information gene pairs. Using five public microarray
data sets, they demonstrate that the feature subset selected by the proposed method performs well.
After comparing with results generated by other methods, they confirm that their new method can
improve the accuracy of cancer prediction, including for breast cancer, adenocarcinoma and myeloid
leukemia.
7. Gregory, P.-S. and T. Pablo (2003). "Microarray data mining: facing the challenges."
SIGKDD Explor. Newsl. 5(2): 1-5.
The authors first agree that microarrays are a revolutionary new technology with great potential
to provide accurate medical diagnostics, help find the right treatment and cure for many diseases, and
provide a detailed genome-wide molecular portrait of cellular states. However, they also point out
several problems and issues on which microarrays can be improved; for example, current methods
have problems in gene selection, classification, clustering and visualization. One important goal of
current and future computational analysis methods, short of reverse engineering the entire cell
circuitry, should be to reduce the search space and help expose the most promising candidates, i.e.
genes, proteins, drugs, etc. Although current methods already succeed, better accuracy and more
robust models and estimators are welcome.
8. Hong, H., L. Jiuyong, et al. (2006). A comparative study of classification methods for
microarray data analysis. Proceedings of the fifth Australasian conference on Data Mining
and Analytics - Volume 61. Sydney, Australia, Australian Computer Society, Inc.
These authors, researchers at the University of Southern Queensland, conduct a series of experiments
on DNA microarray data using ten-fold cross-validation tests. To compare currently used
methods such as SVMs, decision trees, Bagging, Boosting and Random Forests, they apply LibSVM,
C4.5, Bagging C4.5, AdaBoosting C4.5, and Random Forests to seven microarray cancer data sets. The
results indicate that all the ensemble methods outperform C4.5, and that all five ensemble learning
methods benefit, in classification accuracy, from data pre-processing, including gene selection and
discretization.
9. Chaolin, Z., L. Xuesong, et al. (2006). "Significance of Gene Ranking for Classification
of Microarray Samples." IEEE/ACM Trans. Comput. Biol. Bioinformatics 3(3): 312-320.
The authors point out that evaluating the statistical significance of gene rankings is important for
understanding the results and for further biological investigation; however, this question has not been
well addressed for machine learning methods in existing work. To remedy this, the authors formulate
the problem in the framework of hypothesis testing and propose a solution based on re-sampling:
their R-test converts gene ranking results into position p-values to evaluate the significance of genes.
After testing on three real microarray data sets and three simulated data sets with support vector
machines, the authors suggest that the p-values may enable scientists to analyse selection results with
sophisticated multivariate methods under the same statistical inference paradigm.
10. Shen, L. and E.C. Tan (2005). "Dimension Reduction-Based Penalized Logistic
Regression for Cancer Classification Using Microarray Data." IEEE/ACM Trans.
Comput. Biol. Bioinformatics 2(2): 166-175.
In this paper, the authors present the use of penalized logistic regression for cancer classification
using microarray expression data. They introduce two dimension reduction methods, which are combined
with penalized logistic regression to improve classification accuracy and computational speed. They
also compare against two other machine learning methods, support vector machines and least-squares
regression. The advantages of the two new methods include explicit probabilistic output and the
selection of penalty parameters and components; the authors also discuss the application of the
methods to cancer classification.
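A minimal sketch of the pipeline, with PCA standing in for the paper's two reduction methods and an L2 (ridge) penalty on the logistic regression:

```python
# Sketch: dimension reduction followed by penalized (L2) logistic regression.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=100, n_features=2000, random_state=0)

clf = make_pipeline(
    PCA(n_components=20),                     # reduce 2000 genes to 20 components
    LogisticRegression(penalty="l2", C=1.0),  # ridge-penalized logistic regression
)
print(cross_val_score(clf, X, y, cv=10).mean())
```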
11. Yu, L. and H. Liu (2004). Redundancy based feature selection for microarray data.
Proceedings of the tenth ACM SIGKDD international conference on Knowledge
discovery and data mining. Seattle, WA, USA, ACM.
The two authors, who research at Arizona State University, work on redundancy-based feature selection
for microarray data. They point out that a central difficulty in gene discrimination is the
combination of a large number of features (genes) with a small sample size, and that traditional
methods handle this problem without paying attention to the high degree of redundancy among the genes.
They argue that removing redundant genes from the selected set yields a better representation of the
characteristics of the targeted phenotypes and leads to improved classification accuracy. Furthermore,
they study the relationship between feature relevance and redundancy and propose an efficient method
that effectively removes redundant genes. The results are compared with those of experiments on
public microarray data sets.
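A greedy sketch of the relevance-versus-redundancy idea (a simplification of my own, not the authors' exact algorithm) shows the mechanism:

```python
# Sketch: greedy filter that picks relevant genes while skipping genes that
# are highly correlated with an already selected gene.
import numpy as np

def select_nonredundant(X, y, k=20, redundancy_cut=0.8):
    relevance = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
    selected = []
    for j in np.argsort(-relevance):  # most relevant genes first
        if all(abs(np.corrcoef(X[:, j], X[:, s])[0, 1]) < redundancy_cut
               for s in selected):
            selected.append(j)
        if len(selected) == k:
            break
    return selected

rng = np.random.default_rng(0)
X = rng.normal(size=(80, 300))
X[:, 1] = X[:, 0] + 0.05 * rng.normal(size=80)  # gene 1 is redundant with gene 0
y = (X[:, 0] > 0).astype(float)
print(select_nonredundant(X, y, k=5))  # gene 1 should be filtered out
```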
12. Cho, S-B & Won, H-H 2003, Machine learning in DNA microarray analysis for
cancer classification, Australian Computer Society, Inc., Adelaide, Australia.
The authors find that machine learning can usefully be applied to the analysis of DNA microarray data
sets and can play an important role in it. In this paper, they explore many features and classifiers,
using three benchmark datasets to systematically evaluate the performance of feature selection
methods and machine learning classifiers. The three benchmark datasets are the Leukemia, Colon and
Lymphoma cancer datasets. Various techniques and classification algorithms are involved: the cosine
coefficient and information gain for feature selection, and the k-nearest neighbour, support vector
machine and self-organizing map (SOM) for classification. Based on these experiments, they conclude
that ensemble learning with several basis classifiers produces the best recognition rate.
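A small sketch of one such feature-selection/classifier pairing, using mutual information as scikit-learn's closest analogue of information gain and synthetic data in place of the benchmark sets:

```python
# Sketch: information-gain-style feature selection followed by a k-NN classifier.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

X, y = make_classification(n_samples=72, n_features=1000, random_state=0)

clf = make_pipeline(
    SelectKBest(mutual_info_classif, k=50),  # keep the 50 most informative genes
    KNeighborsClassifier(n_neighbors=3),
)
print(cross_val_score(clf, X, y, cv=5).mean())
```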
13. Wan, R, Mamitsuka, H & Aoki, KF 2005, Cleaning microarray expression data
using Markov random fields based on profile similarity, ACM, Santa Fe, New Mexico.
The authors, who research at Kyoto University, Japan, propose a method for cleaning the noise found
in microarray expression data sets, which may improve the data pre-processing stage of
microarray research today. The method is based on Markov random fields (MRFs), and the cleaning
process is guided by genes with similar expression profiles. It consists of two steps: in the first
step, the expression data are used to infer the profile similarity between each pair of genes; in the
second step, these similarities are used to construct an MRF over the genes and their associated
expression values.
14. McIntosh, T & Chawla, S 2007, 'High Confidence Rule Mining for Microarray Analysis',
IEEE/ACM Trans. Comput. Biol. Bioinformatics, vol. 4, no. 4, pp. 611-623.
In this paper, the authors present an association rule mining method for mining high-confidence
rules, which describe interesting gene relationships, from microarray data sets. They note that a
family of row-enumeration rule mining algorithms has emerged to facilitate mining in dense data sets;
these algorithms rely on pruning infrequent relationships and reduce the search space using the
support measure. MAXCONF, the new method proposed by the authors for mining high-confidence rules
from microarray data, is instead a support-free algorithm that directly uses the confidence measure
to prune the search space effectively. Finally, experiments on three microarray data sets show that
MAXCONF outperforms support-based rule mining in scalability and rule extraction.
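As a reminder of the two measures involved (generic definitions, not the MAXCONF algorithm itself), a tiny sketch over binarized expression rows:

```python
# Sketch: support vs. confidence for a rule A -> B over binarized
# ("gene over-expressed?") microarray rows.
import numpy as np

rng = np.random.default_rng(0)
data = rng.random((50, 4)) > 0.5  # 50 samples x 4 binarized genes

def support(cols):
    # Fraction of rows in which all the given genes are over-expressed.
    return data[:, cols].all(axis=1).mean()

def confidence(antecedent, consequent):
    # P(consequent | antecedent) = support(A and B) / support(A)
    return support(antecedent + consequent) / support(antecedent)

print("support({g0,g1}):", support([0, 1]))
print("confidence(g0 -> g1):", confidence([0], [1]))
```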
15. Castelo, R & Roverato, A 2006, 'A Robust Procedure for Gaussian Graphical Model Search
from Microarray Data with p Larger than n', J. Mach. Learn. Res., vol. 7,
pp. 2621-2650.
In this paper, the authors consider limited-order partial correlations, that is, partial correlations
computed on marginal distributions of manageable size. Compared with full-order partial correlations,
the prime objects of inference, which are partial correlations between two variables given all the
remaining ones, limited-order partial correlations remain computable when p exceeds n; the authors
provide a set of rules for assessing how useful these quantities are for deriving the independence
structure of the underlying Gaussian graphical model. They also introduce a novel structure learning
procedure based on a quantity called the non-rejection rate. The applicability and usefulness of the
procedure are demonstrated on both simulated and real data.
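For reference, the first-order case (the simplest limited-order partial correlation) follows a standard formula; a small numeric check with synthetic variables x, y, z:

```python
# Sketch: first-order partial correlation r(x, y | z) from pairwise
# correlations -- the standard formula, not the authors' full procedure.
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=1000)
x = z + 0.5 * rng.normal(size=1000)  # x and y are linked only through z
y = z + 0.5 * rng.normal(size=1000)

r_xy = np.corrcoef(x, y)[0, 1]
r_xz = np.corrcoef(x, z)[0, 1]
r_yz = np.corrcoef(y, z)[0, 1]

partial = (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))
print(r_xy, partial)  # the partial correlation should be near zero
```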
16. He, H, Jin, H, Chen, J, McAullay, D, Li, J & Fallon, T 2006, Analysis of
breast feeding data using data mining methods, Australian Computer Society, Inc.,
Sydney, Australia.
In this paper, the authors use data mining methods to analyse breast feeding data. They aim to
demonstrate the benefit of applying data mining techniques to survey data where statistical analysis
is normally used. Questionnaires were sent out to collect data on the decision whether or not to
breast feed a newborn baby, and typical data mining methods and algorithms, such as decision trees,
regression approaches and information gain, were used to identify groups at high risk of not breast
feeding. The outcome of this study not only confirms the survey results but also suggests that data
mining approaches are applicable to other, similar survey data. Data mining methods, which enable a
search for hypotheses, may be used as a complementary tool to traditional statistical analysis.
17. Tang, Y, Zhang, Y-Q & Huang, Z 2007, 'Development of Two-Stage SVM-RFE Gene
Selection Strategy for Microarray Expression Data Analysis', IEEE/ACM Trans. Comput.
Biol. Bioinformatics, vol. 4, no. 3, pp. 365-381.
The authors point out that, although many data mining methods have been applied in the data
preparation step of cancer classification, the Support Vector Machine Recursive Feature Elimination
(SVM-RFE) algorithm is one of the best gene feature selection algorithms. Unlike other related
studies, they develop a new two-stage SVM-RFE algorithm: the first stage is designed to effectively
eliminate most of the irrelevant, redundant and noisy genes while keeping the information loss small,
and the second stage conducts a fine selection for the final gene subset. According to the authors,
the new method overcomes the instability problem of SVM-RFE and achieves better algorithm utility,
and the two-stage SVM-RFE is more accurate and reliable than the traditional SVM-RFE.
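A rough two-stage analogue using scikit-learn's generic RFE (the published algorithm differs in detail) illustrates the coarse-then-fine idea:

```python
# Sketch: recursive feature elimination with a linear SVM, run coarsely
# and then finely -- a loose analogue of a two-stage SVM-RFE.
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_features=2000, random_state=0)
svm = SVC(kernel="linear")

# Stage 1: aggressive elimination, dropping 10% of the genes per step.
coarse = RFE(svm, n_features_to_select=200, step=0.1).fit(X, y)
X_coarse = coarse.transform(X)

# Stage 2: fine elimination, one gene at a time, down to the final subset.
fine = RFE(svm, n_features_to_select=20, step=1).fit(X_coarse, y)
print(X_coarse.shape, fine.transform(X_coarse).shape)
```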
18. Dudoit, S, van der Laan, MJ, Keleş, S, Molinaro, AM, Sinisi, SE & Teng, SL 2003,
'Loss-based estimation with cross-validation: applications to microarray data analysis',
SIGKDD Explor. Newsl., vol. 5, no. 2, pp. 56-68.
The authors study loss-based estimation with cross-validation in application to microarray data
analysis. They propose a unified loss-based methodology for estimator construction,
selection and performance assessment using cross-validation. Unlike traditional methods, the
parameter of interest is defined as the risk minimizer for a suitable loss function, and candidate
estimators are generated using this loss function. Cross-validation is then applied to select an
optimal estimator among the candidates and to assess the overall performance of the resulting
estimator. The methodology can be applied to the prediction of biological and clinical outcomes.
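The general recipe, selecting among some hypothetical candidate estimators by cross-validated risk under a chosen loss (log-loss here), can be sketched as:

```python
# Sketch: pick the candidate estimator with the smallest cross-validated
# risk -- the general recipe, not the authors' specific estimators.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=100, n_features=500, random_state=0)

candidates = {f"C={c}": LogisticRegression(C=c, max_iter=1000)
              for c in (0.01, 0.1, 1.0, 10.0)}
# Cross-validated risk = mean log-loss over the held-out folds.
risks = {name: -cross_val_score(m, X, y, cv=10,
                                scoring="neg_log_loss").mean()
         for name, m in candidates.items()}
best = min(risks, key=risks.get)
print(risks, "->", best)
```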
19. Rocha, M, Mendes, R, Maia, P, Glez-Peña, D & Fdez-Riverola, F 2007, A platform for the
selection of genes in DNA microarray data using evolutionary algorithms, ACM, London,
England.
The authors provide a platform for the selection of genes in DNA microarray data using evolutionary
algorithms. Their idea is to present a flexible framework for the task of feature selection in the
classification of DNA microarray data. Evolutionary algorithms with variable-sized, set-based
representations are used to reduce the number of attributes, and three distinct classifiers,
1-nearest neighbour, decision trees and SVMs, are compared in two case studies to demonstrate the
performance of the new platform.
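A bare-bones evolutionary search over gene subsets (a toy version; the authors' platform is far richer, with crossover and variable-sized set representations) conveys the idea:

```python
# Sketch: evolve boolean gene-subset masks by selection and mutation,
# scoring each mask by cross-validated 1-NN accuracy.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=80, n_features=200, n_informative=10,
                           random_state=0)

def fitness(mask):
    if mask.sum() == 0:
        return 0.0
    clf = KNeighborsClassifier(n_neighbors=1)  # 1-NN, as in the paper
    return cross_val_score(clf, X[:, mask], y, cv=3).mean()

# Initial population of random, sparse gene subsets (boolean masks).
pop = rng.random((20, X.shape[1])) < 0.05
for generation in range(15):
    scores = np.array([fitness(m) for m in pop])
    parents = pop[np.argsort(-scores)[:10]]    # keep the best half
    children = parents.copy()
    flips = rng.random(children.shape) < 0.01  # mutation: flip gene bits
    children ^= flips
    pop = np.vstack([parents, children])

best = pop[np.argmax([fitness(m) for m in pop])]
print("genes selected:", int(best.sum()))
```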
20. Markov, Z & Russell, I 2006, 'An introduction to the WEKA data mining system',
SIGCSE Bull., vol. 38, no. 3, pp. 367-368.
As software that is widely used in the data mining community and written in Java, WEKA plays an
important role in my project on rule-based models for DNA microarray data. In this article, the
authors give an overview of the WEKA data mining system; the goal of the article is to introduce its
packages and the pedagogical possibilities for its use. WEKA contains a rich range of powerful
machine learning algorithms for data mining tasks, covering pre-processing, classification and
clustering, all behind a graphical user interface. The article introduces the topics covered by WEKA,
including data pre-processing and visualization, attribute selection, association rules,
classification algorithms (OneR, decision trees and covering rules), prediction algorithms,
evaluation techniques and clustering (k-means, EM, Cobweb).
21. Rosset, S 2005, Robust boosting and its relation to bagging, ACM, Chicago, Illinois,
USA.
The author points out that boosting and bagging are two approaches to combining weak models in order
to build prediction models that are significantly better. He also notes the general theoretical and
practical consensus that the weak learners for boosting should be really weak, while the weak
learners for bagging should actually be strong. He presents an approach of weight decay for
observation weights, which is equivalent to robustifying the underlying loss function, illustrates
the practical usefulness of weight decay for improving prediction performance, and presents an
equivalence between one form of weight decay and Huberizing, a statistical method for making loss
functions more robust.
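To see what Huberizing does to a loss function, a small sketch of the generic Huber loss (not the paper's boosting-specific derivation):

```python
# Sketch: squared loss vs. Huber loss on residuals -- Huberizing switches
# to a linear penalty beyond a threshold, so outliers get bounded influence.
import numpy as np

def squared_loss(r):
    return r ** 2

def huber_loss(r, delta=1.0):
    quad = 0.5 * r ** 2
    lin = delta * (np.abs(r) - 0.5 * delta)
    return np.where(np.abs(r) <= delta, quad, lin)

residuals = np.array([0.1, 0.5, 1.0, 5.0, 10.0])
print(squared_loss(residuals))  # explodes for large residuals
print(huber_loss(residuals))    # grows only linearly for large residuals
```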
22. Hu, H, Li, J, Wang, H, Daggard, G & Shi, M 2006b, A maximally diversified
multiple decision tree algorithm for microarray data classification, Australian Computer
Society, Inc., Hobart, Australia.
In this paper, the authors investigate the idea of using diversified multiple trees for microarray
data classification. They propose the Maximally Diversified Multiple Trees (MDMT) algorithm, which
makes use of a set of unique trees in the decision committee. Comparing MDMT with some well-known
ensemble methods, they find that both MDMT and CS4 are, on average, more accurate than AdaBoost,
Bagging and Random Forests. Comparing the two algorithms with each other: CS4 is capable of finding
combinations of informative genes with informative genes, or of informative genes with less
informative genes, while MDMT is capable of discovering combinations of informative genes with
informative genes, and of less informative genes with less informative genes. Both have strengths and
weaknesses, and the authors also discuss possible improvements.
23. Chen, X, Li, J, Daggard, G & Huang, X 'Finding Similar Patterns in Microarray Data',
Lecture Notes in Computer Science, pp. 1272-1276.
In this paper, the authors propose a clustering algorithm called s-Cluster for the
pattern-similarity-based analysis of gene expression data. According to the authors, the algorithm
captures the right clusters exhibiting strongly similar expression patterns in microarray data.
Unlike other algorithms, s-Cluster allows a high level of overlap among discovered clusters without
completely grouping them, which reflects the biological fact that not all gene functions are turned
on in a given experiment. They apply the s-Cluster algorithm to yeast Saccharomyces cerevisiae cell
cycle expression data, and the algorithm is
shown to group genes with strongly similar expression patterns and to produce clusters that are
interpretable.
24. Li, J, Topor, R & Shen, H 2002, 'Construct robust rule sets for classification',
Proceedings of the eighth ACM SIGKDD international conference on Knowledge
discovery and data mining, pp. 564-569.
In this paper, the authors study the problem of computing classification rule sets from relational
databases. Traditional data pre-processing aims to improve prediction accuracy on data without
missing attribute values, so traditional methods do not work properly on real data; in other words,
they only work for training data that is perfect and has no missing values. The concept introduced in
this paper is more robust than others: it is able to make more accurate predictions on test data with
missing attribute values. The k-optimal rule sets introduced by the authors form a hierarchy in which
decreasing size corresponds to decreasing robustness. Two methods for finding k-optimal rule sets are
then introduced, namely an optimal association rule mining approach and a heuristic approximate
approach, and the authors use experiments to confirm that a k-optimal rule set performs better on
test data than a typical classification rule set.
25. Li, J, Shen, H & Topor, R 2004, 'Mining Informative Rule Set for Prediction', Journal of
Intelligent Information Systems, vol. 22, no. 2, pp. 155-174.
In this paper, the authors define a new rule set, called the informative rule set, for mining
transaction databases. The informative rule set is much smaller than the traditional association rule
set but makes the same predictions, since many of the rules generated by traditional association rule
mining are unnecessary. Its advantages are that it is not constrained to particular target items and
that it is smaller than the non-redundant association rule set. The authors also present an algorithm
that generates the informative rule set directly, without first generating all frequent itemsets, and
they support their claims with a series of experiments showing that the informative rule set is
smaller and can be generated efficiently.
26. Diaz-Uriarte, R & Alvarez de Andres, S 2006, 'Gene selection and classification of
microarray data using random forest', BMC Bioinformatics, vol. 7, no. 3.
The authors first point out that most researchers try to identify the smallest possible set of genes
that can still achieve good predictive performance. Random forest is a classification algorithm well
suited to
microarray data, because it shows excellent performance even when most predictive variables are
noise, and it can be used when the number of variables is much larger than the number of
observations. The authors study the use of random forest for the classification of microarray data
and propose a new random-forest-based method of gene selection for classification problems. Using
simulated data and nine microarray data sets, they show that random forest has performance comparable
to other classification methods, including DLDA, KNN and SVM.
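A sketch of importance-based gene selection with a random forest (the general idea, not the paper's exact backward-elimination procedure):

```python
# Sketch: rank genes by random-forest variable importance, keep the top 50,
# and evaluate a forest retrained on the reduced gene set.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=100, n_features=2000, n_informative=20,
                           random_state=0)

rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
top = np.argsort(-rf.feature_importances_)[:50]  # 50 most important genes

slim = RandomForestClassifier(n_estimators=500, random_state=0)
print(cross_val_score(slim, X[:, top], y, cv=5).mean())
```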
27. Yang, YH, Dudoit, S, Luu, P, Lin, DM, Peng, V, Ngai, J & Speed, TP 2002,
'Normalization for cDNA microarray data: a robust composite method addressing single
and multiple slide systematic variation', Nucleic Acids Research, vol. 30, no. 4, p. e15.
In this paper, the authors first point out that in cDNA microarray experiments, systematic variation
often affects the measured gene expression levels; the term normalization refers to the process of
removing such variation. A common practice is to adjust the distribution of the intensity log-ratios
to have a median of zero for each slide. The authors propose normalization methods that are based on
robust local regression and account for intensity and spatial dependence in dye biases for different
types of cDNA microarray experiments. Finally, according to the authors, "a robust method based on
maximum likelihood estimation is proposed to adjust for scale differences among slides".
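The simplest form of this idea is median-centring each slide's log-ratios; a sketch on hypothetical slides (the published methods go further, with intensity-dependent lowess regression and spatial terms):

```python
# Sketch: per-slide median normalization of log-ratios.
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 3 slides x 1000 spots of log2(Cy5/Cy3) ratios,
# each slide carrying its own dye bias.
M = rng.normal(size=(3, 1000)) + np.array([[0.7], [-0.3], [0.2]])

M_normalized = M - np.median(M, axis=1, keepdims=True)
print(np.median(M, axis=1))             # per-slide biases
print(np.median(M_normalized, axis=1))  # ~0 after normalization
```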
28. Bozinov, D & Rahnenfuhrer, J 2002, Unsupervised technique for robust target separation
and analysis of DNA microarray spots through adaptive pixel clustering, Oxford Univ
Press, pp. 747-756.
Because of their characteristic imperfections, microarray images challenge existing analytical
methods: problems such as irregular contours, donut shapes and artifacts call for a new approach to
ensure accurate data extraction from these images. The authors introduce a novel method for the
intensity assessment of gene spots, based on clustering the pixels of a target area into foreground
and background. Two clustering algorithms, k-means and Partitioning Around Medoids (PAM), are used to
realize the new method, and the results show that PX(PAM) and PX(KMEANS) are highly robust compared
with various other approaches. According to the authors, the method is implemented as a combination
of two complementary tools, Extractiff (Java) and Pixclust.
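A toy version of the pixel-clustering step, using k-means on intensities of a synthetic spot image (PAM would require an additional package):

```python
# Sketch: split a spot's pixels into foreground/background by k-means
# on pixel intensity.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical 16x16 spot image: a bright disc on a dim background.
yy, xx = np.mgrid[0:16, 0:16]
spot = 100 + 20 * rng.random((16, 16))
spot[(yy - 8) ** 2 + (xx - 8) ** 2 < 25] += 400

pixels = spot.reshape(-1, 1)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(pixels)

# The cluster with the higher mean intensity is taken as foreground.
means = [pixels[labels == k].mean() for k in (0, 1)]
fg = labels == int(np.argmax(means))
print("foreground:", pixels[fg].mean(), "background:", pixels[~fg].mean())
```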
29. Liu, L, Hawkins, DM, Ghosh, S & Young, SS 2003, 'Robust Singular Value
Decomposition Analysis of Microarray Data', Proceedings of the National Academy of
Sciences of the United States of America, vol. 100, no. 23, pp. 13167-13172.
The authors are interested in developing a statistical technique to help discern possible patterns in
biological samples in microarray data. Their technique applies a combination of mathematical and
statistical methods to progressively take the data set apart, so that different aspects can be
examined for general patterns as well as specific effects. Because microarray data contain extreme
values (outliers), missing values and abnormal values, the authors develop a robust analysis method
to deal with these problems. The benefits of the method include the understanding of large-scale
shifts and the isolation of particular sample-by-gene effects.
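As a non-robust baseline of the decomposition, an ordinary SVD already separates an expression matrix into dominant patterns; the paper's contribution is a robust variant that tolerates outliers and missing values:

```python
# Sketch: plain SVD of a synthetic expression matrix with one strong
# sample-by-gene pattern.
import numpy as np

rng = np.random.default_rng(0)
u = rng.normal(size=(30, 1))        # 30 samples
v = rng.normal(size=(1, 200))       # 200 genes
X = 5 * u @ v + rng.normal(size=(30, 200))

U, s, Vt = np.linalg.svd(X, full_matrices=False)
rank1 = s[0] * np.outer(U[:, 0], Vt[0])  # the dominant sample-by-gene pattern
print("singular values:", s[:5].round(1))                  # first one dominates
print("variance explained:", (s[0]**2 / (s**2).sum()).round(2))
```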
30. Bolshakova, N, Azuaje, F & Cunningham, P 2005, An integrated tool for microarray
data clustering and cluster validity assessment, Oxford Univ Press, pp. 451-455.
The authors, who research at Trinity College Dublin, work on a data mining system that allows the
application of multiple clustering and cluster validity algorithms to DNA microarray data. This tool,
called the Machaon CVE system, can not only be used for clustering but also helps with the evaluation
of clustering schemes, that is, cluster validation. Five validation and two clustering techniques
have been implemented in the system. In future, it may improve the quality of dataset analysis
outcomes and support the prediction of the relevance of clusters in the microarray area. This
systematic evaluation approach should significantly aid genome expression analyses for knowledge
discovery applications, and its clustering and validation functionality may be used not only in DNA
microarray expression analysis applications but also for other biomedical and physical data.
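A sketch of the clustering-plus-validity pairing such a tool automates, using the silhouette index on synthetic expression profiles:

```python
# Sketch: cluster profiles for several k and score each clustering with
# a validity index (silhouette).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Hypothetical profiles: 90 genes x 20 conditions in 3 latent groups.
X, _ = make_blobs(n_samples=90, n_features=20, centers=3, random_state=0)

for k in (2, 3, 4, 5):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))  # highest near true k=3
```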
Appendix-A
Diagram (Babu, 2004b) of the process by which microarray slides are prepared from living cells. (A) A
microarray is usually a glass or polymer slide onto which DNA molecules are attached at fixed
locations called spots or features; each spot contains an oligonucleotide sequence or genomic DNA
that uniquely represents a gene. (B) Two groups of samples are placed under test condition A and
normal condition B, respectively. Messenger RNA (mRNA) is extracted from the cells, since it carries
all the gene information that will be expressed. After labelling with dyes, cDNA is generated.
Appendix-B
Diagram provided by Becquet et al. (2002).