UNIVERSITY OF SOUTH AUSTRALIA Assignment Cover Sheet – Internal An Assignment cover sheet needs to be included with each assignment. Please complete all details clearly. If you are submitting the assignment on paper, please staple this sheet to the front of each assignment. If you are submitting the assignment online, please ensure this cover sheet is included at the start of your document. (This is preferable to a separate attachment.) Please check your Course Information Booklet or contact your School Office for assignment submission locations. Name: Li Yi Student ID 1 0 0 0 5 0 3 3 3 Email: [email protected] Course code and title: CIS Research Methods INFT 4017 School: CIS Program code: LHCP Course Coordinator: Dr. Ivan Lee Tutor: Dr. Ivan Lee Day, Time, Location of Tutorial/Practical: 9:10am~11:00am Wednesday, GP1-09, MLK Due date: 15th June, 2008 Assignment number: Assignment 2a Assignment topic as stated in Course Information Booklet: Assignment 2a Further Information: (e.g. state if extension was granted and attach evidence of approval, Revised Submission Date) None I declare that the work contained in this assignment is my own, except where acknowledgement of sources is made. I authorise the University to test any work submitted by me, using text comparison software, for instances of plagiarism. I understand this will involve the University or its contractor copying my work and storing it on a database to be used in future to test work submitted by others. I understand that I can obtain further information on this matter at Note: The attachment of this statement on any electronically submitted assignments will be deemed to have the same authority as a signed statement. 
Date: 15th June, 2008 Signed: Li Yi Date received from student: Recorded: Assessment/grade: Assessed by: Dispatched (if applicable):

School of Computer and Information Science
University of South Australia

Building rule-based models in Microarray data
Research Proposal

By
Student: Yi LI
Student ID: 100050333
Program: LHCP
Supervisor: Associate Professor Jiuyong LI
Date: 15th June 2008

Disclaimer
I declare all of the following to be my own work, unless otherwise referenced, as defined by the University of South Australia’s policy on plagiarism.
Yi LI
Date: 15th June 2008

Table of Contents
1.0 Introduction
  1.1 Background
    1.1.1 Microarray Rationale
    1.1.2 Data Representation
  1.2 Research Motivation
    1.2.1 Current data preprocessing in Microarray data
    1.2.2 Research Question and hypothesis
2.0 Literature Review
  2.1 Literature review in Machine Learning
    2.1.1 Decision tree induction
    2.1.2 Ensemble Learning
    2.1.3 Bagging
    2.1.4 Boosting
    2.1.5 Random Forests
    2.1.6 Optimal Rule based Classifier
  2.2 Literature review in statistical fields
    2.2.1 Feature selection methods
    2.2.2 Discretization methods
3.0 Research Design
  3.1 Hardware and software
  3.2 Research Methodology
    3.2.1 Problem definition
    3.2.2 Research steps
  3.3 Expected outcomes
  3.4 Contributions
4.0 Timetable
5.0 Summary
6.0 Reference List
7.0 Bibliography
8.0 Appendices A
    Appendices B

1.0 Introduction
With the great achievements in genome sequencing technology, scientists now have the ability to identify most genes. However, sequencing on its own merely transfers DNA into an electronic format. Scientists hope to understand the functions of genes, especially how they relate to certain diseases. The development of DNA microarray technology makes it possible for scientists to take snapshots of gene expression in a single experiment.

Microarrays (Duggan, Bittner, Chen, Meltzer & Trent 1999; Cheung, Morley, Aguilar, Massimi, Kucherlapati & Childs 1999) are also referred to as DNA microarrays, DNA arrays, DNA chips, and gene chips (Causton, Quackenbush et al. 2003). Microarrays are effectively transforming a living cell from a black box into a transparent box. They allow scientists to identify the genes that are expressed in different cell types, and thus to study how their expression levels change in different developmental stages or disease states (Causton, Quackenbush et al. 2003).

1.1 Background
Microarray technology has been widely and successfully used to produce gene expression data that can reveal important information about how genes work and how they relate to certain diseases. However, transforming these data into knowledge is not easy. Advanced statistical tools and computational technologies are needed to analyse such huge amounts of complex data.

1.1.1 Microarray Rationale
There is another term often used in bioinformatics: gene expression data. The genetic information of cellular organisms is stored in a long sequence of four different deoxyribonucleotides (A, G, C and T). These nucleotides make up the DNA molecules that compose the genome of an organism (McLachlan, Do et al. 2004). The genome contains segments of DNA that encode genes. The information which uniquely identifies a living cell is transcribed from DNA into messenger RNA (mRNA), and then translated to build proteins. The whole process is called gene expression (Causton, Quackenbush et al. 2003). Figure 1.1 shows how DNA is used as a template to form proteins. A more detailed diagram is provided in Appendix A.

According to Causton et al (2003, p.3), since “the relationship between mRNAs and the genes that encode them can be readily be identified, based on the relationship between their sequences”, this property is exploited in microarray experiments, and is considered the rationale of microarray technology.

[Figure 1.1: The information transfer between DNA, mRNA and protein (Causton, Quackenbush et al. 2003). Labels in the diagram: DNA (replication, repair), transcription, mRNA (abundance detected using microarrays), translation, regulation of gene expression, protein (folding, structure, function), cell structure, metabolism.]

1.1.2 Data Representation
A microarray is typically a glass or polymer slide, onto which DNA molecules are attached at fixed locations called spots or features (Causton, Quackenbush et al. 2003). An array contains tens of thousands of spots, and each spot may in turn contain millions of copies of a gene sample. In practice, the spots are printed on the microarrays by a robot or jet.

As pointed out by Causton et al (2003, p.40), in glass slide DNA microarray experiments, mRNA from the cells and tissues of interest is used to generate first-strand complementary DNA (cDNA) labelled with spectrally distinguishable fluorescent dyes such as Cy3 and Cy5. Regardless of approach, the arrays are then scanned to generate images for the query and control samples, typically as 16-bit TIFF files. These images are subsequently analysed to identify the spots and to measure the fluorescence intensity of each feature (Causton, Quackenbush et al. 2003). Commercial software such as Affymetrix GeneChip™ provides tools to assist with this step. Once identification and intensity measurement are done, scientists usually carry out a series of normalization steps, such as linear regression and mean log centring, to bring the results into a similar format so that they can be compared effectively.

The whole process of a microarray experiment is usually divided into two stages: transformation of the raw data into a gene expression matrix, and analysis of the gene expression matrix. The term gene expression matrix is used in both stages.

Gene Expression Matrix
A gene expression matrix is a two-dimensional table of values. According to Causton et al (2003, p.71), rows in a gene expression matrix represent genes, columns represent experimental conditions or samples, and the value at each position in the matrix characterises the expression level of the particular gene under the particular experimental condition.
The expression levels of genes are represented either as absolute levels (in abstract units) or as relative values transformed into logarithms (Causton, Quackenbush et al. 2003). In the latter case, the original gene expression values are usually converted into ratios or log ratios, and the information about absolute values is lost (Causton, Quackenbush et al. 2003). For example, the ratios 400/200, 40/20, and 4/2 all give the same value. Most gene expression data analysis algorithms assume that each expression level is represented by a single numerical value.

Vector Space
Gene expression values can be represented as a set of corresponding vectors in a multidimensional space. The aim is to identify and measure the similarity and distance between genes effectively. Many feature selection algorithms require values to be in vector format.

1.2 Research Motivation
Once gene expression data have been represented as a gene expression matrix or in a vector space, many computational and statistical data analysis techniques can be applied to gain insight into the potential relationships hidden in the huge amount of gene expression data.

1.2.1 Current Data Preprocessing in Microarray Data
Transforming gene expression data into vector space format allows scientists to use data analysis methods and to visualise different data transformations in the respective vector space. Gathering, organising, and preparing data for statistical analysis is referred to collectively as preprocessing (Berrar, Dubitzky et al. 2003). Because of the nature of biological gene structure, microarray data pose two difficulties for scientists: high dimensionality and a small number of samples. Currently, researchers use feature selection methods to reduce the dimensions of gene expression data. These methods include the Pearson correlation coefficient (PCC), Chi-square, and mutual information.
Feature selection methods are reviewed in the next section.

As mentioned above, gene expression data are usually represented as a series of continuous numbers. This causes difficulty for two reasons. Firstly, some data mining methods, e.g. association rule mining, require data in a discretized form (Han et al, 2006). Secondly, categories are closer to a knowledge-level representation than continuous values are. A proper data discretization method is therefore needed if microarray data analysis is to produce understandable results.

A number of discretization methods have been applied to microarray data since the late 1990s. Examples are the “mid-ranged” method, the “max minus x%” method and the “x% of highest value” method. Like other discretization methods, these three rely on statistical parameters such as the mean, median, and standard deviation. Gene expression data are usually categorized as: i) under-expressed; ii) balanced; and iii) over-expressed. These three methods are reviewed in the next section.

The discretization methods currently in use are useful to some extent, and past researchers have used them to categorize gene expression data for further study. However, with the development of the technology and the increasing need for better medical care, the analysis of microarray data requires better insight. The limitations of current discretization methods have become the bottleneck in the development of microarray data analysis.

These limitations are various. In the “max minus x%” method, if the maximum value in the data set turns out to be noise, then the whole discretization will be wrong because it is based on a wrongly defined maximum value. The main limitation is that different methods may assign the same value to different categories. In Appendix B, the data points near the solid line are at risk of being categorized inconsistently by different methods.
If such inconsistently categorized data points are used to build models, they will cause errors in future predictions. An arbitrary discretization cut point may categorize points whose values lie around the boundaries more or less at random.

On the other hand, the data points far away from the boundaries are clearly categorized. If only the clearly categorized data points are used for building models, and the data points that fall between two distinct categories are excluded, the models should be more accurate. Thus, an optimal discretization method is needed to clearly identify uncertain values, which will be discarded when building models.

1.2.2 Research Question and Hypothesis
Based on the discussion above, a question is raised.

Research question: How to build a rule-based model in microarray data with proper data discretization?

Proposed hypothesis: A model built by discarding uncertain discretized values in microarray data will return better accuracy.

The next section reviews the machine learning methods used to build classification models. In addition, it reviews some statistical methods used for feature selection and for similarity comparison among gene vectors.

2.0 Literature Review
Microarray study is a young but rapidly maturing field, and it has attracted attention from different disciplines, such as biology, medicine, mathematics, statistics, machine learning and computer science. The potential complexities of microarray data are spawning a rich statistical and computational literature. Researchers from different areas have devoted themselves to extracting information from raw gene expression data. This section goes through all the methods involved in the proposed honours research work.
All methods are presented by summarising the papers published by their original developer(s), including Ross Quinlan (1986a) for decision tree induction, Quinlan (1986b) for decision tree simplification, Leo Breiman (1996) for bootstrap aggregating, Breiman (2001) for random forests, Y. Freund and R. E. Schapire (1997) for the boosting algorithm, J. Li (2006a) for the optimal rule based classifier, and Li (2006b) for robust rule based prediction.

2.1 Literature Review in Machine Learning
Transforming gene expression data into knowledge is not an easy task. It is largely a computational or statistical task. As a broad subfield of artificial intelligence, machine learning speeds up the learning and processing of microarray data. Machine learning, generally speaking, is the design and development of algorithms and techniques that allow computers to learn (Simon 1981). Several machine learning algorithms have been developed and widely used.

2.1.1 Decision Tree Induction Learning
Decision tree induction is a widely used machine learning algorithm for classification. It was first developed by J. R. Quinlan of the University of Sydney, Australia, who published a series of papers on decision tree algorithms from the early 1980s to 2002. Detailed information about his publications on decision tree induction and machine learning is listed in the annotated bibliography.

[Figure 2.1: The TDIDT family (J.R. Quinlan, 1986)]

Quinlan first developed the ID3 algorithm for decision tree construction. As techniques developed and the need for high-performance classification algorithms increased, decision tree construction algorithms were upgraded to handle more complex problems. The most widely used algorithms are ID3, ASSISTANT and C4.5. The Top-Down Induction of Decision Trees (TDIDT) family (shown in Figure 2.1) contains the decision tree construction algorithms developed since 1963.
According to Quinlan (1986a), a decision tree is constructed in a top-down, recursive, divide-and-conquer manner. It classifies examples (instances) by sorting them down the tree from the root to some leaf node, which provides the classification of the instance. Each node in the tree specifies an attribute, and each branch descending from that node corresponds to one of the possible values of this attribute (Quinlan 1986b). An instance is classified by starting at the root of the tree, evaluating the attribute specified by this node, and then moving down the branch corresponding to the value of that attribute in the instance. In addition, attributes in the tree are in theory categorical. If the values of an attribute are continuous, they should be discretized in advance (Han and Kamber 2006). When construction finishes, no instance is left out and no remaining attribute requires further partitioning (Han and Kamber 2006).

There are many measures that can be used to select the best splitting attribute for each node. Examples of impurity measures include (Quinlan 1986b):

$$\text{Entropy}(t) = -\sum_{i=0}^{c-1} p(i \mid t)\,\log_2 p(i \mid t)$$

$$\text{Gini}(t) = 1 - \sum_{i=0}^{c-1} \left[p(i \mid t)\right]^2$$

$$\text{Classification error}(t) = 1 - \max_i \left[p(i \mid t)\right]$$

where $c$ is the number of classes, $p(i \mid t)$ is the fraction of records at node $t$ that belong to class $i$, and $0 \log_2 0 = 0$ in entropy calculations. Related selection criteria, such as information gain and gain ratio, are built on these measures. Impurity measures are used to score each candidate attribute. The attribute whose split gives the largest reduction in impurity is selected as the best splitter to build a child node.

Decision tree construction algorithms have natural concurrency. Once a node is generated, all of its children in the decision tree can be generated concurrently (Srivastava, Han et al. 1999). A decision tree can be constructed quickly, so it does not consume many computational resources.
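The three impurity measures above, and their use in scoring a candidate split, can be illustrated with a minimal Python sketch. The function names (`entropy`, `gini`, `classification_error`, `class_probabilities`, `impurity_reduction`) are hypothetical names chosen for this illustration; they are not taken from Weka or from any of the cited papers.

```python
import math


def entropy(p):
    """Entropy of a list of class probabilities, taking 0*log2(0) as 0."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def gini(p):
    """Gini index of a list of class probabilities."""
    return 1.0 - sum(x * x for x in p)


def classification_error(p):
    """Classification error of a list of class probabilities."""
    return 1.0 - max(p)


def class_probabilities(labels):
    """Fraction of records in each class, p(i|t), for the labels at one node."""
    n = len(labels)
    return [labels.count(c) / n for c in sorted(set(labels))]


def impurity_reduction(parent_labels, child_partitions, measure=entropy):
    """Parent impurity minus the size-weighted impurity of the children.

    With measure=entropy this is the information gain of the split.
    """
    total = len(parent_labels)
    children = sum(
        len(part) / total * measure(class_probabilities(part))
        for part in child_partitions
    )
    return measure(class_probabilities(parent_labels)) - children


if __name__ == "__main__":
    parent = ["cancer", "cancer", "normal", "normal"]
    pure_split = [["cancer", "cancer"], ["normal", "normal"]]
    mixed_split = [["cancer", "normal"], ["cancer", "normal"]]
    print(impurity_reduction(parent, pure_split))   # split separates the classes
    print(impurity_reduction(parent, mixed_split))  # split separates nothing
```

In this toy example the split that separates the two classes completely removes all impurity, while the split that leaves both children mixed removes none, which is why the first attribute would be preferred as the splitter.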

A decision tree is a non-parametric approach for building classification models (Tan, Steinbach et al. 2005). In other words, it does not require any prior assumptions regarding the type of probability distributions satisfied by the class and other attributes (Tan, Steinbach et al. 2005). From a technical point of view, building a decision tree is computationally inexpensive. This makes it possible to build a decision tree quickly for a large data set such as gene expression data.

The output of J48 (Weka's implementation of C4.5) is tree-structured. Branches stand for conditions, nodes stand for attributes, and leaves are class labels. The tree is interpretable, and a decision tree can be re-expressed as sets of if-then rules to improve human readability (Quinlan, 1986b).

According to Tan, Steinbach et al (2005, p.169), decision tree algorithms are quite robust to the presence of noise. They include methods that address the overfitting problem by pruning the parts of the tree that were constructed from noisy values.

Decision tree algorithms have been extensively tested in the past few years. However, they still have limitations, for various reasons. Firstly, the shape of the tree is highly irregular, and is determined only at runtime (Srivastava, Han et al. 1999). Moreover, the size of the decision tree is not necessarily associated with the size of the data set. For example, a microarray data set usually contains thousands of attributes, yet a cancer diagnosis is determined by fewer than ten of them. Under the J48 scheme, Weka weighs the entropy of each attribute to select the best splitter. Another option is to apply a data preprocessing filter so that Weka performs feature selection before it starts building the model. Whichever method is used, the tree constructed is small, because fewer than ten critical attributes are picked up by Weka.
In comparison, suppose a data set contains only 20 attributes and most of them are critical in determining the splitters. The decision tree built for this data set will be large (even after pruning). The reason is that each node is decided by the impurity measures (i.e. entropy and Gini index). The more critical an attribute is, the more likely it is to be selected as a splitter. Secondly, the amount of work associated with each node also varies, and is data dependent (Srivastava, Han et al. 1999). Hence any static allocation scheme is likely to suffer from major load imbalance.

2.1.2 Ensemble learning
“Ensemble learning is a machine learning paradigm where multiple learners are trained to solve the same problem” (Z.H.Zhou 2008). Unlike ordinary single-learner approaches, which try to obtain one hypothesis from the training data, ensemble learners try to build a set of hypotheses and combine them to get better prediction results. To achieve high prediction performance, many researchers have started to combine several classifiers for ensemble learning. Leo Breiman from the Statistics Department, University of California, Berkeley proposed the concept of ensemble learning in 1996. The idea of ensemble learning is shown in Figure 2.2. Base learners are usually generated from the training data by a base learning algorithm, which can be a decision tree, a neural network or another kind of machine learning algorithm.

Basically, an ensemble can be constructed in three steps. Firstly, a number of data sets are created from the original data set. Secondly, a number of base classifiers are produced, either in parallel or sequentially; in the sequential style, the generation of a base classifier influences the generation of subsequent learners (Z.H.Zhou 2008). Thirdly, the base classifiers are combined for use.

[Figure 2.2: Diagram of ensembles (Tan, Steinbach et al. 2005). Step 1: create multiple data sets $D_1, D_2, \ldots, D_{t-1}, D_t$ from the original training data $D$. Step 2: build multiple classifiers $C_1, C_2, \ldots, C_{t-1}, C_t$. Step 3: combine the classifiers into $C^*$.]

The error rate of an ensemble learner depends on the error rates of all its base learners. According to Breiman (1996), an ensemble learner is workable if the base learners are independent of each other and their error rates are lower than 0.5. For example, if an ensemble has 25 independent base learners, each with an error rate of $\varepsilon = 0.35$, then the ensemble predicts wrongly only when more than half of the base learners are wrong, so its error rate is:

$$e_{\text{ensemble}} = \sum_{i=13}^{25} \binom{25}{i}\, \varepsilon^{i} (1-\varepsilon)^{25-i} = 0.06$$

This example is cited from Tan et al (2005, p.274). On one hand, it illustrates how the idea of ensemble learning is motivated. On the other hand, it shows that when the base error rate is higher than 0.5, the ensemble learner accumulates the errors and performs worse than a single learner.

2.1.3 Bagging
Bagging, also called bootstrap aggregating, is based on bootstrap sampling and was first developed by Leo Breiman in 1996. Breiman (1996) points out that bootstrap sampling is well known as a method for estimating standard errors and bias, and for constructing confidence intervals for parameters. A bagging sample is obtained by sampling the training data set with replacement, where the size of each sample is the same as that of the training data set (Breiman, 1996). Nothing is added to or replaced in the original data set. Records are simply drawn from the original data set and placed in the new sub data set. In the process of generating sub data sets, some records may appear more than once, while others may not appear at all. According to Breiman (1996), each record has probability $1 - (1 - 1/n)^n$ of being selected into a given sample of size $n$; for large $n$, this probability that a record appears at least once approaches 0.632. Averaging or majority voting is used to combine all classifiers, and the class with the most votes is predicted (Breiman, 1996).
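The two figures quoted in Sections 2.1.2 and 2.1.3 (an ensemble error of about 0.06, and a bootstrap inclusion probability of about 0.632) can be checked numerically. The following is a minimal sketch under the same assumptions as the text: base learners are independent, share one error rate, and are combined by simple majority vote. The function names are hypothetical and chosen only for this illustration.

```python
from math import comb


def majority_vote_error(n_learners, base_error):
    """Probability that more than half of n independent base learners are wrong."""
    first_losing_count = n_learners // 2 + 1
    return sum(
        comb(n_learners, i) * base_error ** i * (1 - base_error) ** (n_learners - i)
        for i in range(first_losing_count, n_learners + 1)
    )


def bootstrap_inclusion_probability(n):
    """Probability that a given record appears at least once in a bootstrap
    sample of size n drawn with replacement from n records."""
    return 1 - (1 - 1 / n) ** n


if __name__ == "__main__":
    # 25 independent base learners, each wrong 35% of the time.
    print(round(majority_vote_error(25, 0.35), 3))
    # The same ensemble when each base learner is worse than random guessing.
    print(round(majority_vote_error(25, 0.60), 3))
    # Inclusion probability for increasingly large training sets.
    for n in (10, 100, 10000):
        print(n, round(bootstrap_inclusion_probability(n), 4))
```

Running the sketch with a base error above 0.5 shows the ensemble error rising above the base error, which is the situation the text warns about.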

2.1.4 Boosting
Boosting is another widely used ensemble learning algorithm. It was first developed by Y. Freund and R. E. Schapire of AT&T Labs in 1996. Boosting is an iterative procedure that adaptively changes the distribution of the training data by focusing more on previously misclassified records (Freund and Schapire 1997). The main idea of the boosting algorithm is to increase the weights of wrongly predicted records, so that they have a greater chance of being selected in the subsequent round. It aims at boosting weak learners into strong learners by improving their accuracy in each round.

Among the various boosting algorithms, AdaBoost is the most widely used. The algorithm can be summarized as follows. Firstly, all $N$ records are assigned equal weights of $1/N$. Unlike in bagging, the weights in the boosting algorithm may change at the end of each round. Then, from the training data set $D_t$, the algorithm generates a base learner $h_t : X \to Y$. Next, it uses the training examples to test $h_t$. Records that are classified correctly have their weights decreased, while records that are wrongly classified have their weights increased, by:

$$w_i^{(j+1)} = \frac{w_i^{(j)}}{Z_j} \times \begin{cases} \exp(-\alpha_j) & \text{if } C_j(x_i) = y_i \\ \exp(\alpha_j) & \text{if } C_j(x_i) \neq y_i \end{cases}$$

where $i$ is the index of the record, $j$ is the index of the round, $C_j$ is the base learner of round $j$ (written $h_t$ above), $\alpha_j$ is the importance of $C_j$, and $Z_j$ is the normalization factor (Freund & Schapire 1997). Thus, an updated weight distribution $D_{t+1}$ is generated. The learner is then trained and tested again on the data set with the updated weights. This process is repeated $T$ times, each repetition being called a round.

2.1.5 Random Forests
Random Forests is a powerful approach to data exploration, data analysis and predictive modelling. It was first developed in 2001 by Leo Breiman, the father of CART (Classification and Regression Trees), at the University of California, Berkeley. It has its roots in CART, learning ensembles, and committees of experts.
Random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently, and with the same distribution, for all trees in the forest (Breiman, 2001). The error rate of the forest converges to a limit as the number of trees becomes large. Each tree in the forest is grown independently using random binary splits. The accuracy of a forest depends on the strength of the individual trees in the forest and the correlation between them (Breiman, 2001). Random Forests is robust with respect to noise.

2.1.6 Optimal Rule based Classifier
The Optimal Rule based Classifier (ORC) was first developed by J. Li at the University of Southern Queensland, Australia, in 2006. It was motivated by the need to cope with the large number of useless rules generated by ordinary association rule algorithms. ORC is an efficient alternative to association rule discovery, especially when the minimum support is low (Li 2006a).

Li (2006a) defines a rule whose support and interestingness are both not less than given thresholds as a strong implication. Li (2006a) then defines the general and specific relationships as follows: given two rules $P \to c$ and $Q \to c$, where $P \subset Q$, the latter is more specific than the former and the former is more general than the latter.

The basic idea of ORC is that removing a specific rule from a rule set does not reduce the total coverage of the rule set. The rule set with all uninteresting, or more specific, rules removed is an optimal rule set.

Li (2006a) extends the definition by presenting his algorithm for optimal rule discovery (ORD) and optimality pruning. The main contribution of the Optimal Rule based Classifier is that it is significantly more efficient than association rule discovery, independent of data structure and implementation (Li, 2006a).
In addition, it generates accurate rules that all satisfy certain constraints, even when the minimum support is low.

2.2 Literature Review in Statistical field
As many researchers and scientists use statistical methods for data preprocessing in microarray studies, this section gives a brief review of the literature on data preprocessing methods, especially those used to discretize continuous attributes.

2.2.1 Feature Selection methods
Although the research experiments do not ultimately aim at comparing feature selection methods, the experiments cannot skip them, because feature selection provides effective dimension reduction in gene expression data and speeds up the data analysis process. In this section, the literature on feature selection is briefly reviewed.

Pearson Correlation Coefficient
According to Causton (2003, p.85), the Pearson correlation coefficient (PCC) is a widely used approach for exploring the similarity of genes and reducing the dimensions of gene expression data. PCC first calculates the mean value of each gene expression profile. Next, it shifts each profile down by mean centring. Finally, PCC is calculated as the cosine of the angle between the mean-centred profiles.

Consider two expression profiles $A$ and $B$ with three samples in each. They can be represented in a three-dimensional space as $A = (a_1, a_2, a_3)$ and $B = (b_1, b_2, b_3)$.

Step 1: calculate the means.

$$\bar{a} = (a_1 + a_2 + a_3)/3 \quad\text{and}\quad \bar{b} = (b_1 + b_2 + b_3)/3$$

Step 2: mean centring.

$$A' = (a_1 - \bar{a},\; a_2 - \bar{a},\; a_3 - \bar{a}) \quad\text{and}\quad B' = (b_1 - \bar{b},\; b_2 - \bar{b},\; b_3 - \bar{b})$$

Step 3: get PCC.

$$\text{PCC} = \frac{A' \cdot B'}{|A'|\,|B'|} \quad\text{where}\quad A' \cdot B' = \sum_{i=1}^{n} (a_i - \bar{a})(b_i - \bar{b})$$

In $n$-dimensional space, the covariance is calculated as below to reduce dimensions:

$$\text{Cov}(A, B) = \frac{A' \cdot B'}{n - 1}$$

2.2.2 Discretization methods
Many papers published in the past few years give overviews of the discretization techniques used for gene expression data. According to Becquet et al (2002), each quantitative value in gene expression data gives rise to one Boolean value, that is, true (1) or false (0).
Becquet et al (2002) proposed three different discretization procedures: the "max minus x%", "mid-ranged", and "x% of highest value" approaches. Assume a given expression value is denoted as d.

Max minus x%

The "max minus x%" method consists of identifying the maximum expression value (MaxValue). It assigns a value of 1 when the expression value is greater than MaxValue reduced by x% of MaxValue. Otherwise the expression is assigned a value of 0. The default value for x is 25.

$$v = \begin{cases} 1 & \text{if } d > \mathrm{MaxValue} \times (100 - 25)/100 \\ 0 & \text{if } d \le \mathrm{MaxValue} \times (100 - 25)/100 \end{cases}$$

Mid-Ranged

Becquet et al (2002) also analyse the effect of a "mid-ranged" cut-off approach. This approach involves identifying the maximum (MaxValue) and minimum (MinValue) values in the expression data. The mid-range value is equidistant from these two numbers, that is, their arithmetic mean (Becquet et al, 2002). All expression values below or equal to the mid-range value are set to 0, and all values strictly above the mid-range are set to 1.

$$v = \begin{cases} 1 & \text{if } d > (\mathrm{MaxValue} + \mathrm{MinValue})/2 \\ 0 & \text{if } d \le (\mathrm{MaxValue} + \mathrm{MinValue})/2 \end{cases}$$

X% of highest value

The "x% of highest value" approach involves identifying the highest 5% of the whole data set. The expression values in that interval are assigned the value 1 and the rest are set to 0. In the following formula, the lowest value in the highest 5% of the whole data set is denoted $\theta$.

$$v = \begin{cases} 1 & \text{if } d \ge \theta \\ 0 & \text{if } d < \theta \end{cases}$$

3.0 Research Design

This section goes through the research methodology that will guide the whole experimental process, including data collection, pre-processing, conducting experiments and re-testing.

3.1 Hardware and Software

This research does not require any special equipment. All methods will be run on a PC running Red Hat Linux 8.0 or Windows.
The hardware and software required for conducting the research experiments are:
- A Linux server with an Intel Core 2 Duo 3 GHz processor, 4 GB of memory and a 500 GB hard disk, plus a 500 GB external hard disk
- Ubuntu version 8.04
- OpenSSH Secure Shell client, including the secure file transfer client
- The R programming language, or Matlab

For the network connection, the current fibre optic connection at the UniSA Mawson Lakes campus provides sufficient bandwidth for file transmission and uploading.

3.2 Research Methodology

A positivist qualitative methodology has been selected to achieve the major outcomes of this research thesis. The methodology focuses on verifying the proposed hypothesis, which is "a model built by discarding uncertain discretized values in microarray data will return better accuracy". The hypothesis aims to answer the research question "How to build a rule based model in microarray data with proper data discretization?"

3.2.1 Problem Definition

The review of past papers on discretization in microarray data identified the existing shortcomings and limitations in discretizing continuous values in large-scale microarray data. Since the performance and accuracy of existing discretization methods are still under debate, an optimal or new discretization method designed particularly for microarray data is needed to obtain understandable results.

Current discretization methods may assign some data points to two different categories. The proposed research methodology is to combine the existing methods in order to identify the data points on which they disagree. The proposed method then overcomes the problem by clearly denoting over-expressed and under-expressed values and leaving the uncertain points alone. If possible, the proposed method will be simplified and extended.

The purpose of this project is to compare differences among different discretization methods.
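One possible reading of the idea in Section 3.2.1 is a three-state discretization that keeps a value only when several existing cut-offs agree on it. The sketch below is an assumption made for illustration, not the finished method of this proposal: the function name `consensus_discretize`, the `UNCERTAIN` marker and the example cut-offs are all hypothetical.

```python
UNCERTAIN = None  # hypothetical marker for values the cut-offs disagree on


def consensus_discretize(values, cutoffs):
    """Label each value 1 (over-expressed), 0 (under-expressed) or UNCERTAIN.

    `cutoffs` holds the thresholds produced by several existing
    discretization methods. A value is labelled 1 or 0 only when every
    threshold agrees; otherwise it is left as UNCERTAIN.
    """
    if not cutoffs:
        raise ValueError("at least one cut-off is required")
    labels = []
    for d in values:
        votes = [1 if d > c else 0 for c in cutoffs]
        if all(v == 1 for v in votes):
            labels.append(1)
        elif all(v == 0 for v in votes):
            labels.append(0)
        else:
            labels.append(UNCERTAIN)
    return labels


if __name__ == "__main__":
    expression = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
    # Example thresholds: mid-range (55) and max minus 25% (75).
    print(consensus_discretize(expression, [55.0, 75.0]))
```

In this sketch the values 60 and 70 lie between the two thresholds and are left uncertain, so a model could be built from the remaining, clearly labelled values only.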
Thus the bias caused by different classification models and feature selection methods is outside the scope of this project.

3.2.2 Research Steps

This honours project consists of the following steps:

1. Collect 4~5 microarray data sets. Most microarray data sets are not open to the public, and some are available for research purposes only. The quality of the data sets will affect the experimental results. Therefore, the microarray data sets should be collected from recognised universities or research institutions. For example, the School of Medicine at Stanford University has done remarkable work in lung cancer and liver tumour studies.

2. Data preprocessing. This is the most important and time-consuming step in this project. In the data preprocessing step, the following work needs to be done:
- Data integration and format normalization: data sets that describe the same domain of information but were obtained from different sources may vary in format and attribute names. The various data sets should first be integrated according to unified attribute names. This involves transforming attribute names from probe names to a unified list of gene names.
- Replicated values removal: duplicated values should be removed to reduce data redundancy. The removal of duplicated data will not reduce the coverage of the data.
- Missing values handling: missing values in a data set should be handled, either by removing them or by other methods, such as replacing them with the mean of the other values or with the most frequently occurring value. Missing values will affect the experimental results, since some data mining techniques are not robust in the presence of missing values.
Apart from those, the following are required:
- Discretization: this is what needs to be figured out. Which discretization method is the most suitable one for microarray data analysis?
- Feature selection, i.e. attribute subset selection.
  It aims to select a minimum set of features such that the probability distribution of the different classes given the values of those features is as close as possible to the original distribution (Han et al, 2006).

3. Select classification algorithms to build models. As this project does not compare the performance of classification models, it only needs to select the most commonly used classification algorithms, such as C4.5 and ORC (Li, 2006a).

4. Obtain and analyse the experimental results. This step has not been reached so far. According to the methodology, once all experimental results are available, they will be ranked for each discretization method. Ranking the methods by accuracy rate allows their performance to be compared and determines which discretization method is the best.

5. Conclude the project. This step consists mainly of thesis writing and finalizing the experiments.

3.3 Expected Outcome

As mentioned in the previous section, the expected outcomes of the project include:
- A proposed discretization method, including algorithms, code, and explanations
- A hypothesis verified on the basis of the research
- A written thesis

3.4 Contributions

The proposed discretization method will not only be a useful and powerful approach for discretizing data in microarray data analysis, but will also allow more data mining methods to be used on microarray data, e.g. association rules and tree-based models. In addition, the proposed discretization method is expected to enhance the accuracy rate of the models that use it, and thus to improve the performance of classification and prediction for some diseases. Last but not least, the enhanced performance of classification models may contribute to medical and cancer diagnosis.

3.5 Limitations

Every newly proposed algorithm requires tens of thousands of tests and evaluations.
Due to the restrictions of time and resources, this project may have the following limitations:
- The data sets collected may not be representative of the whole population. Microarray slides are obtained under certain experimental conditions, and one microarray sample is obtained from one particular condition. That particular microarray slide cannot stand for the whole range of gene expression results. If an experiment draws a conclusion based only on that particular microarray slide, the result may be due to random chance, since the sample is not big enough. In this project, only a few microarray data sets are available. A result drawn from an experiment with a small number of samples is not a convincing result or hypothesis.
- The differences and bias caused by choosing different classification algorithms and feature selection methods will affect the result to a certain degree. However, this cannot be overcome due to resource limitations.

4.0 Time Table

The time plan for the research project in the remaining time is summarized as follows:

No.  Task name          Duration  Start date  Expected finish date  Comments
1    Research proposal  2 wks     1 June 08   15 June 08
2    Data collection    0.5 wks   1 July 08   5 July 08
3    Pre-processing     2.5 wks   8 July 08   25 July 08            Done. Tasks 3~5 may be done recursively
4    Data analysing     4 wks     26 July 08  25 Aug 08
5    Running            6 wks     27 Aug 08   8 Oct 08
6    Re-testing         1.5 wks   9 Oct 08    16 Oct 08
7    Thesis writing     2 wks     17 Oct 08   25 Oct 08             In parallel with other tasks

Late June is revision time for the exams of Study Period 2, 2008. The dates from July to October 2008 do not count public holidays and weekends. The assumed workload is 6 hours/day. All files and data must be backed up regularly to guard against human mistakes, equipment failure and other unpredictable circumstances, and to prevent project delays caused by information loss.
Weekly meetings will be held with the supervisor or research colleagues. This time table may be subject to adjustment.

5.0 Summary

This project aims to build rule based models in microarray data. Microarray technology has made it possible for biologists to effectively extract similar patterns of genes, and to cluster genes with similar functionalities and structures. Scientists hope to identify how genes are expressed in different cell types, and thus to find out the relationship between a cell's development stages and disease states. However, transforming this large amount of data into knowledge is not an easy task. With the help of data mining technologies, scientists can find relationships and patterns in gene expression data, especially those which are signatures of certain diseases. In addition, they can classify genes into classes using a classification model, and make predictions for new instances.

In analyzing microarray data, data preprocessing is very important. Gene expression data has high dimensionality and a small number of samples. Scientists usually use feature selection methods to reduce the dimensions. Gene expression data is usually represented as continuous numbers. Some data mining methods, e.g. association rule mining, require the data in a discretized format. Moreover, categorical values are more understandable than continuous values. For those two reasons, data discretization is required in microarray data analysis.

Currently used discretization methods have many limitations. For example, some data points near a boundary are assigned to different categories by different methods. If scientists use all data points to build models, those values will increase the error rate of the model. A hypothesis is therefore proposed: a model built by discarding uncertain discretized values in microarray data will return better accuracy. A series of experiments will be conducted to verify this hypothesis.
4~5 microarray data sets will be collected and a few feature selection methods will be implemented to conduct experiments that test whether the hypothesis is correct.

Reference

Babu, M. M. 2004a, "Introduction to microarray data analysis." Computational Genomics: Theory and Application, ed. by R. P. Grant (Horizon Bioscience, Norwich).
Babu, M. M. 2004b, "An introduction to Microarray data analysis", MRC Lab page, visited on 15 June 2008, <http://www.mrc-lmb.cam.ac.uk/genomes/madanm/microarray/>
Becquet, C., S. Blachon, et al. (2002). "Strong-association-rule mining for large-scale gene-expression data analysis: a case study on human SAGE data." Genome Biology 3(12): 0067.1-0067.16.
Berrar, D. P., W. Dubitzky, et al. (2003). A practical approach to microarray data analysis. Boston; London, Kluwer Academic.
Breiman, L. (1996). "Bagging Predictors." Machine Learning 24(2): 123-140.
Breiman, L. (2001). "Random Forests." Machine Learning 45(1): 5-32.
Causton, H. C., J. Quackenbush, et al. (2003). Microarray gene expression data analysis: a beginner's guide. Malden, MA, Blackwell Pub.
Cheung, V. G., M. Morley, F. Aguilar, A. Massimi, R. Kucherlapati and G. Childs (1999). "Making and reading microarrays." Nature Genetics 21: 15-19.
Freund, Y. and R. E. Schapire (1997). "A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting." Journal of Computer and System Sciences 55(1): 119-139.
Han, J. and M. Kamber (2006). Data Mining: Concepts and Techniques, Morgan Kaufmann.
Li, J. (2006a). "On optimal rule discovery." IEEE Transactions on Knowledge and Data Engineering 18(4): 460-471.
Li, J. (2006b). "Robust rule based prediction." IEEE Transactions on Knowledge and Data Engineering 18(8): 1043-1054.
Madeira, S. C. and A. L. Oliveira (2005). An Evaluation of Discretization Methods for Non-Supervised Analysis of Time-Series Gene Expression Data, Technical Report 42, INESC-ID, December 2005.
McLachlan, G. J., K. A. Do, et al. (2004). Analyzing microarray gene expression data. Hoboken, N.J., Wiley-Interscience.
Pensa, R. G., C. Leschi, et al. (2004). "Assessment of discretization techniques for relevant pattern discovery from gene expression data." Proceedings ACM BIOKDD: 24-30.
Quinlan, J. R. (1986a). Simplifying Decision Trees, Massachusetts Institute of Technology.
Quinlan, J. R. (1986b). "Induction of Decision Trees." Machine Learning 1(1): 81-106.
Simon, H. (1981). The Sciences of the Artificial. Cambridge, MA, MIT Press.
Srivastava, A., E. H. Han, et al. (1999). "Parallel Formulations of Decision-Tree Classification Algorithms." Data Mining and Knowledge Discovery 3(3): 237-261.
Tan, P. N., M. Steinbach, et al. (2005). Introduction to Data Mining, Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA.
Zhou, Z. H. (2008). "Ensemble learning." Encyclopedia of Database Systems.

Bibliography

1. Boulesteix, A. L., G. Tutz, et al. (2003). A CART-based approach to discover emerging patterns in microarray data, Oxford Univ Press. 19: 2465-2472.
The authors, researchers at the University of Munich, Germany, have been working on cancer diagnosis. Cancer diagnosis uses gene expression profiles, which require supervised learning and gene selection methods. After trying many suggested approaches, the authors find that the method of emerging patterns (EPs) has the particular advantage of explicitly modelling interactions among genes, which may improve classification accuracy. They introduce a CART-based approach to discover EPs in microarray data. This tree-based method is computationally fast and intuitive, and also assigns statistical relevance to the identified patterns.
The authors assess the performance of their pattern search algorithm and classification procedure on simulated data and on gene expression data from colon and leukemia cancer experiments. They conclude that the new approach provides a versatile and computationally fast tool for elucidating local gene interactions and for classification.

2. Kluger, Y., R. Basri, et al. (2003). "Spectral Biclustering of Microarray Data: Coclustering Genes and Conditions." Genome Research 13(4): 703.
The authors, researchers at Yale University, USA, observe that the classification problems in DNA expression data are linked, and that researchers are usually more interested in finding "marker genes" that are differentially expressed in particular sets of conditions. The authors have developed a method that simultaneously clusters genes and conditions, finding distinctive "checkerboard" patterns in matrices of gene expression data. The method introduced in this paper, spectral bi-clustering, is based on the observation that checkerboard structures in expression data can be found in eigenvectors corresponding to characteristic expression patterns across genes or conditions. The authors first apply the singular value decomposition (SVD), coupled with integrated data normalization techniques. Spectral bi-clustering is then applied to publicly available cancer expression data sets in order to examine the degree to which the approach is able to identify checkerboard (marker data) structures.

3. McShane, L. M., M. D. Radmacher, et al. (2002). Methods for assessing reproducibility of clustering patterns observed in analyses of microarray data, Oxford Univ Press. 18: 1462-1469.
The authors, researchers at the Biometric Research Branch of the National Cancer Institute, USA, note that cDNA microarray technology has made it possible to simultaneously interrogate thousands of genes in a biological specimen.
However, according to their experiments, clustering algorithms always detect clusters, even on random data, and it is easy to misinterpret the results without some objective measure of the reproducibility of the clusters. To address this, the authors present a series of statistical methods for testing for overall clustering of gene expression profiles, and define interpretable measures of cluster-specific reproducibility that facilitate understanding of the cluster structure. The authors then apply these methods to elucidate structure in cDNA microarray gene expression profiles obtained on melanoma tumors and on prostate specimens.

4. Tuzhilin, A. and G. Adomavicius (2002). "Handling very large numbers of association rules in the analysis of microarray data." Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining: 396-404.
In this paper, the authors propose to use association rule discovery methods for determining associations among the expression levels of different genes. In their research, one of the main problems in discovery is scalability. Microarray data sets are usually very large, so their analysis may generate a number of associations that can often be measured in millions. Moreover, running the algorithms on such a large amount of data may take a long time, measured in weeks at least. The authors propose a method that enables biologists to evaluate these very large numbers of discovered association rules during data mining. This is achieved by providing several rule evaluation operators, e.g. rule grouping, filtering, and browsing, which allow biologists to validate multiple individual gene regulation patterns at the same time.

5. Wilson, D. L., M. J. Buckley, et al. (2003). New normalization methods for cDNA microarray data, Oxford Univ Press. 19: 1325-1332.
In this paper, the authors present two new normalization methods for cDNA microarrays. After image analysis, some sort of normalization must be applied to the microarrays before differentially expressed genes can be detected. According to their study, normalization removes biases towards one or other of the fluorescent dyes used to label each mRNA sample, which allows for proper evaluation of differential gene expression. The outcome of their study is that they extend non-linear normalization techniques, first by introducing a normalization method that deals with smooth spatial trends in intensity across microarrays, and then by dealing with the normalization of a new type of cDNA microarray experiment.

6. Li, J., X. Tang, et al. (2008). "A novel approach to feature extraction from classification models based on information gene pairs." Pattern Recognition 41(6): 1975-1984.
The authors, who conduct a series of experiments on DNA microarray data with data mining technologies, find that one of the major challenges of analysing microarray data is how to extract and select effective features from it for accurate cancer classification. To improve accuracy, the authors introduce a new feature extraction and selection method. This method is based on information gene pairs, and the authors use five public microarray data sets to demonstrate that the feature subset selected by the proposed method performs as expected. After comparing with the results generated by other methods, they confirm that the new method can improve the accuracy of cancer prediction, including breast cancer, adenocarcinoma and myeloid leukemia.

7. Gregory, P.-S. and T. Pablo (2003). "Microarray data mining: facing the challenges." SIGKDD Explor. Newsl. 5(2): 1-5.
The authors first acknowledge that microarrays are a revolutionary new technology with great potential to provide accurate medical diagnostics, to help find the right treatment and cure for many diseases, and to provide a detailed genome-wide molecular portrait of cellular states. However, the authors also point out several problems and issues where microarray analysis can be improved. For example, current methods have problems in gene selection, classification, clustering and visualization. They argue that one important goal of current and future computational analysis methods, short of reverse engineering the entire cell circuitry, should be to reduce the search and help expose the most promising candidates, i.e. genes, proteins, drugs etc. Although current methods are already successful, better accuracy and more robust models and estimators are welcome.

8. Hong, H., L. Jiuyong, et al. (2006). A comparative study of classification methods for microarray data analysis. Proceedings of the fifth Australasian conference on Data mining and analystics - Volume 61. Sydney, Australia, Australian Computer Society, Inc.
The authors, researchers at the University of Southern Queensland, conduct a series of experiments on DNA microarray data using ten-fold cross validation tests. In order to compare currently used methods, such as SVMs, decision trees, Bagging, Boosting and Random Forest, they use LibSVMs, C4.5, BaggingC4.5, AdaBoostingC4.5, and Random Forest on seven microarray cancer data sets. The results indicate that all ensemble methods outperform C4.5. The classification accuracy of all five methods benefits from data pre-processing, including gene selection and discretization.

9. Chaolin, Z., L. Xuesong, et al. (2006). "Significance of Gene Ranking for Classification of Microarray Samples." IEEE/ACM Trans. Comput. Biol. Bioinformatics 3(3): 312-320.
The authors of this paper point out that evaluating the statistical significance of a gene ranking is important for understanding the results and for further biological investigations; however, they also point out that this question has not been well addressed for machine learning methods in existing work. To address this problem, the authors formulate it in the framework of hypothesis testing and propose a solution based on re-sampling. The R-test, proposed by the authors, converts gene ranking results into position p-values to evaluate the significance of genes. After testing three real microarray data sets and three simulated data sets with support vector machines, the authors point out that the p-values may help scientists to analyse the selection results of sophisticated multivariate methods under the same statistical inference paradigm.

10. Li, S. and T. Eng Chong (2005). "Dimension Reduction-Based Penalized Logistic Regression for Cancer Classification Using Microarray Data." IEEE/ACM Trans. Comput. Biol. Bioinformatics 2(2): 166-175.
In this paper, the authors present the use of penalized logistic regression for cancer classification using microarray expression data. They introduce two dimension reduction methods, which are combined with penalized logistic regression to enhance classification accuracy and computational speed. They also compare against two other machine learning methods, support vector machines and least-squares regression. The advantages of the two new methods include explicit output and the selection of penalty parameters and components. They also discuss the application of the methods to cancer classification.

11. Lei, Y. and L. Huan (2004). Redundancy based feature selection for microarray data. Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining. Seattle, WA, USA, ACM.
The two authors, researchers at Arizona State University, work on redundancy based feature selection for microarray data. In this paper, they point out that one challenge in discriminating genes is the large number of features (genes) combined with the small sample size. Traditional methods handle this problem without paying attention to the high degree of redundancy among the genes. The authors point out that removing redundant genes from the selected ones can achieve a better representation of the characteristics of the targeted phenotypes and lead to higher classification accuracy. Moreover, the authors study the relationship between feature relevance and redundancy, and propose an efficient method that effectively removes redundant genes. The method is evaluated in experiments on public microarray data sets.

12. Sung-Bae, C & Hong-Hee, W 2003, Machine learning in DNA microarray analysis for cancer classification, Australian Computer Society, Inc., Adelaide, Australia.
The authors, who work on machine learning, find that it plays an important role in the analysis of DNA microarray data sets. In this paper, they explore many features and classifiers using three benchmark data sets to systematically evaluate the performance of feature selection methods and machine learning classifiers. The three benchmark data sets are the Leukemia cancer data set, the Colon cancer data set and the Lymphoma cancer data set. Various techniques and classification algorithms are involved, including the cosine coefficient and information gain for feature selection, and k-nearest neighbour, support vector machine and self-organizing map (SOM) for classification. According to the experiments they have conducted, ensemble learning with several base classifiers produces the best recognition rate.

13. Raymond, W, Hiroshi, M & Kiyoko, FA 2005, Cleaning microarray expression data using Markov random fields based on profile similarity, ACM, Santa Fe, New Mexico.
The authors, researchers at Kyoto University, Japan, propose a method for cleaning the noise found in microarray expression data sets. This method may improve the data pre-processing stage of current microarray research. The method is based on Markov random fields (MRFs). The cleaning process is guided by genes with similar expression profiles. Their method consists of two steps. In the first step, the expression data is used to infer the profile similarity between every pair of genes; in the second step, these similarities are used to construct an MRF of genes and their associated expression values.

14. Tara, M & Sanjay, C 2007, 'High Confidence Rule Mining for Microarray Analysis', IEEE/ACM Trans. Comput. Biol. Bioinformatics, vol. 4, no. 4, pp. 611-623.
In this paper, the authors present an association rule mining method for mining high-confidence rules, which describe interesting gene relationships in microarray data sets. They note that a new family of row-enumeration rule mining algorithms has emerged to facilitate mining in dense data sets. These algorithms rely on pruning infrequent relationships, reducing the search space by using the support measure. MAXCONF, the new method proposed by the authors, is used to mine high-confidence rules from microarray data. It is a support-free algorithm that directly uses the confidence measure to prune the search space effectively. Finally, three microarray data sets are used in experiments to show that MAXCONF outperforms support-based rule mining in scalability and rule extraction.

15. Robert, C & Alberto, R 2006, 'A Robust Procedure For Gaussian Graphical Model Search From Microarray Data With p Larger Than n', J. Mach. Learn. Res., vol. 7, pp. 2621-2650.
In this paper, the authors consider limited-order partial correlations, which are partial correlations computed on marginal distributions of manageable size. These are contrasted with full-order partial correlations, the prime objects of inference, which are partial correlations between two variables given all the remaining ones. The authors provide a set of rules that allow one to assess the usefulness of limited-order partial correlations for deriving the independence structure of the underlying Gaussian graphical model. They also introduce a novel structure learning procedure based on a quantity called the non-rejection rate. The applicability and usefulness of the procedure are demonstrated on both simulated and real data.

16. Hongxing, H, Huidong, J, Jie, C, Damien, M, Jiuyong, L & Tony, F 2006, Analysis of breast feeding data using data mining methods, Australian Computer Society, Inc., Sydney, Australia.
In this paper, the authors use data mining methods to analyse breast feeding data. They aim to demonstrate the benefit of using data mining techniques on survey data to which statistical analysis is usually applied. A range of questionnaires was sent out to collect data on the decision whether or not to breast feed a newborn baby. Many typical data mining methods and algorithms, such as decision trees, regression approaches, and information gain, are used to identify groups with a high risk of not breast feeding. The outcome of this study not only confirms the results of the survey, but also suggests that data mining approaches are applicable to other similar survey data. Data mining methods, which enable a search for hypotheses, may be used as a complementary tool to traditional statistical analysis.

17. Yuchun, T, Yan-Qing, Z & Zhen, H 2007, 'Development of Two-Stage SVM-RFE Gene Selection Strategy for Microarray Expression Data Analysis', IEEE/ACM Trans. Comput. Biol. Bioinformatics, vol. 4, no. 3, pp. 365-381.
The authors of this paper point out that, although many data mining methods have been used in the data preparation step of cancer classification, the Support Vector Machine Recursive Feature Elimination (SVM-RFE) algorithm is one of the best gene feature selection algorithms. Unlike other related studies, the authors study a new two-stage SVM-RFE algorithm. The first stage is designed to effectively eliminate most of the irrelevant, redundant, and noisy genes while keeping information loss small. The second stage conducts a fine selection of the final gene subset. The new method, according to the authors, overcomes the instability problem of SVM-RFE to achieve better algorithm utility. The new two-stage SVM-RFE is reported to be more accurate and reliable than the traditional SVM-RFE.

18. Sandrine, D, Mark, JvdL, Sündüz, K, Annette, MM, Sandra, ES & Siew Leng, T 2003, 'Loss-based estimation with cross-validation: applications to microarray data analysis', SIGKDD Explor. Newsl., vol. 5, no. 2, pp. 56-68.
The authors of this paper study loss-based estimation with cross-validation in applications to microarray data analysis. They propose a unified loss-based methodology for estimator construction, selection, and performance assessment with cross-validation. In contrast to traditional methods, the parameter of interest is defined as the risk minimizer for a suitable loss function, and candidate estimators are generated using this loss function. Based on this definition, cross-validation is then applied to select an optimal estimator among the candidates and to assess the overall performance of the resulting estimator. The methodology can be used for the prediction of biological and clinical outcomes.

19. Miguel, R, Rui, M, Paulo, M, Daniel, G-P & Florentino, F-R 2007, A platform for the selection of genes in DNA microarray data using evolutionary algorithms, ACM, London, England.
The authors provide a platform for the selection of genes in DNA microarray data using evolutionary algorithms. They present a flexible framework for the task of feature selection in the classification of DNA microarray data. Evolutionary algorithms with variable-sized, set-based representations are used to reduce the number of attributes. Three distinct classifiers, namely 1-nearest neighbour, decision trees and SVMs, are compared in two case studies to demonstrate the performance of the new platform.

20. Zdravko, M & Ingrid, R 2006, 'An introduction to the WEKA data mining system', SIGCSE Bull., vol. 38, no. 3, pp. 367-368.
WEKA is written in Java and widely used in the data mining industry, and it plays an important role in my project on rules in DNA microarray technology. In this article, the authors give an overview of the WEKA data mining system. The goal of the article is to introduce its packages and the pedagogical possibilities for its use. WEKA contains a rich range of powerful machine learning algorithms for data mining tasks, including pre-processing, classification, and clustering, together with a graphical user interface. The article introduces the topics covered in WEKA, including data pre-processing and visualization, attribute selection, association rules, classification algorithms (OneR, decision trees, and covering rules), prediction algorithms, evaluation techniques and clustering (K-means, EM, Cobweb).

21. Saharon, R 2005, Robust boosting and its relation to bagging, ACM, Chicago, Illinois, USA.
The author points out in this paper that boosting and bagging are two approaches to combining weak models in order to build prediction models that are significantly better. He also points out that the general theoretical and practical consensus is that the weak learners for boosting should be really weak, while the weak learners for bagging should actually be strong.
He presents an approach of weight decay for observation weights, which is equivalent to robustifying the underlying loss function. He also illustrates the practical usefulness of weight decay for improving prediction performance and presents an equivalence between one form of weight decay and Huberizing, which is a statistical method for making loss functions more robust.

22. Hong, H, Jiuyong, L, Hua, W, Grant, D & Mingren, S 2006b, A maximally diversified multiple decision tree algorithm for microarray data classification, Australian Computer Society, Inc., Hobart, Australia.
In this paper, the authors investigate the idea of using diversified multiple trees for microarray data classification. They propose the Maximally Diversified Multiple Trees (MDMT) algorithm, which makes use of a set of unique trees in the decision committee. By comparing MDMT with some well-known ensemble methods, including AdaBoost, Bagging, and Random Forests, they find that both MDMT and CS4 are more accurate on average than AdaBoost, Bagging, and Random Forests. Comparing those two algorithms, CS4 is capable of finding informative genes and combinations of informative genes with either informative genes or less informative genes; MDMT is capable of discovering combinations of informative genes with informative genes, and of less informative genes with less informative genes. Both have strengths and weaknesses. The authors also discuss possible improvements.

23. Chen, X, Li, J, Daggard, G & Huang, X 'Finding Similar Patterns in Microarray Data', Lecture notes in computer science, pp. 1272-1276.
In this paper, the authors propose a clustering algorithm called s-Cluster for the analysis of gene expression data based on pattern similarity. According to the authors, the algorithm captures the right clusters, exhibiting strongly similar expression patterns in microarray data. Unlike other algorithms, s-Cluster allows a high level of overlap among discovered clusters without completely grouping them.
This reflects the fact, known from biological research, that not all functions of genes are turned on in an experiment. The authors apply the s-Cluster algorithm to yeast Saccharomyces cerevisiae cell cycle expression data. The results show that the s-Cluster algorithm groups genes with strongly similar expression patterns and that the clusters found are interpretable.

24. Li, J, Topor, R & Shen, H 2002, 'Construct robust rule sets for classification', Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 564-569.
In this paper, the authors study the problem of computing classification rule sets from relational databases. Traditional methods aim to improve prediction accuracy on data without missing attribute values, so they do not work properly on real data. In other words, they only work for data that is perfect and has no missing values. The concept introduced in this paper is more robust: it is able to make more accurate predictions on test data with missing attribute values. The k-optimal rule set introduced by the authors leads to a hierarchy of k-optimal rule sets in which decreasing size corresponds to decreasing robustness. Two methods to find k-optimal rule sets are then introduced, namely an optimal association rule mining approach and a heuristic approximate approach. The authors use experiments to show that the k-optimal rule set performs better than a typical classification rule set on test data.

25. Li, J, Shen, H & Topor, R 2004, 'Mining Informative Rule Set for Prediction', Journal of Intelligent Information Systems, vol. 22, no. 2, pp. 155-174.
In this paper, the authors define a new rule set, called the informative rule set, for mining transaction databases. It is much smaller than the traditional association rule set but makes the same predictions.
The large number of rules generated by traditional association rule mining is unnecessary. The advantages of the informative rule set are that it is not constrained to particular target items and that it is smaller than the non-redundant association rule set. The authors also present an algorithm that generates the informative rule set directly, without first generating all frequent itemsets. They support their claims with a series of experiments showing that the informative rule set is smaller and can be generated efficiently.

26. Diaz-Uriarte, R & Alvarez de Andres, S 2006, 'Gene selection and classification of microarray data using random forest', BMC Bioinformatics, vol. 7, no. 3.
The authors first point out that most researchers try to identify the smallest possible set of genes that can still achieve good predictive performance. Random forest is a classification algorithm well suited to microarray data, because it shows excellent performance even when most predictive variables are noise, and it can be used when the number of variables is much larger than the number of observations. The authors study the use of random forest for the classification of microarray data and propose a new method of gene selection in classification problems based on random forest. Using simulated data and nine microarray data sets, the authors show that random forest has comparable performance to other classification methods, including DLDA, KNN, and SVM.

27. Yang, YH, Dudoit, S, Luu, P, Lin, DM, Peng, V, Ngai, J & Speed, TP 2002, 'Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation', Nucleic Acids Research, vol. 30, no. 4, p. e15.
In this paper, the authors first point out that in cDNA microarray experiments, systematic variation often affects the measured gene expression levels. The term normalization refers to the process of removing such variation.
Researchers try to adjust the distribution of the intensity log ratios so that it has a median of zero for each slide. In this paper, the authors propose normalization methods that are based on robust local regression and account for intensity and spatial dependence in dye biases for different types of cDNA microarray experiments. Lastly, according to the authors, “a robust method based on maximum likelihood estimation is proposed to adjust for scale differences among slides”.

28. Bozinov, D & Rahnenfuhrer, J 2002, Unsupervised technique for robust target separation and analysis of DNA microarray spots through adaptive pixel clustering, Oxford Univ Press, pp. 747-756.
Because of their characteristic imperfections, microarray images always challenge existing analytical methods. Problems such as irregular contours, donut shapes, and artifacts call for a new approach to ensure accurate data extraction from these images. The authors introduce a novel method for intensity assessment of gene spots. It is based on clustering the pixels of a target area into foreground and background. Two clustering algorithms, k-means and Partitioning Around Medoids (PAM), are used in the new method. The results of implementing the new method show that PX(PAM) and PX(KMEANS) are highly robust against various types of imperfection. According to the authors, the implementation of this method is a combination of two complementary tools, Extractiff (Java) and Pixclust.

29. Liu, L, Hawkins, DM, Ghosh, S & Young, SS 2003, 'Robust Singular Value Decomposition Analysis of Microarray Data', Proceedings of the National Academy of Sciences of the United States of America, vol. 100, no. 23, pp. 13167-13172.
The authors, researchers at the University of California, Berkeley, USA, are interested in working out a statistical technique to help discern possible patterns in biological samples in microarray data.
They find a technique that applies a combination of mathematical and statistical methods to progressively take the data set apart, so that different aspects can be examined for both general patterns and specific effects. Because of extreme values (outliers), missing values, and abnormal values, the authors develop a robust analysis method to deal with the problem. The benefits of this method include the understanding of large-scale shifts and the isolation of particular sample-by-gene effects.

30. Bolshakova, N, Azuaje, F & Cunningham, P 2005, An integrated tool for microarray data clustering and cluster validity assessment, Oxford Univ Press, pp. 451-455.
The authors, researchers at Trinity College Dublin, work on a data mining system that allows the application of multiple clustering and cluster validity algorithms to DNA microarray data. This tool, called the Machaon CVE system, can be used not only for clustering but also for evaluating the clustering scheme, that is, for cluster validation. Five validation techniques and two clustering techniques have been implemented in this system. In future, it can improve the quality of dataset analysis outcomes and may support the prediction of relevant clusters in the microarray area. This systematic evaluation approach would significantly aid genome expression analyses for knowledge discovery applications. Its clustering and validation functionalities may be used not only in DNA microarray expression analysis applications, but also on other biomedical and physical data with no limitations.

Appendix-A
Diagram (Babu, 2004b) of the process by which microarray slides are produced from living cells.
(A) Microarrays are usually glass or polymer slides, onto which DNA molecules are attached at fixed locations called spots or features. Each spot contains an oligonucleotide sequence or genomic DNA that uniquely represents a gene.
(B) Two groups of microarray samples are under test condition A and normal condition B, respectively. The messenger RNA (mRNA) is extracted from the cells, since it carries the information of all genes that will be expressed. After labeling with dyes, cDNA is generated.

Appendix-B
Diagram provided by Becquet et al (2002).