Genetic algorithms applied to multi-class prediction for the analysis of gene expression data
C.H. Ooi & Patrick Tan
Presentation by Tim Hamilton

“Genechips”
• DNA microarrays – a collection of microscopic DNA spots, each representing a single gene.
• Commonly used to monitor the expression levels of thousands of genes at once.

Classification
• Gene expression data is commonly used to classify a biological sample, for example by:
  - Tumor subtype
  - Response to certain types of treatment (e.g. chemotherapy)
• Most approaches focus on classifying two, or at most three, classes and have high error rates when run on sets containing multiple classes (19%).
• The authors propose using a GA to analyze multiple-class expression data.
• Previous rank-based approaches perform worse because:
  1) They miss correlations between genes.
  2) The predictor set size must be specified.
• Data sets used for the GA:
  – NCI60: expression profiles of 64 cancer cell lines containing 9703 cDNA sequences.
  – GCM: expression profiles for 198 tumor samples, 90 normal samples, and 20 unknowns containing 16063 genes.
  – Both data sets were pre-processed to generate a truncated 1000-gene data set: color ratio of a single spot – color ratio of all spots / standard deviation. The genes with the highest standard deviation were kept.

Choosing a GA chromosome
• Determine a minimum and maximum predictor set size for selection: [Rmin, Rmax].
• Chromosome string: [R g1 g2 … gRmax]
  - R is the size of the predictive set.
  - Any genes past position R are ignored.
  - Genes are chosen from the list of 1000.
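The chromosome layout on the slide above can be illustrated with a minimal Python sketch. The function names (`random_chromosome`, `predictor_set`) are hypothetical, and the slides do not say how duplicate genes within one chromosome are handled, so this sketch simply draws gene indices independently.

```python
import random

N_GENES = 1000  # size of the truncated gene list from the slides


def random_chromosome(r_min, r_max, n_genes=N_GENES, rng=random):
    """Build [R, g1, ..., gRmax]: R lies in [r_min, r_max], followed by r_max gene indices."""
    r = rng.randint(r_min, r_max)  # inclusive on both ends
    # Assumption: genes are drawn independently, so duplicates are possible.
    genes = [rng.randrange(n_genes) for _ in range(r_max)]
    return [r] + genes


def predictor_set(chromosome):
    """Return the first R gene indices; genes past position R are ignored."""
    r = chromosome[0]
    return chromosome[1:1 + r]


if __name__ == "__main__":
    chrom = random_chromosome(5, 10)
    print("chromosome:", chrom)
    print("active predictor genes:", predictor_set(chrom))
```

Storing R inside the string lets the GA search over the predictor set size and the gene choice at the same time, which is the limitation of rank-based methods that the slides point out.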
Parameters
• Population size: 100
• Generations: 100
Other parameters were varied:
• Crossover method: one-point or uniform
• Selection method: stochastic universal sampling (SUS) or roulette wheel selection (RWS)
• Probability of crossover: 0.7 – 1.0
• Probability of mutation: 0.0005 – 0.01
• Predictor set size range [Rmin, Rmax]: [5, 10], [11, 15], [16, 20], [21, 25], [26, 30]
• For each predictor set size range this produced 96 different runs.
• Run on both the truncated set and the full data set for comparison.

• Each generation of chromosomes is used to classify the data sets using a maximum likelihood (MLHD) method.
• Fitness = 200 – (E1 + E2)
  - E1 = cross-validation error rate
  - E2 = independent test error rate
• The MLHD classifier involves a lot of math, but is based upon Bayes' rule.
• Two previous rank-based methods were run on the same truncated data set for comparison.

Results
• Uniform crossover produced the best predictors in size ranges [11, 15] and [16, 20].
• One-point crossover was best in ranges [5, 10], [21, 25] and [26, 30].
• Predictive accuracies were higher when run against the truncated data set.

Results vs. Other Methods
• Finally, the GA was compared to another method using SVM classification.
• The SVM performed best when all 16063 genes of a data set were used: 22% error.
• The GA used only 32 elements: 18% error.
