
Gene selection: choice of parameters of the GA/KNN method
January 9, 2002
Kim Hye Jin
Intelligent Multimedia Lab.
[email protected]

Contents
- Methods
  - Data sets
  - Methodology: k-NN, genetic algorithm
  - Parameters: sensitivity, reproducibility, and stability
- Result
- Discussion

Methods
- Data sets
  - Lymphoma data: 4026 genes, 47 samples (original split: 34 training / 13 test)
  - Colon data: 2000 genes, 57 samples (original split: 40 training / 17 test)
- K-nearest neighbor
- Genetic algorithm

K-nearest neighbors
- k = 3 (default)
- Each sample is described by d genes
- Neighbors are determined by Euclidean distance
- Consensus rule: assign a class only if all 3 neighbors belong to the same class
- Majority rule: assign a class if 2 out of 3 neighbors belong to the same class

Genetic Algorithm
- Operations: selection, mutation
- Chromosome: a set of genes
- N: dimension of the chromosome, i.e. the number of genes in each chromosome
- f_i: fitness function
  - If all k neighbors agree with the known class of a training sample, a score of 1 is assigned to that sample
  - The scores are summed and divided by M (the number of samples in the training set)

Genetic Algorithm: Selection
- Selection among chromosomes follows the survival-of-the-fittest principle
- The single best chromosome from each niche is entered into the respective subsequent niche deterministically
- The remaining places are filled according to the relative fitness of the chromosomes

Genetic Algorithm: Mutation
- Purpose: evolvability by introducing new genes
- 1. Which chromosome? Chosen with a probability proportional to its fitness rank
- 2. How many genes? Between 1 and 5; the number of mutations is assigned randomly with probabilities 0.53125, 0.25, 0.125, 0.0625, and 0.03125
- 3. Which genes? Randomly selected, and replaced randomly by genes not already in the chromosome
- Stop: when 10000 high-R^2 chromosomes have been obtained

Parameters
- Sensitivity: choice of d (5, 10, 20, 30, 40, 50)
- Reproducibility: independent re-runs of the GA/KNN method on the same data
- Stability: reassignment of the 'training' and 'test' sets (original / random / discrepant)

Result
- Sensitivity
- Reproducibility
- Stability

Sensitivity (1)
- A few genes dominate the selection when d is small
- As d increases, more peaks arise and the pattern of gene selection stabilizes

Sensitivity (2)
- Gene selection is insensitive to the choice of d between 20 and 50

Sensitivity (3)
- Classification of the test set samples is insensitive to the choice of d

Reproducibility
- The same GA/KNN procedure is repeated on the same training set with different seed numbers
- Reproducibility is high for all choices of d

Stability (1)
- Selection of optimal genes is insensitive to the choice of training set (run with d = 40, following the sensitivity and reproducibility results)
- Training sets compared: original / random / discrepant
  - Original: randomly shuffled
  - Random: N samples chosen randomly from the whole data set
  - Discrepant: the last N samples

Stability (2)
- Gene selection: 25 to 37 of the top 50 genes appear in both sets being compared
- [Figure: gene selection for the colon and lymphoma data; panels compare original vs. random and original vs. discrepant]

Stability (3)
- Classification of test samples in the lymphoma data
  - Original: 2 errors
  - Random / discrepant: 0 errors

Discussion (1)
- Choice of d
  - As d increases, the pattern of gene selection stabilizes
  - d between 20 and 50 gave the best result
- Choice of the termination R^2
  - R^2 = (M-1)/M or (M-2)/M
  - Little effect on the selection
  - Computationally more rapid

Discussion (2)
- The choice of the number of top genes for classification: 50 to 200
- Information/noise content, lymphoma data case
  - Consensus rule: 31%
  - Majority rule: 61%
  - Much of the data does not contribute information
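
A minimal sketch of the fitness score and the mutation step described in the Methods slides may help make them concrete. It is not the original implementation. The slides do not say how neighbors are drawn when scoring a training sample, so leave-one-out over the training set is assumed here. The function names (`knn_vote`, `fitness`, `mutate`), the `min_votes` parameter, and the representation of a chromosome as a list of gene indices are all hypothetical.

```python
import math
import random
from collections import Counter

# Probabilities of 1..5 mutations per chromosome, as listed on the Mutation slide.
MUTATION_COUNT_PROBS = [0.53125, 0.25, 0.125, 0.0625, 0.03125]


def knn_vote(train_x, train_y, query, k=3):
    """Return (label, votes) for the most common class among the k nearest rows."""
    nearest = sorted(range(len(train_x)),
                     key=lambda i: math.dist(train_x[i], query))[:k]
    label, votes = Counter(train_y[i] for i in nearest).most_common(1)[0]
    return label, votes


def fitness(samples, labels, chromosome, k=3, min_votes=3):
    """Fraction of training samples whose neighbors agree with their known class.

    Assumption: leave-one-out scoring. min_votes=3 corresponds to the
    consensus rule, min_votes=2 to the majority rule (for k = 3).
    """
    # Keep only the d genes named by the chromosome.
    reduced = [[row[g] for g in chromosome] for row in samples]
    score = 0
    for i, query in enumerate(reduced):
        rest_x = reduced[:i] + reduced[i + 1:]
        rest_y = labels[:i] + labels[i + 1:]
        label, votes = knn_vote(rest_x, rest_y, query, k)
        if votes >= min_votes and label == labels[i]:
            score += 1  # this sample scores 1
    return score / len(samples)  # summed scores divided by M


def mutate(chromosome, n_genes, rng):
    """Replace 1 to 5 randomly chosen genes with genes not in the chromosome."""
    n_mut = rng.choices([1, 2, 3, 4, 5], weights=MUTATION_COUNT_PROBS)[0]
    n_mut = min(n_mut, len(chromosome))
    present = set(chromosome)
    unused = [g for g in range(n_genes) if g not in present]
    new = list(chromosome)
    for pos in rng.sample(range(len(new)), n_mut):
        new[pos] = unused.pop(rng.randrange(len(unused)))
    return new
```

Under these assumptions, a run would stop collecting a chromosome once `fitness(...)` reaches the termination value, for example (M-1)/M. Choosing which chromosome to mutate (by fitness rank) and the niche-based selection are left out of the sketch.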
