Gene selection: choice of parameters of the GA/KNN method
January 9, 2002
Kim Hye Jin
Intelligent Multimedia Lab.
[email protected]

Contents
- Methods
  - Data sets
  - Methodology: k-NN, genetic algorithm
  - Parameters: sensitivity, reproducibility, and stability
- Result
- Discussion

Methods
- Data sets
  - Lymphoma data: 4026 genes, 47 samples (original D: 34 training / 13 test)
  - Colon data: 2000 genes, 57 samples (original D: 40 training / 17 test)
- K-nearest neighbor
- Genetic algorithm

K-nearest neighbors
- K = 3 (default), d genes
- Distance: Euclidean
- Rules
  - Consensus rule: decide if all 3 neighbors belong to the same class
  - Majority rule: decide if 2 out of 3 neighbors belong to the same class

Genetic Algorithm
- Selection, mutation
- Chromosome
  - N: dimension of the chromosome / the number of genes in each chromosome
- f_i: fitness function
  - if the class membership of all k neighbors agrees with the solution, assign 1
  - the scores are summed and divided by M (the number of samples in the training set)

Genetic Algorithm: selection among chromosomes
- Survival-of-the-fittest principle
- The single best chromosome from each niche is entered deterministically into the respective subsequent niche
- The remaining places are filled according to the relative fitness of the chromosomes

Genetic Algorithm: mutation
- Evolvability by introducing new genes
- 1. Which chromosome? Chosen with a probability proportional to its fitness rank
- 2. How many genes? Between 1 and 5; the number of mutations is assigned randomly:

     Mutations     1        2     3      4       5
     Probability   0.53125  0.25  0.125  0.0625  0.03125

- 3. Which genes? Randomly selected, and replaced randomly from the genes not already in the chromosome
- Stop: when 10000 high-R2 chromosomes are obtained

Parameters
- Sensitivity: choice of d: 5, 10, 20, 30, 40, 50
- Reproducibility: independent re-runs of the GA/KNN method on the same data
- Stability: reassignments of 'training' and 'test' sets: original / random / discrepant

Result
- Sensitivity
- Reproducibility
- Stability

Sensitivity (1)
- A few genes dominate the selection when d is small
- As d increases, more peaks arise and the pattern of gene selection stabilizes

Sensitivity (2)
- Gene selection is insensitive to choices of d between 20~50

Sensitivity (3)
- Classification of the test set samples is insensitive to the choice of d

Reproducibility
- Repeat the same GA/KNN procedure on the same training set with different seed numbers
- Reproducibility is high for all choices of d

Stability (1)
- Selection of optimal genes is insensitive to the choice of training set
- Based on the sensitivity and reproducibility results, d = 40 was used
- Training sets: original / random / discrepant
  - Original: randomly shuffled
  - Random: N samples chosen at random from the whole data set
  - Discrepant: the last N

Stability (2)
- Gene selection: 25~37 of the top 50 genes appear in both
- [Figure: colon and lymphoma panels; axes labeled random, discrepant, and original]

Stability (3)
- Classification of test samples in lymphoma
  - Original: 2 errors
  - Random / discrepant: 0 errors

Discussion (1)
- Choice of d
  - As d increases, the pattern of gene selection stabilizes
  - d in 20~50 gave the best result
- Choice of the termination R2
  - R2 = (M-1)/M or (M-2)/M
  - Little effect on the selection
  - Computationally more rapid

Discussion (2)
- Choice of the number of top genes for classification: 50~200
- Information/noise content
  - Lymphoma data case: consensus rule 31%, majority rule 61%
  - Much of the data does not contribute information
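
Appendix (illustrative sketch, not from the original slides)

The fitness score and the mutation step described under Methods can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the authors' implementation: the function names `knn_fitness` and `mutate` are hypothetical, each training sample is assumed to be scored against the other training samples (leave-one-out), and the consensus rule (all k neighbors agree) is used. The mutation-count probabilities are the ones listed on the mutation slide.

```python
import math
import random

# Probabilities for 1..5 mutations, as listed on the mutation slide.
MUTATION_PROBS = {1: 0.53125, 2: 0.25, 3: 0.125, 4: 0.0625, 5: 0.03125}


def knn_fitness(samples, labels, chromosome, k=3):
    """Consensus-rule score for one chromosome (a list of gene indices).

    Assumption: each training sample is compared with all *other*
    training samples (leave-one-out). A sample scores 1 when all k
    nearest neighbours share its class; scores are summed and divided
    by M, the number of training samples.
    """
    m = len(samples)
    score = 0
    for i in range(m):
        dists = []
        for j in range(m):
            if j == i:
                continue
            # Euclidean distance restricted to the chromosome's genes.
            d2 = sum((samples[i][g] - samples[j][g]) ** 2 for g in chromosome)
            dists.append((math.sqrt(d2), j))
        dists.sort()
        neighbour_labels = [labels[j] for _, j in dists[:k]]
        if all(lab == labels[i] for lab in neighbour_labels):
            score += 1
    return score / m


def mutate(chromosome, n_genes, rng):
    """Replace 1-5 randomly chosen genes with genes not in the chromosome."""
    counts = list(MUTATION_PROBS)
    weights = [MUTATION_PROBS[c] for c in counts]
    n_mut = rng.choices(counts, weights=weights)[0]
    present = set(chromosome)
    unused = [g for g in range(n_genes) if g not in present]
    # Guard for tiny toy inputs; real data has thousands of genes.
    n_mut = min(n_mut, len(chromosome), len(unused))
    child = list(chromosome)
    positions = rng.sample(range(len(child)), n_mut)
    replacements = rng.sample(unused, n_mut)
    for pos, gene in zip(positions, replacements):
        child[pos] = gene
    return child
```

With toy data in which gene 0 separates two classes and gene 1 is noise, `knn_fitness(samples, labels, [0])` is higher than `knn_fitness(samples, labels, [1])`, which is the signal the GA uses when ranking chromosomes. Choosing which chromosome to mutate (by fitness rank) and the niche-based selection are left out of the sketch.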