Lab 3 – Feature Selection with Genetic Data (source: KDnuggets)

In Lab 2 we cleaned the genetic data from the leukemia study by Todd Golub et al. at the MIT Whitehead Institute (now the MIT Broad Institute). The work is described in their paper "Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring".

The original data contained 3 files:

    table_ALL_AML_samples.txt
    data_set_ALL_AML_train.txt
    data_set_ALL_AML_independent.txt

and can be found here: http://www.kdnuggets.com/data_mining_course/data/ALL_AML_original_data.zip

The cleaning involved removing the control genes and the call fields and renaming the “Gene.Accession.Number” attribute to “ID”. The data was then transposed, normalized (to be between 20 and 16,000), and merged by “ID” with the ALL/AML classification information from a separate file. This was done for both the training and the test data sets, and the results were exported as a csv file.

We will start where we left off (at the “bonus” question at the end of Lab 2).

1. Import or re-import the cleaned data set.

You may use the file you created or import the file below. The file below has been transposed back to genes as rows, which will be easier for the first analysis. If you use your own file, you may wish to transpose it.

A cleaned version of the gene data provided by KDnuggets can be found here: http://www.kdnuggets.com/data_mining_course/data/ALL_AML_train_processed.zip

Import the file “ALL_AML_gr.thr.train.csv”. In the code below, the name “cleaned” refers to the imported data frame.

2. Examine the gene variation.

For the following, use only the “train” data set. You can use the code provided or write your own.

a. Compute a fold difference for each gene. Fold difference is the maximum value across samples divided by the minimum value. This value is frequently used by biologists to assess gene variability.

You can add a “fold” column at the end with:

    # column 1 is the ID; columns 2:39 hold the 38 training samples
    for(i in 1:7070){cleaned[i,40]=max(cleaned[i,2:39])/min(cleaned[i,2:39])}
    names(cleaned)[40]="fold"

b. What is the largest fold difference and how many genes have it?

c. What is the lowest fold difference and how many genes have it?

You can use the following to get the frequency of any value, for example 1:

    table(cleaned$fold==1)

d. Count how many genes have a fold ratio in each of the following ranges:

    Range               Count
    Val <= 2
    2 < Val <= 4
    4 < Val <= 8
    8 < Val <= 16
    16 < Val <= 32
    32 < Val <= 64
    64 < Val <= 128
    128 < Val <= 256
    256 < Val <= 512
    512 < Val

The breaks must start below the smallest fold value and end with Inf; otherwise genes in the first and last ranges are assigned NA:

    cleaned2<-cbind(cleaned,fold.group=cut(cleaned$fold,breaks=c(0,2,4,8,16,32,64,128,256,512,Inf)))
    table(cleaned2$fold.group)

e. Create a graph that displays the fold ratio distribution appropriately. Comment on the gene variability.

3. Finding the most significant genes.

For the train set, samples 1-27 belong to class ALL and samples 28-38 to class AML.
Let Avg1, Avg2 be the average expression values for ALL and AML, respectively.
Let Stdev1, Stdev2 be the sample standard deviations for ALL and AML, respectively.
Let N1 = 27 and N2 = 11 be the number of ALL and AML samples, respectively.

Signal to Noise (S2N) ratio is defined as

    (Avg1 - Avg2) / (Stdev1 + Stdev2)

T-value is defined as

    (Avg1 - Avg2) / sqrt(Stdev1*Stdev1/N1 + Stdev2*Stdev2/N2)

Note: as defined, the higher the S2N and T-value, the more strongly the gene is associated with ALL; the lower (more negative) the value, the more strongly it is associated with AML.

Create these 6 quantities by adding columns to the end of cleaned. For the row standard deviation you might want this package:

    install.packages("matrixStats")
    library(matrixStats)
    # columns 2:28 are the 27 ALL samples
    for(i in 1:7070){cleaned[i,41]=rowMeans(cleaned[i,2:28])}
    # rowSds expects a matrix, so convert the one-row data frame first
    for(i in 1:7070){cleaned[i,42]=rowSds(as.matrix(cleaned[i,2:28]))}
    names(cleaned)[41]="Avg1"
    names(cleaned)[42]="SD1"

Similarly for Avg2 and SD2 (columns 43 and 44, computed from the AML samples in columns 29:39).

    for(i in 1:7070){cleaned[i,45]=(cleaned$Avg1[i]-cleaned$Avg2[i])/(cleaned$SD1[i]+cleaned$SD2[i])}
    names(cleaned)[45]="S2N"
    for(i in 1:7070){cleaned[i,46]=(cleaned$Avg1[i]-cleaned$Avg2[i])/sqrt(cleaned$SD1[i]*cleaned$SD1[i]/27+cleaned$SD2[i]*cleaned$SD2[i]/11)}
    names(cleaned)[46]="T.value"

a. Select, for each class, the top 20 genes with the highest S2N ratio. You can sort the rows by a column using:

    cleaned2<-cleaned[order(cleaned$S2N),]

order() sorts in ascending order, so the first rows are the genes most associated with AML. Use order(cleaned$S2N, decreasing=TRUE) to put the top ALL genes first.

b. Select, for each class, the top 20 genes with the highest T-value.

c. How many genes are in common between the top 20 genes for ALL selected using S2N and those selected using T-value? How many genes are in common among the top 3 genes in each list?

d. Same question for the top genes for AML.

    cleaned2<-cleaned[order(cleaned$S2N),]
    cleaned2[1:20,1]
    cleaned3<-cleaned[order(cleaned$T.value),]
    cleaned3[1:20,1]

4. Bonus (for fun): Build a classification model on the full data set.

Open Weka. Click on “Explorer”.

a. In the “Preprocess” tab you can open one of the gene files (your own or those on e-reserve) with genes in columns:

    ALL_AML_gr.thr.train.t.csv (the cleaned data)
    genes-leukemia.csv (a smaller data set with only 40 attributes, based on the feature selection)

Click on an attribute name to see a summary on the right. Click on “Visualize All” to see a summary of all of the attributes at once. You can remove attributes by clicking on them and then pressing “Remove”. You will want to remove (or at least exclude) the ID field.

b. Go to “Classify” and try building a model using OneR, NaïveBayes Simple, or J4.8. Make sure to choose “CLASS” before you click “Start”. You have test options such as “Use training set” (which works well if you also have a test set to supply) or “Percentage Split” (for data sets that you want Weka to split into a training and test set for you).

c. You can also try the “Select attributes” tab to narrow down the attributes for you.

d. Comment on any interesting findings or errors and on the attributes that are selected.
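The row-by-row loops in steps 2 and 3 can also be written without explicit loops. The following is a minimal sketch, not the lab's reference solution: it runs on a randomly generated stand-in called toy (a hypothetical name) with the same layout the lab assumes for cleaned, an ID column followed by 27 ALL and 11 AML sample columns. It uses only base R, with apply(..., 1, sd) in place of rowSds.

```r
# Toy stand-in for "cleaned": random values, NOT real expression data.
set.seed(1)
n_genes <- 100
expr <- matrix(runif(n_genes * 38, min = 20, max = 16000), nrow = n_genes)
toy <- data.frame(ID = paste0("gene", seq_len(n_genes)), expr,
                  stringsAsFactors = FALSE)

all_cols <- 2:28   # samples 1-27 (ALL)
aml_cols <- 29:39  # samples 28-38 (AML)
vals_all <- as.matrix(toy[, all_cols])
vals_aml <- as.matrix(toy[, aml_cols])
vals     <- cbind(vals_all, vals_aml)

# Step 2: fold difference = row maximum / row minimum across all samples
toy$fold <- apply(vals, 1, max) / apply(vals, 1, min)

# Right-closed bins match the "a < Val <= b" ranges of step 2d
toy$fold.group <- cut(toy$fold,
                      breaks = c(0, 2, 4, 8, 16, 32, 64, 128, 256, 512, Inf))
fold_counts <- table(toy$fold.group)

# Step 3: per-class means and sample standard deviations
toy$Avg1 <- rowMeans(vals_all)
toy$Avg2 <- rowMeans(vals_aml)
toy$SD1  <- apply(vals_all, 1, sd)
toy$SD2  <- apply(vals_aml, 1, sd)

n1 <- length(all_cols)  # 27
n2 <- length(aml_cols)  # 11
toy$S2N     <- (toy$Avg1 - toy$Avg2) / (toy$SD1 + toy$SD2)
toy$T.value <- (toy$Avg1 - toy$Avg2) / sqrt(toy$SD1^2 / n1 + toy$SD2^2 / n2)

# Top 20 genes for ALL (largest values) and for AML (smallest values)
top_all_s2n <- toy$ID[order(toy$S2N, decreasing = TRUE)][1:20]
top_all_t   <- toy$ID[order(toy$T.value, decreasing = TRUE)][1:20]
top_aml_s2n <- toy$ID[order(toy$S2N)][1:20]
top_aml_t   <- toy$ID[order(toy$T.value)][1:20]

# Genes shared by the S2N and T-value rankings (steps 3c and 3d)
common_all <- intersect(top_all_s2n, top_all_t)
common_aml <- intersect(top_aml_s2n, top_aml_t)
length(common_all)
length(common_aml)
```

To apply this to the real data, replace toy with the imported cleaned data frame, assuming its columns are ordered as described in step 3. For step 2e, barplot(fold_counts) or hist(log2(toy$fold)) is one possible way to display the distribution.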
