Lab 3 – Feature Selection with Genetic Data (source: KDnuggets)
In Lab 2 we cleaned the genetic data from the work done by Todd Golub et al. at the MIT Whitehead Institute (now
MIT Broad Institute) on leukemia. The work is described in their paper "Molecular Classification of Cancer:
Class Discovery and Class Prediction by Gene Expression Monitoring".
The original data contained 3 files:
table_ALL_AML_samples.txt
data_set_ALL_AML_train.txt
data_set_ALL_AML_independent.txt
and can be found here:
http://www.kdnuggets.com/data_mining_course/data/ALL_AML_original_data.zip
The cleaning involved removing the control genes and the call fields and changing the
“Gene.Accession.Number” attribute title to “ID”. The data was then transposed, normalized (to be between 20
and 16,000), and merged by “ID” with ALL/AML classification information from a separate file. This was done
for both the training and the test data sets and the results were exported as a csv file.
We will start where we left off (at the “bonus” question at the end of lab 2).
1. Import or re-import the cleaned data set.
You may use the file you created or import the file below. The file below has been transposed back to genes
as rows. This will be easier for the first analysis. If you use your own file, you may wish to transpose it.
A cleaned version of the gene data provided by KDnuggets can be found here:
http://www.kdnuggets.com/data_mining_course/data/ALL_AML_train_processed.zip
Import the file “ALL_AML_gr.thr.train.csv”. In the code below, the name “cleaned” will refer to the
imported data frame.
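A minimal sketch of the import step with a basic sanity check. The helper name load_cleaned is hypothetical, and the tiny file written below is a stand-in for the real data, not part of it. The sketch assumes the first column holds the gene ID and every other column is a numeric sample.

```r
# Hypothetical helper: read a cleaned gene file and check its layout.
load_cleaned <- function(path) {
  df <- read.csv(path, stringsAsFactors = FALSE)
  # Assumption: column 1 is the gene ID, all remaining columns are numeric.
  stopifnot(all(sapply(df[-1], is.numeric)))
  df
}

# Demonstration on a tiny stand-in file (made-up values).
tmp <- tempfile(fileext = ".csv")
writeLines(c("ID,S1,S2,S3",
             "geneA,20,40,80",
             "geneB,100,100,100"), tmp)
demo <- load_cleaned(tmp)
dim(demo)   # rows = genes, columns = ID plus samples
```

With the real file the call would be cleaned <- load_cleaned("ALL_AML_gr.thr.train.csv"). The code in this lab expects the sample values in columns 2 to 39.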
2. Examine the gene variation. For the following, just use the “train” data set. You can use the code
provided or write your own.
a. Compute a fold difference for each gene. Fold difference is the maximum value across samples divided
by the minimum value. This value is frequently used by biologists to assess gene variability.
You can add a “fold column” at the end with:
for(i in 1:7070){cleaned[i,40]=max(cleaned[i,2:39])/min(cleaned[i,2:39])}
names(cleaned)[40]="fold"
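The same fold column can be computed without an explicit loop. The sketch below uses a made-up three-gene data frame named toy in place of cleaned; on the real data the sample columns would be 2:39 rather than 2:4.

```r
# Toy stand-in for "cleaned": an ID column followed by sample columns.
toy <- data.frame(ID = c("geneA", "geneB", "geneC"),
                  S1 = c(20, 100, 50),
                  S2 = c(40, 100, 200),
                  S3 = c(80, 100, 1600),
                  stringsAsFactors = FALSE)

# Work on a numeric matrix so row-wise max and min are unambiguous.
samples <- as.matrix(toy[, 2:4])
toy$fold <- apply(samples, 1, max) / apply(samples, 1, min)
toy$fold
```

A gene whose values are identical across all samples gets a fold difference of exactly 1, which is the smallest value the ratio can take.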
b. What is the largest fold difference and how many genes have it?
c. What is the lowest fold difference and how many genes have it?
You can use the following to get the frequency of any value, for example 1:
table(cleaned$fold==1)
d. Count how many genes have fold ratio in the following ranges
Range               Count
Val <= 2
2 < Val <= 4
4 < Val <= 8
8 < Val <= 16
16 < Val <= 32
32 < Val <= 64
64 < Val <= 128
128 < Val <= 256
256 < Val <= 512
512 < Val
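You can assign each gene to a range with the following. The breaks start at 0 and end at Inf so that the first
range (Val <= 2) and the last range (512 < Val) are included instead of becoming NA:
cleaned2<-cbind(cleaned,fold.group=cut(cleaned$fold,breaks=c(0,2,4,8,16,32,64,128,256,512,Inf)))
table(cleaned2$fold.group)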
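A small sketch of how cut and table fill in the Range/Count table, using made-up fold values. By default cut builds intervals that are open on the left and closed on the right, which matches ranges of the form lo < Val <= hi. The lower break of 0 and upper break of Inf are there so that no value falls outside the bins.

```r
fold <- c(1, 1.5, 2, 3, 4, 10, 40, 600)   # made-up fold values
breaks <- c(0, 2, 4, 8, 16, 32, 64, 128, 256, 512, Inf)

# Each value is assigned to an interval (lo, hi].
fold.group <- cut(fold, breaks = breaks)

# One count per range, including empty ranges.
counts <- table(fold.group)
counts
```

If any entry of fold.group is NA, a value fell outside the breaks and would be missing from the counts.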
e. Create a graph that displays the fold ratio distribution appropriately. Comment on the gene
variability.
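One possible graph, as a sketch: because the ranges double at each step, a histogram of log2(fold) with unit-wide bins lines up with them. The fold values below are simulated for illustration only; on the real data you would pass cleaned$fold and may need a wider set of breaks.

```r
set.seed(1)
fold <- 2^runif(200, min = 0, max = 9)   # simulated fold values, not real data

# Unit-wide bins on the log2 scale correspond to the doubling ranges.
h <- hist(log2(fold), breaks = 0:10,
          main = "Distribution of fold difference",
          xlab = "log2(fold)", ylab = "Number of genes")
h$counts   # number of genes per bin
```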
3. Finding the most significant genes. In the train set, samples 1-27 belong to class ALL, and samples 28-38
to class AML.
Let Avg1, Avg2 be the average expression values for ALL and AML, respectively.
Let Stdev1, Stdev2 be the sample standard deviations for ALL and AML, respectively.
Signal to Noise (S2N) ratio is defined as (Avg1 - Avg2) / (Stdev1 + Stdev2)
T-value is defined as (Avg1 - Avg2) / sqrt(Stdev1*Stdev1/N1 + Stdev2*Stdev2/N2), where N1 = 27 and N2 = 11
are the numbers of ALL and AML samples.
Note: as defined, the higher the S2N and T-value, the more strongly the gene is correlated with ALL; the lower
(more negative) the value, the more strongly it is correlated with AML.
Create these 6 quantities by adding columns to the end of cleaned.
For row standard deviation you might want this package:
install.packages("matrixStats")
library(matrixStats)
for(i in 1:7070){cleaned[i,41]=rowMeans(cleaned[i,2:28])}
for(i in 1:7070){cleaned[i,42]=rowSds(as.matrix(cleaned[i,2:28]))}
names(cleaned)[41]="Avg1"
names(cleaned)[42]="SD1"
Similarly for Avg2 and SD2, using the AML samples in columns 29:39 and storing the results in columns 43 and 44.
for(i in 1:7070){cleaned[i,45]=(cleaned$Avg1[i]-cleaned$Avg2[i])/(cleaned$SD1[i]+cleaned$SD2[i])}
names(cleaned)[45]= "S2N"
for(i in 1:7070){cleaned[i,46]=(cleaned$Avg1[i]-cleaned$Avg2[i])/sqrt(cleaned$SD1[i]*cleaned$SD1[i]/27+cleaned$SD2[i]*cleaned$SD2[i]/11)}
names(cleaned)[46]= "T.value"
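The six quantities can also be computed for all genes at once. The sketch below runs on a simulated 5-gene matrix, not the real data, and uses base R's sd through apply in place of rowSds so that it has no package dependency. The T-value defined above is the Welch t statistic, so the last lines check one gene against t.test, which performs the Welch test by default.

```r
set.seed(42)
# Simulated expression matrix: 5 genes x 38 samples (not the real data).
expr <- matrix(runif(5 * 38, min = 20, max = 16000), nrow = 5)
all.cols <- 1:27    # samples in class ALL
aml.cols <- 28:38   # samples in class AML

avg1 <- rowMeans(expr[, all.cols])
avg2 <- rowMeans(expr[, aml.cols])
sd1 <- apply(expr[, all.cols], 1, sd)   # sample standard deviation per gene
sd2 <- apply(expr[, aml.cols], 1, sd)
n1 <- length(all.cols)
n2 <- length(aml.cols)

s2n <- (avg1 - avg2) / (sd1 + sd2)
t.value <- (avg1 - avg2) / sqrt(sd1^2 / n1 + sd2^2 / n2)

# Cross-check gene 1 against the Welch t statistic from t.test.
welch <- t.test(expr[1, all.cols], expr[1, aml.cols])$statistic
all.equal(unname(welch), t.value[1])
```

S2N and T-value share the same numerator and both have positive denominators, so they always agree in sign. They can still rank genes differently.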
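a. For each class, select the top 20 genes by S2N ratio (the highest values for ALL, the lowest for AML).
You can sort rows by a column using: cleaned2<-cleaned[order(cleaned$S2N),]
This sorts in ascending order; use order(cleaned$S2N,decreasing=TRUE) to put the highest values first.
b. For each class, select the top 20 genes by T-value.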
c. How many genes are in common between top 20 genes for ALL selected using S2N and those
selected using T-value ? How many genes are in common among top 3 genes in each list?
d. Same question for top genes for "AML".
# Ascending order puts the lowest values first, so these are the top 20 genes for AML:
cleaned2<-cleaned[order(cleaned$S2N),]
cleaned2[1:20,1]
cleaned3<-cleaned[order(cleaned$T.value),]
cleaned3[1:20,1]
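Questions c and d come down to intersecting two lists of gene IDs. The sketch below uses a made-up six-gene table and a hypothetical helper, top_ids, with a top 3 instead of a top 20. On the real data you would pass cleaned and n = 20.

```r
# Made-up scores for six genes (not the real data).
genes <- data.frame(ID = paste0("g", 1:6),
                    S2N = c(0.9, -0.8, 0.5, 0.3, 0.7, -0.6),
                    T.value = c(5.0, -4.0, 1.0, 2.0, 4.5, -3.0),
                    stringsAsFactors = FALSE)

# Hypothetical helper: IDs of the n genes ranked first by a score column.
top_ids <- function(df, score, n, decreasing) {
  df$ID[order(df[[score]], decreasing = decreasing)][1:n]
}

n <- 3
# ALL: highest scores first. AML: lowest scores first.
common.all <- intersect(top_ids(genes, "S2N", n, TRUE),
                        top_ids(genes, "T.value", n, TRUE))
common.aml <- intersect(top_ids(genes, "S2N", n, FALSE),
                        top_ids(genes, "T.value", n, FALSE))
length(common.all)
length(common.aml)
```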
4. For fun bonus: Build a classification model on the full data set.
Open Weka. Click on “Explorer”.
a. In the “Preprocess” tab you can open one of the gene files (your own or those on e-reserve) with genes
in columns:
ALL_AML_gr.thr.train.t.csv (the cleaned data)
genes-leukemia.csv (a smaller data set with only 40 attributes, based on the feature selection)
Click on the attribute name to see a summary on the right. Click on “Visualize All” to see a summary of
all of the attributes at once. You can remove attributes by clicking on them and then pressing “Remove”.
You will want to remove (or at least exclude) the ID field.
b. Go to the “Classify” tab and try building a model using OneR, NaiveBayesSimple, or J4.8 (listed in Weka as
J48). Make sure to choose “CLASS” as the class attribute before you click “Start”. Test options include
“Use training set” (which evaluates the model on the same data it was built from), “Supplied test set” (if
you have a separate test set to supply), and “Percentage split” (for data sets that you want Weka to split
into a training and test set for you).
c. You can also try the “Select attributes” tab to have Weka narrow down the attributes for you.
d. Comment on any interesting findings or errors and the attributes that are selected.