Lab 2 – Data Preprocessing with Genetic Data (source: KDnuggets)
The process of data mining frequently has many small steps that all need to be done correctly to get good
results. However tedious these steps may seem, the goal is often a worthy one. With this particular data the
goal is to help make an early diagnosis for leukemia, a common form of cancer, and one should take care to
process it carefully.
We will use Excel and R to explore and preprocess a genetic data set, though you should feel free to use
software that you are familiar with. Although R is not useful for certain pieces, its benefit lies in the ability to
explore the data during and after preprocessing.
The data for this lab comes from the pioneering work by Todd Golub et al. at the MIT Whitehead Institute (now
the MIT Broad Institute). The analysis done by the MIT biologists is described in their paper Molecular
Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring (pdf).
The data is contained in a zip file here:
It contains 3 files –
The “samples” file has sample collection, treatment, and diagnosis information. The “train” and
“independent” text files contain gene information for each of the samples. Both train and test (“independent”)
datasets are tab-delimited files with 7130 records. The train file should have 78 fields and test 70 fields. The
first two fields are Gene Description (a long description like GB DEF = PDGFRalpha protein) and Gene
Accession Number (a short name like X95095_at). The remaining fields are pairs of a sample number (e.g.
1,2,..38) and an Affymetrix "call" (P if the gene is present, A if absent, M if marginal).
You can think of the training data as a very tall and narrow table with 7130 rows and 78 columns, but you
should note that it is "sideways" from a machine learning point of view: the attributes (genes) are in rows, and
the observations (samples) are in columns. This is the standard format for microarray data, but to use it with
machine learning tools like WEKA, we will need to transpose (flip) the matrix to make files
with genes in columns and samples in rows.
Here is a small extract:
Gene Description                    Gene Accession Number
GB DEF = GABAa receptor alpha-3
Since we will use the training set to create the class predictor and the test set to check it, both sets need to be
cleaned. In general, it is good practice to create separate files for each step.
After each step, report the number of fields and records. When in R, you can do this with nrow and ncol
commands. Additionally include what the first few rows look like using the “head” R command (when
You may also wish to export the files to text or csv files between steps.
1. Obtain the zip file containing the data. Unzip the data and rename the train file to
ALL_AML_grow.train.orig.txt and test file to ALL_AML_grow.test.orig.txt
Convention: use the same file root for files of similar type and use different extensions for different versions
of these files. Here "orig" stands for original input files and "grow" stands for genes in rows. We will use
extension .tmp for temporary files that are typically used for just one step in the process and can be deleted
Data Reduction in Excel: (Note, you can do this in R using commands such as grep.)
Open the training set txt file to see what the data looks like. Select all of the data and copy it into Excel.
Excel should recognize that the data is tab delimited and properly place the data columns into separate
columns in Excel. If not, you may need to “paste special” and choose delimited.
2. Classify the attributes as binary, discrete, and continuous. Additionally classify them as qualitative
(nominal or ordinal) or quantitative (interval or ratio).
3. Initial Microarray Cleaning Steps
a. Remove the initial records with Gene Description containing "control". (Those are Affymetrix
controls, not human genes). Call the resulting files ALL_AML_grow.train.noaffy.tmp
How many control records did you delete?
b. Remove the first field (long description) and the "call" fields, i.e. keep fields numbered 2,3,5,7,9,...
c. Transpose the data. (Note this could be done in R using transposed.frame <- t(old.frame), though
care has to be taken due to changes in variable names.)
Use extensions gcol.test.tmp and gcol.train.tmp ("gcol" stands for genes in columns). These files
should each have 7071 fields, and 39 records in "train", 35 records in "test" datasets.
d. Export the file to a tab delimited txt file.
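For readers who prefer to do steps 3a to 3c in R rather than Excel, the following is a minimal sketch on a made-up four-gene table. The toy values, the object names (grow, noaffy, values, gcol) and every accession number except X95095_at are illustrative assumptions, not taken from the real files.

```r
# Toy "genes in rows" table: description, accession number, then value/call pairs.
grow <- data.frame(
  Gene.Description      = c("toy spike-in (endogenous control)",
                            "GB DEF = GABAa receptor alpha-3",
                            "GB DEF = PDGFRalpha protein",
                            "toy gene"),
  Gene.Accession.Number = c("toy_ctrl_at", "toy_2_at", "X95095_at", "toy_4_at"),
  X1 = c(-214, 151, 72, 1023), call   = c("A", "A", "P", "P"),
  X2 = c(-139, 263, 21, 1529), call.1 = c("A", "P", "A", "P"),
  stringsAsFactors = FALSE
)

# 3a. Drop rows whose description mentions "control" (case-insensitive).
is_control <- grepl("control", grow$Gene.Description, ignore.case = TRUE)
noaffy     <- grow[!is_control, ]
n_deleted  <- sum(is_control)

# 3b. Drop field 1 (long description) and the "call" fields:
#     keep fields 2, 3, 5, 7, ...
keep   <- c(2, seq(3, ncol(noaffy), by = 2))
values <- noaffy[, keep]

# 3c. Transpose so that samples are rows and genes are columns.
#     t() returns a matrix, so transpose only the numeric part and
#     restore the gene names as column names afterwards.
gcol <- as.data.frame(t(values[, -1]))
colnames(gcol) <- values$Gene.Accession.Number
gcol <- cbind(ID = rownames(gcol), gcol)
rownames(gcol) <- NULL

n_deleted   # number of control records removed
dim(gcol)   # records and fields after the transpose
```

Transposing only the numeric columns avoids t() coercing every value to character, which is what happens when the accession-number column is included in the transpose.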
Preprocessing in R:
To read a text file into R, use the following command:
framename <- read.table("file_name.txt", header = TRUE, sep = "\t")
Note: You may need to change your working directory.
If it is imported without a problem, R will not say anything after the command. To check the data and see
what the beginning looks like, type head(framename) at the command prompt.
Unless you take any special action, read.table reads all the columns as character vectors and then tries to select
a suitable class for each variable in the data frame. It tries in turn logical, integer, numeric and complex,
moving on if any entry is not missing and cannot be converted. If all of these fail, the variable is converted to
a factor in R versions before 4.0.0; from R 4.0.0 onward it is left as a character vector unless
stringsAsFactors = TRUE is set.
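The import behaviour described above can be tried on a tiny file before loading the real data. This is a sketch under assumptions: the three-line file is made up and written to a temporary path so the example is self-contained.

```r
# Write a tiny tab-delimited file so the example does not depend on the lab data.
path <- tempfile(fileext = ".txt")
writeLines(c("Gene Accession Number\t1\tcall",
             "X95095_at\t72\tP",
             "toy_2_at\t151\tA"), path)

# header = TRUE takes column names from the first line; sep = "\t" splits on tabs.
framename <- read.table(path, header = TRUE, sep = "\t")

head(framename)            # first rows of the imported data
sapply(framename, class)   # class that read.table chose for each column
names(framename)           # column names after R made them syntactically valid
```

Comparing names(framename) with the header line in the file shows how read.table (with its default check.names = TRUE) adjusts names that contain spaces or start with a digit.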
4. Import the data. Notice that R renamed the attributes. Describe what happened.
5. Inspect the data you just imported with the following commands:
nrow(framename) = number of rows (excluding the header)
ncol(framename) = number of columns
attributes(framename) gives the column and row names and the data class.
summary(framename) gives the count by value for qualitative attributes and five-number summary
for quantitative attributes
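Because the lab asks for the number of fields and records after every step, it can help to wrap the inspection commands in a small function. The helper name report_step and the toy frame below are hypothetical, not part of the lab materials.

```r
# Toy frame standing in for an imported data set.
framename <- data.frame(
  ID        = c("1", "2", "3"),
  X95095_at = c(72, 21, 130),
  ALL.AML   = factor(c("ALL", "ALL", "AML"))
)

# Hypothetical helper: print record/field counts and the first rows,
# and return the counts invisibly so they can be stored.
report_step <- function(frame, label) {
  cat(label, ":", nrow(frame), "records,", ncol(frame), "fields\n")
  print(head(frame, 3))
  invisible(c(records = nrow(frame), fields = ncol(frame)))
}

counts <- report_step(framename, "toy frame")
summary(framename)   # counts for factors, five-number summary plus mean for numerics
```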
6. Microarray Cleaning Continued
a. Change "Gene.Accession.Number" to "ID" in the first record. There are a couple of ways to do this.
One is to create a variable called “ID” with the same values and then delete the original variable.
Another way is to install the package “plyr” and then use the following command:
framename <- rename(framename, c(oldname="newname"))
To download and install a package, use the following commands:
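A minimal sketch of the rename in base R follows, on a made-up two-row frame. The commented lines show the standard install.packages and library calls for fetching plyr from CRAN; they are left commented out so that the example runs without internet access.

```r
# Toy frame with the column name that needs changing.
framename <- data.frame(Gene.Accession.Number = c("X1", "X2"),
                        X95095_at = c(72, 21))

# Base R: overwrite the matching entry of names().
names(framename)[names(framename) == "Gene.Accession.Number"] <- "ID"
names(framename)

# Usual way to download and load a CRAN package, then rename with plyr:
# install.packages("plyr")
# library(plyr)
# framename <- rename(framename, c(Gene.Accession.Number = "ID"))
```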
b. Normalize the data by thresholding: set every value below 20 to 20 and every value above 16,000 to 16,000.
(Note: The expression values less than 20 or over 16,000 were considered by biologists unreliable for
this experiment.)
Indicate that you have done this with the extensions grow.train.norm.tmp and grow.test.norm.tmp
Sample R code (lower bound only): normal <- data.frame(ID = df$ID, replace(df[,2:7071], df[,2:7071] < 20, 20), ID.class = df$ID.class)
The upper bound of 16,000 can be handled the same way.
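Both bounds can also be applied in one pass with pmax and pmin. The sketch below uses a made-up frame with two gene columns; the column names gene_a and gene_b are placeholders for the real accession numbers.

```r
# Toy frame: one ID column followed by gene expression columns.
df <- data.frame(ID     = c("X1", "X2", "X3"),
                 gene_a = c(-214, 151, 20000),
                 gene_b = c(19, 16000, 16001))

# Every column except ID holds expression values.
gene_cols <- setdiff(names(df), "ID")

# pmax() raises values below 20 to 20; pmin() lowers values above 16000 to 16000.
normal <- df
normal[gene_cols] <- lapply(df[gene_cols],
                            function(x) pmin(pmax(x, 20), 16000))

range(unlist(normal[gene_cols]))   # should now lie within 20 and 16000
```

Selecting the gene columns by name rather than by position (2:7071) keeps the code valid if a class column is added before or after this step.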
c. Import the file table_ALL_AML_samples.txt (you may want to split it into 2 text files, one for
“train” and one for “independent”). You will need to rename the first column to ID, and you
probably want to rename the second as well (for example ALL.AML or ID.class). We will only be
using the first 2 columns/attributes from this file, so you have the option of altering the text file before
importing or keeping only the first 2 columns once imported, using something like
new.df <- subset(df, select = c("ID", "ID_class"))
Name the resulting files idclass.train.txt and idclass.test.txt.
d. Merge the files you just created with the appropriate gene data files by ID so that the ALL/AML class
information is now available. To merge data frames in R use:
New.train <- merge(df1 , df2 , by="ID", all=TRUE)
How many of the individuals are classified as ALL and how many as AML in each of the training and
test files?
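The subset, merge and counting steps can be sketched as follows on made-up data. The column names Samples and BM.PB and all values are assumptions for illustration; the real samples file may use different headers.

```r
# Toy gene data (samples in rows) and toy sample information.
genes   <- data.frame(ID = c(1, 2, 3, 4),
                      X95095_at = c(72, 21, 130, 45))
samples <- data.frame(Samples = c(1, 2, 3, 4),
                      ALL.AML = c("ALL", "ALL", "AML", "ALL"),
                      BM.PB   = c("BM", "BM", "PB", "BM"))

# Keep only the first two columns and give them the names used for merging.
idclass <- subset(samples, select = c(Samples, ALL.AML))
names(idclass) <- c("ID", "ID.class")

# all = TRUE keeps unmatched rows from either side, filled with NA.
New.train <- merge(genes, idclass, by = "ID", all = TRUE)

# Count the classes; useNA = "ifany" exposes rows that failed to match.
class_counts <- table(New.train$ID.class, useNA = "ifany")
class_counts
```

The ID values must agree exactly in both frames. If the gene file carries names such as X1 while the samples file carries 1, the merge produces unmatched rows, and these show up as an NA count in the table.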
e. Export the file as a csv file.
Here is R code for exporting data:
write.table(framename, file = "filename.csv", append = FALSE, quote = FALSE, sep = ",",
eol = "\n", na = "NA", dec = ".", row.names = FALSE,
col.names = TRUE)
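A shorter route to a csv file is write.csv, which is a wrapper around write.table with the separator fixed to a comma. The sketch below writes a made-up frame to a temporary directory and reads it back to confirm that nothing was lost; the file name toy_train.csv is an arbitrary choice.

```r
# Toy merged frame standing in for the cleaned training data.
New.train <- data.frame(ID = c(1, 2),
                        X95095_at = c(72, 21),
                        ID.class = c("ALL", "AML"))

# row.names = FALSE avoids an extra unnamed first column in the csv.
out <- file.path(tempdir(), "toy_train.csv")
write.csv(New.train, file = out, row.names = FALSE, quote = FALSE)

# Read the file back and compare dimensions with the original.
check <- read.csv(out)
dim(check)
```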
The data is now clean enough to build a model. We will do this later.
7. (Bonus) Should you make it through all of the cleaning, here are some questions asking you to examine
the gene variation. For the following, just use the “train” data set.
a. Compute a fold difference for each gene. Fold difference is the maximum value across samples
divided by minimum value. This value is frequently used by biologists to assess gene variability.
b. What is the largest fold difference and how many genes have it?
c. What is the lowest fold difference and how many genes have it?
d. Count how many genes have a fold difference in each of the following ranges:
Val <= 2
2 < Val <= 4
4 < Val <= 8
8 < Val <= 16
16 < Val <= 32
32 < Val <= 64
64 < Val <= 128
128 < Val <= 256
256 < Val <= 512
512 < Val
e. Graph the distribution of fold differences appropriately.
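One possible way to approach the bonus questions in R is sketched below on a made-up frame with four genes. It assumes the values have already been thresholded to the range 20 to 16,000, so that every minimum is positive and the ratio is well defined; the gene names are placeholders.

```r
# Toy thresholded training data: samples in rows, genes in columns.
train <- data.frame(ID     = c("X1", "X2", "X3"),
                    gene_a = c(20, 20, 20),
                    gene_b = c(20, 100, 60),
                    gene_c = c(20, 16000, 500),
                    gene_d = c(100, 300, 200))
gene_cols <- setdiff(names(train), "ID")

# Fold difference per gene: maximum across samples divided by minimum.
fold <- sapply(train[gene_cols], function(x) max(x) / min(x))

# Largest and smallest fold difference and how many genes attain each.
max_fold <- max(fold); n_max <- sum(fold == max_fold)
min_fold <- min(fold); n_min <- sum(fold == min_fold)

# Bin into Val <= 2, 2 < Val <= 4, ..., 512 < Val.
# cut() uses right-closed intervals (a, b] by default.
breaks     <- c(-Inf, 2^(1:9), Inf)
bin_counts <- table(cut(fold, breaks = breaks))
bin_counts

# Bar chart of the binned counts; labels are rotated for readability.
barplot(bin_counts, las = 2, ylab = "Number of genes",
        main = "Fold difference distribution (toy data)")
```

Because the bins double in width, a bar chart of the binned counts is effectively a histogram on a log2 scale, which suits a quantity that spans several orders of magnitude.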