Lab 2 – Data Preprocessing with Genetic Data (source: KDnuggets)

The process of data mining frequently has many small steps that all need to be done correctly to get good results. However tedious these steps may seem, the goal is often a worthy one. With this particular data the goal is to help make an early diagnosis of leukemia, a common form of cancer, so the data should be processed carefully.

We will be using Excel and R to explore and preprocess a genetic data set, though you should feel free to use software that you are familiar with. Though R is not useful for certain pieces, its benefit lies in the ability to explore the data during and after preprocessing.

The data for this lab comes from the pioneering work by Todd Golub et al. at MIT Whitehead Institute (now MIT Broad Institute). The analysis done by the MIT biologists is described in their paper Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring (pdf).

The data is contained in a zip file here: http://www.kdnuggets.com/data_mining_course/data/ALL_AML_original_data.zip
It contains 3 files:
   table_ALL_AML_samples.txt
   data_set_ALL_AML_train.txt
   data_set_ALL_AML_independent.txt

The “samples” file has sample collection, treatment, and diagnosis information. The “train” and “independent” text files contain gene information for each of the samples. Both train and test (“independent”) datasets are tab-delimited files with 7130 records. The train file should have 78 fields and the test file 70 fields. The first two fields are Gene Description (a long description like GB DEF = PDGFRalpha protein) and Gene Accession Number (a short name like X95095_at). The remaining fields come in pairs: an expression value in a column headed by the sample number (e.g. 1, 2, ..., 38), followed by an Affymetrix "call" (P if the gene is present, A if absent, M if marginal).
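The field layout described above can be sanity-checked in R before any cleaning. The sketch below is only an illustration under stated assumptions: it parses a two-sample inline stand-in instead of the real file, the expression values and the probe name AFFX-demo_at are made up, and expected.fields is a hypothetical helper, not part of the lab.

```r
# Two-sample stand-in for data_set_ALL_AML_train.txt.
# Expression values are made up; "AFFX-demo_at" is a hypothetical probe name.
raw <- paste(
  "Gene Description\tGene Accession Number\t1\tcall\t2\tcall",
  "GB DEF = PDGFRalpha protein\tX95095_at\t151\tA\t263\tP",
  "demo control probe\tAFFX-demo_at\t72\tA\t21\tM",
  sep = "\n"
)

# check.names = FALSE keeps the headers exactly as they appear in the file,
# including the repeated "call" header.
tab <- read.table(text = raw, header = TRUE, sep = "\t", quote = "",
                  check.names = FALSE, stringsAsFactors = FALSE)

# Each "call" column holds P/A/M flags; the column just before it holds
# the expression values and is headed by the sample number.
is.call <- names(tab) == "call"
sample.ids <- names(tab)[which(is.call) - 1]

# Hypothetical helper: 2 descriptive fields plus a (value, call) pair per sample.
expected.fields <- function(n.samples) 2 + 2 * n.samples

ncol(tab)             # fields in the stand-in
expected.fields(38)   # 78, the field count stated for the train file
expected.fields(34)   # 70, the field count stated for the test file
```

On the real files, replacing text = raw with the file name should give the same kind of check; if ncol does not match the stated field count, the delimiter or header handling is the first thing to look at.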
You can think of the training data as a very tall and narrow table with 7130 rows and 78 columns, but you should note that it is "sideways" from a machine learning point of view: the attributes (genes) are in rows, and the observations (samples) are in columns. This is the standard format for microarray data, but to use it with machine learning tools like WEKA, we will need to transpose (flip) the matrix to make files with genes in columns and samples in rows. Here is a small extract:

Gene Description                          Gene Accession Number   1     call   2     call   ...
GB DEF = GABAa receptor alpha-3 subunit   A28102_at               151   A      263   P      ...
...                                       AB000114_at             72    A      21    A      ...
...                                       AB000115_at             281   A      250   P      ...
...                                       AB000220_at             36    A      43    A      ...

Since we will use the training set to create the class predictor and the test set to check it, both sets need to be cleaned. In general, it is good practice to create separate files for each step. After each step, report the number of fields and records. When in R, you can do this with the nrow and ncol commands. Additionally, show what the first few rows look like using the “head” R command (when appropriate). You may also wish to export the files to text or csv files between steps.

1. Obtain the zip file containing the data. Unzip the data and rename the train file to ALL_AML_grow.train.orig.txt and the test file to ALL_AML_grow.test.orig.txt
Convention: use the same file root for files of similar type and use different extensions for different versions of these files. Here "orig" stands for original input files and "grow" stands for genes in rows. We will use the extension .tmp for temporary files that are typically used for just one step in the process and can be deleted later.

Data Reduction in Excel:
(Note, you can do this in R using commands such as grep.)
Open the training set txt file to see what the data looks like. Select all of the data and copy it into Excel.
Excel should recognize that the data is tab delimited and place the data into separate columns. If not, you may need to “paste special” and choose delimited.

2. Classify the attributes as binary, discrete, or continuous. Additionally classify them as qualitative (nominal or ordinal) or quantitative (interval or ratio).

3. Initial Microarray Cleaning Steps
a. Remove the initial records with Gene Description containing "control". (Those are Affymetrix controls, not human genes.) Call the resulting files ALL_AML_grow.train.noaffy.tmp and ALL_AML_grow.test.noaffy.tmp. How many control records did you delete?
b. Remove the first field (long description) and the "call" fields, i.e. keep fields numbered 2,3,5,7,9,...
c. Transpose the data. (Note this could be done in R using transposed.frame <- t(old.frame), though care has to be taken due to changes in variable names.) Use extensions gcol.test.tmp and gcol.train.tmp ("gcol" stands for genes in columns). These files should each have 7071 fields, with 39 records in the "train" and 35 records in the "test" dataset.
d. Export the file to a tab delimited txt file.

Preprocessing in R:
To read a text file into R, use the following command:
   framename <- read.table("file_name.txt", header=TRUE, sep="\t")
Note: You may need to change your working directory. If the file is imported without a problem, R will not say anything after the command. To check the data and see what the beginning looks like, type head(framename) at the command prompt.
Unless you take any special action, read.table reads all the columns as character vectors and then tries to select a suitable class for each variable in the data frame. It tries in turn logical, integer, numeric and complex, moving on if any entry is not missing and cannot be converted. If all of these fail, the variable is converted to a factor (R versions before 4.0.0) or left as a character vector (R 4.0.0 and later, where stringsAsFactors defaults to FALSE).

4. Import the data. Notice that R renamed the attributes. Describe what happened.

5. Inspect the data you just imported with the following commands:
   nrow(framename) gives the number of rows (excluding the header)
   ncol(framename) gives the number of columns
   attributes(framename) gives the column and row names and the data class
   summary(framename) gives the count by value for qualitative attributes and the five-number summary for quantitative attributes

6. Microarray Cleaning Continued
a. Change "Gene.Accession.Number" to "ID" in the first record (the header). There are a couple of ways to do this. One is to create a variable called “ID” with the same values and then delete “Gene.Accession.Number”. Another way is to install the package “plyr” and then use the following command:
   framename <- rename(framename, c("Gene.Accession.Number"="ID"))
To download and install a package, use the following commands:
   install.packages("plyr")
   library(plyr)
b. Normalize the data: clip every expression value to the range 20 to 16,000, so that values below 20 become 20 and values above 16,000 become 16,000. (Note: Expression values less than 20 or over 16,000 were considered unreliable by the biologists for this experiment.) Indicate that you have done this with the extensions grow.train.norm.tmp and grow.test.norm.tmp
Sample R code (append the class column afterwards if it has already been merged in):
   normal <- data.frame(ID = df$ID, pmin(pmax(as.matrix(df[,2:7071]), 20), 16000))
c. Import the file table_ALL_AML_samples.txt (you may want to split it into 2 text files, one for “train” and one for “independent”). You will need to rename the first column to ID, and you probably want to rename the second as well (for example ALL.AML or ID.class). We will only be using the first 2 columns/attributes from this file, so you have the option of altering the text file before importing or of keeping only the first 2 columns once imported, using something like
   new.df <- subset(df, select = c("ID", "ID.class"))
Save the results as idclass.train.txt and idclass.test.txt.
d. Merge the files you just created with the appropriate gene data files by ID so that the ALL/AML class information is now available.
To merge data frames in R use:
   New.train <- merge(df1, df2, by="ID", all=TRUE)
How many of the individuals are classified as ALL and how many as AML in each of the training and test files?
e. Export the file as a csv file. Here is R code for exporting data:
   write.table(dataframe.name, file = "filename.txt", append = FALSE, quote = FALSE, sep = " ", eol = "\n", na = "NA", dec = ".", row.names = TRUE, col.names = TRUE)
   write.csv(...)
The data is now clean enough to build a model. We will do this later.

7. (Bonus) Should you make it through all of the cleaning, here are some questions asking you to examine the gene variation. For the following, just use the “train” data set.
a. Compute a fold difference for each gene. Fold difference is the maximum value across samples divided by the minimum value across samples. This value is frequently used by biologists to assess gene variability.
b. What is the largest fold difference and how many genes have it?
c. What is the lowest fold difference and how many genes have it?
d. Count how many genes have a fold difference in each of the following ranges:

Range               Count
Val <= 2            ..
2 < Val <= 4        ..
4 < Val <= 8        ..
8 < Val <= 16       ..
16 < Val <= 32      ..
32 < Val <= 64      ..
64 < Val <= 128     ..
128 < Val <= 256    ..
256 < Val <= 512    ..
512 < Val           ..

e. Graph the fold difference distribution appropriately.
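
One possible way to approach the bonus questions in R is sketched below. It is an illustration, not a reference solution: the data frame train, its gene names and its values are made up, and it assumes the data has already been transposed (genes in columns) and clipped to 20..16,000, so no gene has a minimum of zero.

```r
# Toy stand-in for the cleaned training data: samples in rows, genes in columns.
# Gene names and values are made up for illustration.
train <- data.frame(
  ID    = c(1, 2, 3, 4),
  geneA = c(20, 20, 20, 20),
  geneB = c(20, 35, 60, 80),
  geneC = c(100, 400, 1600, 16000),
  geneD = c(50, 75, 90, 100)
)

# Keep only the gene columns.
genes <- train[, setdiff(names(train), "ID")]

# Fold difference per gene: maximum across samples divided by minimum.
fold <- sapply(genes, function(x) max(x) / min(x))

# Largest and smallest fold difference, and how many genes share each.
max.fold <- max(fold)
n.max    <- sum(fold == max.fold)
min.fold <- min(fold)
n.min    <- sum(fold == min.fold)

# Bin into the ranges of the table: Val <= 2, 2 < Val <= 4, ..., 512 < Val.
# right = TRUE makes each interval closed on the right, matching "<=".
breaks    <- c(0, 2^(1:9), Inf)
fold.bins <- table(cut(fold, breaks = breaks, right = TRUE))
fold.bins
```

For part e, barplot(fold.bins) or hist(log2(fold)) are two reasonable starting points; because the ranges double at each step, a log2 scale keeps the bins evenly spaced.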