Using gene expression to predict cancer - extended

We have access to gene expression measurements from tissues of 18 healthy persons and 22 persons with cancer. You will not be given the name of the cancer, and the names of the genes have been modified.

The measurements were obtained using DNA microarrays, not qPCR as in your 5th semester. Usually, the raw output of this method needs to be adjusted to compensate for measurement errors, and normalised to compensate for differences in the quantity of tissue among the samples (NB: this does not involve housekeeping genes). The data have already been adjusted but not normalised. However, they can be studied directly without major consequences, so we will not focus on normalisation techniques.

If a person has cancer, one sick tissue and one healthy tissue are taken from that same person. This means that you get a data frame with 22*2+18=62 rows, one per tissue sample. The number of variables (columns) is 1993:
• sample: identifies a person
• type: cancer or not
• columns 3-1993: gene expression variables.

The aim of the project is to find a model that predicts cancer from gene expression. Notice that there are many genes, far more than there are samples, so it will be problematic to repeat the methods seen in your 5th semester. Therefore, modified techniques should be used, for instance ridge and lasso regression. A good description of lasso regression is freely available in the following book: http://www.stanford.edu/~hastie/StatLearnSparsity/
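
To give a feel for why the lasso is useful when there are more genes than samples, here is a minimal sketch in Python that fits a lasso by coordinate descent on synthetic data. Everything in it is an assumption made for illustration and not part of the project material: the simulated expression values, the choice of 150 genes of which the first three are informative, the penalty value `lam=0.1`, and the helper names `soft_threshold`, `standardise` and `lasso_coordinate_descent`. For simplicity it codes the type as 0/1 and uses a squared-error loss, which is only one possible way to set up the problem.

```python
import random


def soft_threshold(z, gamma):
    """Shrink z towards zero by gamma; values within [-gamma, gamma] become 0."""
    if z > gamma:
        return z - gamma
    if z < -gamma:
        return z + gamma
    return 0.0


def standardise(X):
    """Return a copy of X whose columns have mean 0 and mean square 1."""
    n, p = len(X), len(X[0])
    out = [row[:] for row in X]
    for j in range(p):
        mean = sum(row[j] for row in X) / n
        sd = (sum((row[j] - mean) ** 2 for row in X) / n) ** 0.5
        for i in range(n):
            out[i][j] = (X[i][j] - mean) / sd
    return out


def lasso_coordinate_descent(X, y, lam, n_sweeps=100):
    """Minimise (1/(2n)) * sum((y - b0 - X b)^2) + lam * sum(|b_j|).

    X is a list of n rows (samples) with p standardised columns (genes).
    Returns the intercept and the list of coefficients.
    """
    n, p = len(X), len(X[0])
    b0 = sum(y) / n                      # optimal intercept for centred columns
    beta = [0.0] * p
    residual = [yi - b0 for yi in y]
    for _ in range(n_sweeps):
        for j in range(p):
            # Covariance of gene j with the residual, adding back its own effect
            rho = sum(X[i][j] * residual[i] for i in range(n)) / n + beta[j]
            new_bj = soft_threshold(rho, lam)
            delta = new_bj - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= delta * X[i][j]
                beta[j] = new_bj
    return b0, beta


# Synthetic stand-in for the real data (assumption): 22 cancer tissues,
# 40 healthy tissues, 150 made-up genes, the first three shifted in cancer.
rng = random.Random(0)
n_cancer, n_healthy, n_genes = 22, 40, 150
y = [1.0] * n_cancer + [0.0] * n_healthy
X = [[rng.gauss(0.0, 1.0) for _ in range(n_genes)] for _ in y]
for i in range(n_cancer):
    for j in range(3):
        X[i][j] += 2.0

b0, beta = lasso_coordinate_descent(standardise(X), y, lam=0.1)
selected = [j for j, b in enumerate(beta) if b != 0.0]
print("genes kept by the lasso:", len(selected), "out of", n_genes)
```

The point of the sketch is that the penalty sets most coefficients exactly to zero, so the fitted model involves only a subset of the genes. For the actual project, an established lasso or ridge implementation with a penalty chosen by cross-validation would be the more sensible choice. Bear in mind that the two tissues taken from the same person are not independent samples.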