Using gene expression to predict cancer - extended
We have access to gene expression data from tissues of 18 healthy persons and
22 persons with cancer. You will not be told the name of the cancer, and the
names of the genes have been modified.
The measurements have been obtained using DNA microarrays and not by qPCR as
in your 5th semester. Usually, the raw data given by this method need to be
adjusted to compensate for measurement errors, and normalised to compensate for
differences in the quantity of tissue among the samples (NB: this does not involve
housekeeping genes).
The data have already been adjusted but not normalised. However, they can be
studied directly without major consequences, so we will not focus on normalisation
techniques.
If a person has cancer, one sick tissue and one healthy tissue are taken from
that person. This means that you get a data frame with 22*2+18=62 rows, one per
tissue sample. The number of variables (columns) is 1993:
• sample: identifies a person
• type: cancer or not
• columns 3-1993: gene expression variables.
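As a rough illustration of this layout, the following Python sketch builds placeholder rows with the dimensions described above. The values are random numbers, not real expression data. The identifiers (H1, C1, ...), the function name and the choice to let type label the tissue rather than the person are assumptions made only for this example.

```python
import random

N_HEALTHY = 18            # healthy persons, one tissue sample each
N_CANCER = 22             # persons with cancer, two tissue samples each
N_COLUMNS = 1993          # sample, type, then the gene expression variables
N_GENES = N_COLUMNS - 2   # columns 3-1993


def make_placeholder_rows(seed=0):
    """Return rows shaped like the described data frame, filled with random numbers."""
    rng = random.Random(seed)

    def fake_expression():
        # Placeholder values only; they carry no biological meaning.
        return [rng.gauss(0.0, 1.0) for _ in range(N_GENES)]

    rows = []
    for k in range(1, N_HEALTHY + 1):
        rows.append([f"H{k}", "healthy"] + fake_expression())
    for k in range(1, N_CANCER + 1):
        # Assumption: both tissues of a patient share one sample identifier,
        # and `type` labels the tissue rather than the person.
        rows.append([f"C{k}", "healthy"] + fake_expression())
        rows.append([f"C{k}", "cancer"] + fake_expression())
    return rows


rows = make_placeholder_rows()
print(len(rows), "rows x", len(rows[0]), "columns")  # 62 rows x 1993 columns
```

Rows that share a sample identifier come from the same person, so they are paired rather than independent observations.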
The aim of the project is to find a model that predicts cancer from gene
expression. Notice that there are many more genes than samples, so it will be
problematic to repeat the methods seen in your 5th semester. Therefore, modified
techniques should be used, for instance ridge and lasso regression.
A good description of lasso regression is freely available in the following book:
http://www.stanford.edu/~hastie/StatLearnSparsity/
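To make the idea of lasso regression more concrete, here is a minimal Python sketch of coordinate descent with soft-thresholding, applied to a small made-up data set with more variables than samples. It is only an illustration under several assumptions: the outcome is coded as +1 (cancer) and -1 (not cancer) and treated as a numeric response, no intercept is fitted, the toy data and all function names are hypothetical, and the penalty value is chosen arbitrarily. In practice an established package and a penalty chosen by cross-validation would normally be used.

```python
import random


def soft_threshold(value, threshold):
    """Shrink value towards zero by threshold; return 0 inside [-threshold, threshold]."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_objective(X, y, beta, lam):
    """(1 / 2n) * residual sum of squares + lam * sum of |beta_j|."""
    n = len(X)
    rss = sum((y[i] - sum(x * b for x, b in zip(X[i], beta))) ** 2 for i in range(n))
    return rss / (2 * n) + lam * sum(abs(b) for b in beta)


def lasso_coordinate_descent(X, y, lam, n_sweeps=200):
    """Minimise the lasso objective by updating one coefficient at a time."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X beta, since beta starts at zero
    col_scale = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_sweeps):
        for j in range(p):
            if col_scale[j] == 0.0:
                continue  # a constant-zero column carries no information
            # Covariance of column j with the residual that ignores gene j.
            rho = sum(X[i][j] * (residual[i] + X[i][j] * beta[j]) for i in range(n)) / n
            new_beta = soft_threshold(rho, lam) / col_scale[j]
            delta = new_beta - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new_beta
    return beta


def make_toy_data(n=20, p=50, seed=1):
    """Hypothetical data: only gene 0 differs between the two groups."""
    rng = random.Random(seed)
    y = [1.0 if i % 2 == 0 else -1.0 for i in range(n)]  # +1 = cancer, -1 = not
    X = [[2.0 * y[i] + rng.gauss(0.0, 0.5)] + [rng.gauss(0.0, 1.0) for _ in range(p - 1)]
         for i in range(n)]
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    beta = lasso_coordinate_descent(X, y, lam=0.5)
    selected = [j for j, b in enumerate(beta) if b != 0.0]
    print("genes with non-zero coefficients:", selected)
```

The soft-thresholding step is what sets many coefficients exactly to zero, so the lasso also performs variable selection. Ridge regression uses a squared penalty instead, which shrinks the coefficients without setting them exactly to zero.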