Weizmann 2015 - Introduction to Matlab & Data Analysis
Final assignment
Analyzing experimental data
Tutor in charge of this HW: Anat Zimmer
E-mail for questions: if you have a question (others probably have the same
question as well), please ask it in the forum.
Use the words "final assignment …" in the subject line.
For personal issues: [email protected]
In this exercise you can choose between analyzing your own
experimental data or the data attached.
General instructions:

Please name your script "final_assignment_ID1_ID2.m".

You should submit:

Your script (one ".m" file)

Your data file (in case you chose to analyze your own data).
Please do not attach any other files.

A program that crashes (for any reason) will earn its owners a
failing grade.

Pay attention to efficiency - do not use (nested) loops when it is not
necessary.

Use inner functions whenever you have a piece of code that performs a
specific task.

Write your code in a readable way - please use proper documentation and
indentation, avoid "magic numbers", give meaningful names to variables,
and avoid excessive whitespace and overly long lines of code.
The goal of the assignment:
The purpose of this exercise is to write a program that reads (your own) experimental
data, then processes, analyzes, and visualizes the results - automatically. Attached you
will find a file named "assignment_prototype.m" - it is a function which includes all the
mandatory steps in this assignment. You should change the name of the function to
"final_assignment_ID1_ID2.m" and implement all the steps listed under the
remarks. You should implement each and every step, even if you analyze your own
data. However, if you analyze your own data, you have the freedom to implement
every step in a way that is relevant for your data, for example a proper
normalization, relevant statistics to test your hypothesis, etc. Explanations about
every step are detailed below. Please read all the steps (before you start writing the code)
to make sure that you can implement them with your own data; if you can't think of a
way to implement any one of the steps with your own data, please use the provided
data.
Explanation about the provided data:
The attached data is taken from the TCGA (The Cancer Genome Atlas) database
(http://cancergenome.nih.gov/). The database is publicly available and contains
many types of information on common types of cancer. The attached data is RNAseq
data (20,501 genes) of 375 melanoma (skin cancer) tumor samples.
RNAseq is a method to measure the amount of mRNA molecules of different genes in
cells. For every one of the 375 samples, cells were taken from a patient (by biopsy or
tumor removal), and their mRNA content was measured in an independent
experiment. We would like to compare between these independent experiments, and
therefore we will need to normalize the data, as explained below.
If you use your own data:
Please write in your code (under remarks) a short explanation of your data: what do
you measure, and how? What is your hypothesis, i.e., what kind of effect are you trying
to show in your experiments? Use a maximum of 5 sentences.
The steps of the program:
1. Load the data. You should read the data all at once (not line by line). If the
reading procedure did not succeed, display the error message: "could not open
the input file: <the file name>", and return. You should read the data from the
current directory.
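A minimal sketch of this step, assuming the attached data is a tab-delimited text file
with one header line; the file name and the struct fields returned by importdata below
are assumptions that depend on the actual file layout:

    % Hypothetical file name - replace with the actual name of the attached file.
    fileName = 'TCGA_melanoma_RNAseq.txt';
    fid = fopen(fileName, 'r');
    if fid == -1
        fprintf('could not open the input file: %s\n', fileName);
        return;
    end
    fclose(fid);
    % Read everything at once (tab-delimited with one header line is assumed here).
    rawData    = importdata(fileName, '\t', 1);
    expression = rawData.data;      % numeric matrix: genes x samples
    geneNames  = rawData.textdata;  % text part: gene names / sample IDs (layout-dependent)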
2. Data processing:
In this part you must implement at least two of the three data-processing
tasks, and all three if your data requires it. Justify your choice in the
code under remarks.
a. Removing outliers. In this step we would like to remove from the data,
in an unbiased way, any data point that can ruin (or does not contribute to)
the statistics of the data. One way to do this is to throw away data points
that are far from the data mean. To do this you should calculate the standard
deviation (std) of the data, then set a threshold (for example T = 2*std),
find all data points that cross the threshold, and remove them. In this
step visualization can be very useful (though not mandatory). After
you clean the data you still want to be left with enough data points,
so you need to find the optimal T. You can count the number of points
before and after cleaning the data, or you can plot the data
before and after the procedure and check that you removed exactly the
outliers. You can also plot a histogram of the data values, see how
the data spreads, and cut the top/bottom 5% of the data, for instance
(cut the bins of data that are far away from the data mean). You can
choose any way to remove outliers; justify your choice in the code:
explain in a remark what you did and why. Since the purpose of the
assignment is to help you analyze your experiments automatically,
think of an iterative way to find the optimal T for every repetition of
your experiment (for every run of the program with different data).
If you chose to use the TCGA data: we would like to remove genes
whose signal was under the detection limit of the method. Therefore
you should find all genes that have zero values and remove them from
the data.
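As an illustration only, a short sketch of this step for the TCGA data, plus a generic
std-threshold variant for your own data; the variable names expression, geneNames,
myData and the threshold T = 2 are assumptions, not part of the provided prototype:

    % TCGA data: remove genes (rows) that contain zero values, i.e. a signal
    % below the detection limit of the method.
    hasZero         = any(expression == 0, 2);
    expressionClean = expression(~hasZero, :);
    geneNamesClean  = geneNames(~hasZero);

    % Generic variant for your own data: remove points that lie more than
    % T standard deviations from the mean (tune T per experiment).
    T           = 2;
    isOutlier   = abs(myData - mean(myData)) > T * std(myData);
    myDataClean = myData(~isOutlier);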
b. Data normalization (if needed). In the case of the provided data, as
explained above, each sample was measured in a separate experiment.
Think of a situation where in one experiment the content of 10,000
cells was measured, while in another experiment the content of 30,000
cells was measured. The number of reads would be 3 times higher on
average in the latter experiment. To compare between experiments
(samples) you should calculate the average of every sample and divide
the sample by its mean expression.
If you use your own data, think about whether and how you should normalize
your data. For instance: if you measure protein levels before and after
treatment, maybe you would want to normalize your data to time zero
(before the drug was added); if you compare between "wt" and other
samples, maybe you would like to normalize all the data to the "wt". If
your experimental data spans many orders of magnitude,
maybe you would want to take log(data), etc. Please justify your
choice in the code under a remark.
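A minimal sketch of the normalization described above for the TCGA data, assuming
expressionClean holds the cleaned genes x samples matrix from the previous step:

    % Divide every sample (column) by its own mean expression so that samples
    % measured at different sequencing depths become comparable.
    sampleMeans          = mean(expressionClean, 1);                       % 1 x nSamples
    expressionNormalized = bsxfun(@rdivide, expressionClean, sampleMeans);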
c. Smooth and filter the data. As explained in class in tutorial 11, we
usually want to filter out noise from the real phenomenon.
Think about the frequencies of your measured phenomenon
and the frequencies of the noise: can you separate the
two and filter out the noise? (If your phenomenon is at high
frequencies you need to filter out the low frequencies, and vice
versa.) For example: you measure protein abundance in a cell that
divides every 24 h. Following cell division the protein accumulates in
the cell and reaches its maximum after ~24 h; then the cell divides and the
protein amount decreases to half of its maximum. If, for instance, you measure the
protein levels every hour and your measurements are noisy (as in
fluorescence microscopy), you might want to use a moving average with a
window size of +/-5 (for example). If you average over 12-24 time
points, you will lose your entire signal; too small a window
might not filter out the noise. Estimate your noise and filter it out if
needed, and justify your choice in the code under a remark. Later, in the
visualization part, jittery lines are an indication that this step wasn't
done properly.
In the TCGA data, you don't need to filter or smooth the data;
please explain in your code why.
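For your own time-course data (not needed for the TCGA data), a moving average like
the one mentioned above could look roughly like this; proteinLevels is a hypothetical
vector of hourly measurements:

    % Centred moving average over +/-5 points around every sample.
    windowSize     = 11;
    kernel         = ones(1, windowSize) / windowSize;
    smoothedLevels = conv(proteinLevels, kernel, 'same');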
3. Calculate statistics for your data and visualize the results: think about the
type of data that you acquired and the effect that you are trying to show,
and use statistical tools to support your hypothesis, such as: mean, median,
min, max, std, correlation coefficient, curve fitting, clustering, etc. You can
also use statistical tools/calculations that have not been shown in class, such as
PCA, MSD, ANOVA, enrichment, p-value, t-test, the Akaike information criterion, etc.
If you analyze your own data you must use: mean or median, std, and at
least one of the two: clustering or curve fitting.
The more the merrier…
Visualization: you should choose at least two types of plots and visualize
your data in a way that best demonstrates your hypothesis.
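A small sketch of the mandatory statistics for your own data (mean, std, and a curve
fit); timePoints and measurements are hypothetical vectors, and the linear model is
only an example - replace it with whatever your hypothesis predicts:

    m = mean(measurements);
    s = std(measurements);
    coeffs = polyfit(timePoints, measurements, 1);   % linear fit as an example
    fitted = polyval(coeffs, timePoints);
    figure;
    plot(timePoints, measurements, 'o', timePoints, fitted, '-');
    xlabel('Time (h)');
    ylabel('Measured value');
    title(sprintf('Linear fit: slope = %.2f (mean = %.2f, std = %.2f)', coeffs(1), m, s));
    legend('Data', 'Fit');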
If you analyze the TCGA data:
We want to cluster the data into subgroups based on the mRNA profiles;
however, analyzing 375 samples in ~20,000 dimensions is hard. Moreover,
many of the genes don't change much, making it difficult to separate
between the groups. Principal component analysis (PCA) is a good way to
overcome the obstacle of too much information and reduce the dimensionality
of the data to a few PCs that explain most of the variation in the data. But
since PCA was not in the course material, we will examine two other ways:
a) Clustering based on the most varying genes
b) Clustering based on the genes that have the highest mean expression
across samples.
a) Clustering based on the most varying genes:
I. Calculate the std of the genes across samples, and sort them in
descending order.
II. Check (and plot) whether there are correlations between the 10 most
varying genes. Answer in the code under a remark: how many pairwise
correlations did you find? Display the names of the correlated genes.
Answer in the code under a remark: look up the correlated genes on
GeneCards - does it make sense that these genes are correlated? Do they
have a related function?
III. Select the 3 genes that vary the most and are not correlated.
IV. Find the optimal K between 3-10, and calculate the K-means clustering
of the 375 samples with the 3 chosen genes.
V. Plot the K-means results. Do you think that the clustering is good? Why?
Answer in the code under a remark.
VI. Plot the gene profile of every cluster: calculate the mean expression
of the genes for every cluster. Create a bar plot of the 10 most highly
expressed genes and the 10 most down-regulated genes (in the same
plot, with the values sorted). Don't forget to put an appropriate
title, axis labels (the names of the genes under the bars on the x-axis),
and appropriate fonts. Please put every two plots as subplots in a separate
figure.
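A rough sketch of steps I-V above, assuming expressionNormalized (genes x samples)
and geneNamesClean from the earlier steps; the 3 chosen genes and the value of K
below are placeholders that you should determine from the correlation plot and from
trying K = 3..10:

    % I. Rank genes by their std across samples.
    geneStd        = std(expressionNormalized, 0, 2);
    [~, sortedIdx] = sort(geneStd, 'descend');
    top10Idx       = sortedIdx(1:10);

    % II. Correlations between the 10 most varying genes (samples as observations).
    corrMat = corrcoef(expressionNormalized(top10Idx, :)');
    figure;
    imagesc(corrMat);
    colorbar;
    set(gca, 'XTick', 1:10, 'XTickLabel', geneNamesClean(top10Idx), ...
             'YTick', 1:10, 'YTickLabel', geneNamesClean(top10Idx));
    title('Correlation between the 10 most varying genes');

    % III-IV. K-means clustering of the samples with 3 chosen genes.
    chosenIdx  = top10Idx(1:3);                        % placeholder - pick uncorrelated genes
    X          = expressionNormalized(chosenIdx, :)';  % samples x 3 genes
    K          = 4;                                    % placeholder - find the optimal K in 3-10
    clusterIdx = kmeans(X, K, 'Replicates', 5);

    % V. Visualize the clustering result.
    figure;
    scatter3(X(:, 1), X(:, 2), X(:, 3), 20, clusterIdx, 'filled');
    xlabel(geneNamesClean{chosenIdx(1)});
    ylabel(geneNamesClean{chosenIdx(2)});
    zlabel(geneNamesClean{chosenIdx(3)});
    title(sprintf('K-means clustering of the samples (K = %d)', K));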
b) Clustering based on the most expressed genes:
I. Calculate the mean of the genes across samples, and sort them in
descending order.
II. Create a clustergram for the 375 samples with the 100 most expressed
genes. A clustergram is a heat map (usually red-green) where you can
see the dendrogram (the cluster tree) and how the rows and columns
cluster (for more info read the help). Let the rows be the samples
and the columns the genes; use the 'correlation' metric for the row pdist
and standardize the rows (why? try to run it without
standardizing the rows and see what happens).
III. Do you think that the clustering is good? How can you tell? Into how many
clusters would you divide the data?
IV. Cluster the data into 3 groups.
V. Plot the gene profile of every cluster: calculate the mean expression
of the genes for every cluster. Create a bar plot of the 10 most highly
expressed genes and the 10 most down-regulated genes (in the same
plot, with the values sorted). Don't forget to put an appropriate
title, axis labels (the names of the genes under the bars), and appropriate
fonts. Please put all three clusters as subplots in the same figure.
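A matching sketch for steps I-II of part (b); clustergram belongs to the Bioinformatics
Toolbox, and the variable names are the same assumptions as above:

    % I. Rank genes by their mean expression across samples.
    geneMean     = mean(expressionNormalized, 2);
    [~, meanIdx] = sort(geneMean, 'descend');
    top100Idx    = meanIdx(1:100);

    % II. Clustergram: rows = samples, columns = genes, 'correlation' row pdist,
    %     rows standardized.
    dataForCluster = expressionNormalized(top100Idx, :)';   % 375 samples x 100 genes
    cg = clustergram(dataForCluster, ...
                     'RowPDist',     'correlation', ...
                     'Standardize',  'row', ...
                     'ColumnLabels', geneNamesClean(top100Idx));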
Good luck! 