Weizmann 2015 - Introduction to Matlab & Data Analysis
Final assignment: Analyzing experimental data

Tutor in charge of this HW: Anat Zimmer
E-mail for questions: if you have a question (others probably have the same question as well), please ask it in the forum and use the words "final assignment ..." in the subject. For personal issues: [email protected]

In this exercise you can choose between analyzing your own experimental data or the data attached.

General instructions:
- Please name your script "final_assignment_ID1_ID2.m".
- You should submit your script (one ".m" file) and your data file (in case you chose to analyze your own data). Please do not attach any other files.
- A program that crashes (for any reason) will earn its owners a failing grade.
- Pay attention to efficiency: do not use (nested) loops when they are not necessary.
- Use inner functions whenever you have a piece of code that performs a specific task.
- Write your code in a readable way: use proper documentation and indentation, avoid "magic numbers", give meaningful names to variables, and avoid too many spaces and overly long coding lines.

The goal of the assignment:
The purpose of this exercise is to write a program that reads (your own) experimental data and processes, analyzes and visualizes the results automatically. Attached you will find a file named "assignment_prototype.m"; it is a function which includes all the mandatory steps of this assignment. You should change the name of the function to "final_assignment_ID1_ID2.m" and implement all the steps listed under the remarks. You should implement each and every step, even if you analyze your own data. However, if you analyze your own data, you have the freedom to implement every step in a way that is relevant for your data, for example a proper normalization, relevant statistics to test your hypothesis, etc. Explanations about every step are detailed below. Please read all the steps (before you start writing the code) to make sure that you can implement them with your own data; if you cannot think of a way to implement any one of the steps with your own data, please use the provided data.

Explanation about the provided data:
The attached data is taken from the TCGA (The Cancer Genome Atlas) database (http://cancergenome.nih.gov/). The database is publicly available and contains many types of information on common types of cancer. The attached data is RNAseq data (20,501 genes) of 375 melanoma (skin cancer) tumor samples. RNAseq is a method to measure the amount of mRNA molecules of different genes in cells. For every one of the 375 samples, cells were taken from a patient (by biopsy or tumor removal) and their mRNA content was measured in an independent experiment. We would like to compare between these independent experiments, and therefore we will need to normalize the data, as explained below.

If you use your own data: please write in your code (under remarks) a short explanation of your data: what do you measure, and how? What is your hypothesis, i.e. what kind of effect are you trying to show in your experiments? Maximum of 5 sentences.

The steps of the program:

1. Load the data. You should read the data all at once (not line by line). If the reading procedure did not succeed, display the error message "could not open the input file: <the file name>" and return. You should read the data from the current directory.
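A minimal sketch of step 1, assuming the expression data is stored as a delimited text file in the current directory; the file name and variable names are placeholders, so adapt them to your own data and to the prototype function:

dataFile = 'melanoma_RNAseq.txt';   % placeholder file name - use the name of your own data file
fid = fopen(dataFile, 'r');
if fid == -1                        % the reading procedure did not succeed
    fprintf('could not open the input file: %s\n', dataFile);
    return;
end
fclose(fid);                        % fopen was used only to check that the file is accessible
rawData = importdata(dataFile);     % reads the whole file at once (not line by line)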
2. Data processing. In this part you must implement at least two of the three data-processing tasks, and all three if your data requires it. Justify your choice in the code under remarks.

a. Removing outliers. In this step we would like to remove from the data, in an unbiased way, any data point that can ruin (or not contribute to) the statistics of the data. One way to do this is to remove data points that are far from the data mean: calculate the standard deviation (std) of the data, set a threshold (for example T = 2*std), find all the data points that cross the threshold, and remove them. In this step visualization can be very useful (though it is not mandatory). After you clean the data you still want to be left with enough data points, so you need to find the optimal T. You can count the number of points before and after cleaning the data, or plot the data before and after the procedure and check that you removed exactly the outliers. You can also plot a histogram of the data values, see how the data spreads, and cut the top/bottom 5% of the data, for instance (cut the bins of data that are far away from the data mean). You can choose any way to remove outliers; justify your choice in the code and explain in a remark what you did and why. Since the purpose of the assignment is to help you analyze your experiments automatically, think of an iterative way to find the optimal T for every repetition of your experiment (for every run of the program with different data). If you chose to use the TCGA data: we would like to remove genes whose signal was under the detection limit of the method, so you should find all the genes that have zero values and remove them from the data.

b. Data normalization (if needed). In the case of the provided data, as explained above, each sample was measured in a separate experiment. Think of a situation where in one experiment the content of 10,000 cells was measured, while in another experiment the content of 30,000 cells was measured. The number of reads would be, on average, 3 times higher in the latter experiment. To compare between experiments (samples), you should calculate the average of every sample and divide the sample by its mean expression. If you use your own data, think about whether and how you should normalize it. For instance: if you measure protein levels before and after treatment, you may want to normalize your data to time zero (before the drug was added); if you compare between "wt" and other samples, you may want to normalize all the data to the "wt"; if your experimental data values span many orders of magnitude, you may want to take log(data); etc. Please justify your choice in the code under a remark.
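For the TCGA data, steps a-b could look roughly like the sketch below. It assumes the expression values are already in a genes-by-samples numeric matrix named expr, with a matching cell array of gene names geneNames (both variable names are assumptions, not part of the provided files):

zeroGenes = any(expr == 0, 2);                    % a: genes with zero values (under the detection limit)
expr      = expr(~zeroGenes, :);                  % remove those genes from the data
geneNames = geneNames(~zeroGenes);

sampleMean = mean(expr, 1);                       % b: mean expression of every sample (column)
exprNorm   = bsxfun(@rdivide, expr, sampleMean);  % divide every sample by its own mean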
c. Smooth and filter the data. As explained in class in tutorial 11, we usually want to filter the noise out of the real phenomenon. Think about what the frequencies of your measured phenomenon are and what the frequencies of the noise are; can you separate between the two and filter out the noise? (If your phenomenon is in the high frequencies, you need to filter out the low frequencies, and vice versa.) For example: you measure protein abundance in a cell that divides every 24h. Following cell division the protein accumulates in the cell and reaches its maximum after ~24h; the cell then divides and the protein amount decreases to half of its maximum. If, for instance, you measure the protein levels every hour and your measurements are noisy (as in fluorescence microscopy), you might want to use a moving average with a window size of +/-5 (for example). If you average over 12-24 time points, you will lose your entire signal; too small a window size might not filter out the noise. Estimate your noise and filter it out if needed, and justify your choice in the code under a remark. Later, in the visualization part, jittering lines are an indication that this step wasn't done properly. For the TCGA data you don't need to filter or smooth the data; please explain in your code why.

3. Calculate statistics for your data and visualize the results. Think about the type of data that you acquire and the effect that you are trying to show, and use statistical tools to support your hypothesis, such as: mean, median, min, max, std, correlation coefficient, curve fitting, clustering, etc. You can also use statistical tools/calculations that have not been shown in class, like PCA, MSD, ANOVA, enrichment, p-value, t-test, Akaike's information criterion, etc. If you analyze your own data you must use: mean or median, std, and at least one of the two: clustering or curve fitting. The more the merrier.

Visualization: you should choose at least two types of plots and visualize your data in the way that best shows your hypothesis.

If you analyze the TCGA data: we want to cluster the data into subgroups based on the mRNA profiles; however, analyzing 375 samples in ~20,000 dimensions is hard. Moreover, many of the genes do not change much, which makes it difficult to separate between the groups. Principal component analysis (PCA) is a good way to overcome the obstacle of too much information and reduce the dimensionality of the data to a few PCs that explain most of the variation of the data. But since PCA was not part of the course material, we will examine two other ways: a) clustering based on the most varying genes, and b) clustering based on the genes that have the highest mean expression across samples.

a) Clustering based on the most varying genes:
I. Calculate the std of the genes across samples, and sort them in descending order.
II. Check (and plot) whether there are correlations between the 10 most varying genes. Answer in the code under a remark: how many pairwise correlations did you find? Display the names of the correlated genes. Answer in the code under a remark: take a look at the correlated genes in GeneCards; does it make sense that these genes are correlated? Do they have a related function?
III. Select the 3 genes that vary the most and are not correlated.
IV. Find the optimal K between 3 and 10, and calculate the K-means clustering of the 375 samples with the 3 chosen genes.
V. Plot the K-means results. Do you think that the clustering is good? Why? Answer in the code under a remark.
VI. Plot the gene profile of every cluster: calculate the mean expression of the genes for every cluster, and create a bar plot of the 10 most highly expressed genes and the 10 most down-regulated genes (in the same plot; the values should be sorted). Don't forget an appropriate title, axis labels (the names of the genes under the bars on the x-axis) and appropriate fonts. Please put every two plots as subplots in a separate figure.
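A minimal sketch of part a, steps I-V, is given below. It again assumes the normalized genes-by-samples matrix exprNorm and the cell array geneNames from the processing step; choosing K by the mean silhouette width is one reasonable option rather than the required method, and all variable names are assumptions:

geneStd    = std(exprNorm, 0, 2);               % I: variability of every gene across the samples
[~, order] = sort(geneStd, 'descend');
top10      = order(1:10);

R = corrcoef(exprNorm(top10, :)');              % II: pairwise correlations of the 10 most varying genes
% inspect R (e.g. with imagesc(R)) and report the correlated pairs under a remark

chosenGenes = top10(1:3);                       % III: replace with the 3 most varying, uncorrelated genes
X = exprNorm(chosenGenes, :)';                  % samples-by-3 matrix used for the clustering

kRange  = 3:10;                                 % IV: pick K by the mean silhouette width
meanSil = zeros(size(kRange));
for i = 1:numel(kRange)
    idx        = kmeans(X, kRange(i), 'Replicates', 5);
    meanSil(i) = mean(silhouette(X, idx));
end
[~, best]  = max(meanSil);
bestK      = kRange(best);
clusterIdx = kmeans(X, bestK, 'Replicates', 5);

scatter3(X(:, 1), X(:, 2), X(:, 3), 20, clusterIdx, 'filled');   % V: plot the K-means results
xlabel(geneNames{chosenGenes(1)});
ylabel(geneNames{chosenGenes(2)});
zlabel(geneNames{chosenGenes(3)});
title(sprintf('K-means clustering of the samples, K = %d', bestK));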
b) Clustering based on the most expressed genes:
I. Calculate the mean of the genes across samples, and sort them in descending order.
II. Create a clustergram for the 375 samples with the 100 most expressed genes. A clustergram is a heat map (usually red-green) in which you can see the dendrogram (the cluster tree) and how the rows and columns cluster (for more information read the help). Let the rows be the samples and the columns the genes, use the 'correlation' metric for the row pdist, and standardize the rows (why? try running it without standardizing the rows and see what happens).
III. Do you think that the clustering is good? How can you tell? Into how many clusters would you divide the data?
IV. Cluster the data into 3 groups.
V. Plot the gene profile of every cluster: calculate the mean expression of the genes for every cluster, and create a bar plot of the 10 most highly expressed genes and the 10 most down-regulated genes (in the same plot; the values should be sorted). Don't forget an appropriate title, axis labels (the names of the genes under the bars) and appropriate fonts. Please put all three clusters as subplots in the same figure.
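A minimal sketch of part b, steps I-II, is given below. It assumes the Bioinformatics Toolbox (for clustergram), the normalized matrix exprNorm, and cell arrays geneNames and sampleNames for the labels (the variable names are assumptions):

geneMean   = mean(exprNorm, 2);                 % I: mean expression of every gene across the samples
[~, order] = sort(geneMean, 'descend');
top100     = order(1:100);

cg = clustergram(exprNorm(top100, :)', ...      % II: rows = samples, columns = genes
                 'RowLabels',    sampleNames, ...
                 'ColumnLabels', geneNames(top100), ...
                 'RowPDist',     'correlation', ...
                 'Standardize',  'row');

Good luck!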