HANDOUT NOTES FOR PROTEO-INFORMATICS C40
Marketa Zvelebil ([email protected])

PROTEOMICS stands for PROTEINS EXPRESSED BY A GENOME.
PROTEO-INFORMATICS means the use of computer technology to store, handle and analyse proteomics data.

PROTEOMICS USUALLY INVOLVES THE FOLLOWING STEPS
1. Running a 2DE gel.
2. Analysing the gel for interesting protein spots.
3. Cutting the spots out and subjecting them to Mass Spec for identification.

WHY DO PROTEOMICS?
1. One gene can give a number of protein products, so proteomics helps in the study of gene expression.
2. To assign function.
3. To measure many factors between types of cells, such as normal versus cancer.
4. Differential expression in different types of cells.
5. Differential expression over a time course and/or stimulation.
6. Differential expression induced by drugs/ligands.
7. Absence or presence of proteins in different cells.
8. To observe post-translational modification of proteins, such as glycosylation or phosphorylation.
9. Identification of interesting spots by Mass Spec.

WHAT'S A 2DE GEL
2-D electrophoresis is a widely used method for the analysis of complex protein mixtures extracted from cells, tissues or other biological samples. It sorts proteins according to 2 independent properties in 2 discrete steps:
1. The IEF (isoelectric focusing) step separates proteins based on their isoelectric points (pI/pH).
2. SDS-polyacrylamide gel electrophoresis (SDS-PAGE) separates proteins based on their molecular weight.

AFTER RUNNING THE GEL:
• Stain the gel
  1. Coomassie Blue
  2. Silver Stain
  3. Fluorescent dyes, with different ones on one gel (Cy2, Cy3, Cy5)
  4. Radioactive labelling
• Scan the gel

GEL IMAGE PREPARATION FOR ANALYSIS
1. Spot detection
2. Quantification (Quantitation)
3. Matching

DETECTION OF SPOTS
Mainly by
1. Edge detection methods such as the Laplacian
2. Pixel differentiation

QUANTIFICATION
Basically this is the measurement of spot intensity (or pixel intensity). The intensity is used to estimate the amount/volume of protein.
Problems arise because gel spots are detected by using dyes. Many gels are run with different stains, ranging from silver staining and fluorescent dyes to radioactive labelling. Different dyes give different intensities, and so does the use of different scanners.

MATCHING
To compare gels we first need to align them and then match the spots. Alignment of gels is usually performed by giving the program a number of spots as landmarks, on which the algorithm can base the rest of the matching process. Areas of gels can also be used as landmarks. In identical gels (gels from the same sample run at the same time) 95% of spots have been matched.
Problems: warping of gels, non-reproducibility, and algorithmic errors.

DATA ANALYSIS
Pre-processing of data
  Transformation of data
  Reducing the amount of data
Similarity measure for clustering
  Euclidean distance
  Pearson correlation coefficient
Clustering
  Finding a predetermined number of clusters
    Hierarchical clustering
    SOM-based clustering
    Evolutionary clustering algorithms
  Determining the number of clusters as well
    Self-organizing tree algorithm, SOTA
  Two-way clustering
Statistical analysis - the Student's t-test
Differential mapping

PRE-PROCESSING OF DATA
Transformation of data
Transformations can be used to deal with problems that can occur in the analysis of expression data (such as systematic errors from experimental procedures). Ratios of measurements also often require a transformation before analysis (this is used very much in gene expression). If statistical tests are to be performed, these make certain assumptions about the data (such as a normal distribution), which are more likely to be correct after particular transformations.
The main transformations are converting pixel density to measures such as volumes, and changing measurements by taking logarithms (for ratio data and to obtain a normal-like distribution).

Reducing the amount of data
When looking at a large set of gels or gene expression data, the data can become very complex and it may be necessary to simplify the data before clustering or differential analysis so that significant features are revealed. For example, when dealing with data from many patients we may want to divide the patients into different groups so that differences due to the heterogeneity of patients can be minimised.

SIMILARITY MEASURE
Suppose we have m different properties measured for each of the N objects, so that the ith object has measurement $X_{ik}$ for the kth property. We need a way of obtaining a number (or distance) that describes the difference between object i and object j as given by the measured properties. There are many measures available; the two most commonly used are described here.

Simple Euclidean distance:

$$ D_{ij} = \left( \sum_{k=1}^{m} (X_{ik} - X_{jk})^2 \right)^{1/2} $$

Another distance measure often used for expression data is the Pearson correlation coefficient between any two series of numbers $X = \{X_1, X_2, \ldots, X_N\}$ and $Y = \{Y_1, Y_2, \ldots, Y_N\}$. This is defined as:

$$ r_{XY} = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{X_i - \bar{X}}{\sigma_X} \right) \left( \frac{Y_i - \bar{Y}}{\sigma_Y} \right) $$

where $\bar{X}$ is the average of the values in X and $\sigma_X$ is the standard deviation of these values, and similarly for Y.

Unlike the Euclidean distance, the Pearson correlation measures distance in terms of the shape of the pattern, and not its size. Therefore the Pearson correlation will identify two protein features as similar if their expression shape is similar, even if their absolute expression is different.

NOTE: Trees obtained from clustering based on different distance measures can, and often will, give different results.

CLUSTERING
A common method of analysing expression data is to group protein expressions by similarity.
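The two similarity measures defined under SIMILARITY MEASURE can be sketched in a few lines of Python. This is only an illustration: the function names and the two expression profiles are hypothetical, and the standard deviation is assumed to be the population form (dividing by N), which is what makes the formula above equal to the usual Pearson coefficient.

```python
import math


def euclidean_distance(x, y):
    """Square root of the summed squared differences over all properties."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def pearson_correlation(x, y):
    """Mean product of standardised values (population standard deviation assumed)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return cross / (n * sd_x * sd_y)


# Hypothetical profiles: identical shape, tenfold difference in absolute level.
low = [1.0, 2.0, 3.0, 2.0]
high = [10.0, 20.0, 30.0, 20.0]

d = euclidean_distance(low, high)   # large, because the absolute levels differ
r = pearson_correlation(low, high)  # close to 1, because the shapes match
print(d, r)
```

The two profiles are far apart by Euclidean distance yet perfectly correlated, which is the "shape, not size" behaviour described for the Pearson measure.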
Techniques of this kind are referred to as clustering. The clusters obtained may be related to each other, usually in a hierarchical tree structure.

[Figure: hierarchical tree of expressions 1, 4, 3, 2 and 5; axes: distance, expression.]

When you start clustering it is not known into how many different groups the expressions should be divided. Some methods are more capable than others of determining an appropriate number of groups into which to put the objects. Therefore the choice of which algorithm to use is important and non-trivial, as it can have a profound effect on the interpretation of the results. For some methods you, as the user, have to specify how many groups (or clusters) you desire. Other methods choose the number of groups automatically.

METHODS OF CLUSTERING
There are a number of methods that can be chosen for the actual clustering, which are based on how distances are measured between clusters (not to be confused with the distance measure between objects, such as the Pearson correlation). The criteria used in the clustering methods differ, and hence different classifications may be obtained for the same data, even if the same distance measure is used.

Suppose that two clusters have been determined up to this point in the algorithm. The distance between these clusters must be defined, but there are several possible ways to do this. The most common methods to define the distance between any two clusters are single linkage clustering, which uses the minimum, and complete linkage clustering, which uses the maximum, of the distances between all possible pairs of objects in the two clusters.

$$ d(C_A, C_B) = \min_{i,j} d(C_{Ai}, C_{Bj}) $$   Single linkage clustering

$$ d(C_A, C_B) = \max_{i,j} d(C_{Ai}, C_{Bj}) $$   Complete linkage clustering

where $d(C_A, C_B)$ is the distance between two clusters $C_A$ and $C_B$ consisting of data points $C_{Ai}$ and $C_{Bj}$ respectively, and $d(C_{Ai}, C_{Bj})$ is the distance between data points according to the similarity measure.
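The single and complete linkage definitions above can be illustrated with a minimal Python sketch. The function names and the two small clusters are hypothetical, and Euclidean distance is assumed as the underlying similarity measure; any other point-to-point distance could be passed in instead.

```python
import math


def euclidean(x, y):
    """Point-to-point distance used as the similarity measure (an assumption)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def single_linkage(cluster_a, cluster_b, dist=euclidean):
    """Minimum distance over all pairs with one point from each cluster."""
    return min(dist(p, q) for p in cluster_a for q in cluster_b)


def complete_linkage(cluster_a, cluster_b, dist=euclidean):
    """Maximum distance over all pairs with one point from each cluster."""
    return max(dist(p, q) for p in cluster_a for q in cluster_b)


# Hypothetical clusters of two-dimensional data points.
cluster_a = [(0.0, 0.0), (1.0, 0.0)]
cluster_b = [(4.0, 0.0), (6.0, 0.0)]

d_single = single_linkage(cluster_a, cluster_b)
d_complete = complete_linkage(cluster_a, cluster_b)
print(d_single, d_complete)
```

For the same pair of clusters the two criteria give different distances, which is why different classifications can result even when the same distance measure is used.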
Another very common definition for cluster distances is the (average) centroid method, where the distances or similarities are calculated between the centroids of the clusters.

$$ d(C_A, C_B) = d(\bar{C}_A, \bar{C}_B) $$   Average clustering

where $\bar{C}_A$ is the mean of cluster A and $\bar{C}_B$ is the mean of cluster B.

These definitions of cluster distance are general, but they are widely used only in hierarchical clustering methods.

IDEALISED EXPRESSION PATTERNS
These patterns are reflected in cluster outputs such as those obtained with K-means clustering (not covered) and SOMs (covered).

[Figure: idealised expression patterns plotted over points 1 to 8. Panels: Abrupt Up/Down; High/Low Constant; Up/Down Transient; Smooth Up/Down.]

SOM (SELF ORGANISING MAP)

[Figure: SOM nodes, numbered 1 to 6, and data points.]

The nodes are organised in a single topology (e.g. a 2 x 3 2D grid of nodes). During the training of the SOM to the data, the individual nodes move in the m-space such that they are associated with a set of similar data points. The end result is as many clusters as there are SOM nodes, with the members of a cluster defined as those points for which a particular node is the nearest.

SOTA - SELF-ORGANIZING TREE ALGORITHM
In SOM we have to define the number of groups (clusters) a priori. SOTA does not need that. This method is a combination of the Kohonen networks used in SOM, which allow network nodes to move in response to the data, and a technique to selectively expand the number of nodes. Each protein is represented by a vector of expression measurements, and each network node has an identical structure. The distance between a protein expression vector $g_i$ and a network node $c_j$ can be defined by any of the measures discussed previously, and will be written $d_{g_i c_j}$. The SOTA network differs from that used in SOM in being hierarchical, with each internal node being the ancestor of two daughter nodes.
The external nodes are called cells, and only cells and their direct ancestors can be modified (adapted) in the further training of the network. The initial SOTA network consists of three nodes whose vectors are set to the average of all the data. The algorithm consists of a set of alternating steps: first adapting the cell(s) in a similar manner to SOM, and then selecting a cell to be extended into two (initially identical) daughter cells, as a result of which the original cell becomes an internal node. During adaptation, proteins become associated with the nearest cell. The average distance of the associated proteins from this cell, $c_j$, is referred to as the resource, $R_j$, i.e.

$$ R_j = \frac{\sum_{i=1}^{n_j} d_{g_i c_j}}{n_j} $$

where $n_j$ is the number of proteins associated with cell $c_j$. This is the measure used to determine which cell is to be used to generate two new daughter cells. By choosing a threshold for the resource below which this process will not occur, the SOTA network will evolve into only as many nodes (clusters) as are needed to reduce cluster heterogeneity below this limit. If the threshold chosen is zero, the network will continue to evolve until every cell contains just one protein expression. At this point all the cell-protein distances, and hence the resources, will be zero.

During the cell adaptation, all the proteins are compared one at a time with all the cells. (Note: not the internal nodes!) The set of proteins is 'presented' several times. In the $\tau$th presentation the closest cell (winning cell) $c_j$ is moved nearer to the protein $g_i$ by the formula

$$ c_j(\tau + 1) = c_j(\tau) + \eta \, (g_i - c_j(\tau)) $$

where $\eta$ is a small constant, typically 0.01. If the sister cell of $c_j$ is not an internal node, both it and its direct ancestor node are moved closer to $g_i$, but smaller values of $\eta$ are used so that the effect is less than for cell $c_j$. Typical values of $\eta$ for the sister and ancestor nodes are 0.005 and 0.001 respectively.
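One presentation step of the cell adaptation, together with the resource, can be sketched as follows. This is a simplified illustration, not a full SOTA implementation: the function names and example vectors are hypothetical, Euclidean distance is assumed, and the movement of the sister and ancestor nodes and the growth of the tree are left out.

```python
import math


def distance(x, y):
    """Euclidean distance between a protein vector and a node (an assumption)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def nearest_cell(protein, cells):
    """Index of the winning cell, i.e. the cell closest to the protein vector."""
    return min(range(len(cells)), key=lambda j: distance(protein, cells[j]))


def adapt(node, protein, eta):
    """Move a node a fraction eta of the way towards the protein vector."""
    return [c + eta * (g - c) for c, g in zip(node, protein)]


def resource(cell, proteins):
    """Average distance from a cell to the proteins associated with it."""
    return sum(distance(g, cell) for g in proteins) / len(proteins)


# Hypothetical cells and a single presented protein expression vector.
cells = [[1.0, 1.0], [5.0, 5.0]]
protein = [2.0, 3.0]

winner = nearest_cell(protein, cells)
cells[winner] = adapt(cells[winner], protein, eta=0.01)

# Resource of a hypothetical cell with two associated proteins.
r = resource([1.0, 0.0], [[0.0, 0.0], [2.0, 0.0]])
print(winner, cells[winner], r)
```

In a fuller version the same `adapt` function would be called again with the smaller values of eta for the sister and ancestor nodes, and the cell with the largest resource would be the one split into two daughter cells.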
The sum of the resources of all adapting cells is monitored and used to determine when to end presentation and grow the network by making daughter cells from the cell with the largest resource. The result of applying SOTA to a protein expression dataset is a hierarchical tree of clusters, with each cluster having a limited degree of heterogeneity. The node values define the averages of the cluster(s) to which they relate.

For the diagram below:
• Initially three nodes are used, whose vectors are the average of all the data.
• The nodes at the edge are called cells.
• The initial cells are extended into daughter cells.
• Each protein becomes associated with its nearest cell (the winning cell).
• The winning cell is moved towards the protein data by the above equation (light arrows).
• If the sister of the winning cell is not an internal node, then the sister and parent move as well (dark arrows).

[Figure: SOTA network diagram.]

TWO-WAY CLUSTERING
There are many occasions when it is useful to cluster the samples, instead of or as well as the protein expressions. One example of this is when looking at tissue samples. In a study of tissue samples obtained from patients with a specific medical condition, the samples are classified according to a medical diagnosis based on clinical symptoms and tests. However, there is large heterogeneity between patients (age, habits (e.g. smoking/non-smoking), living conditions, other disease conditions, etc.), which affects the samples and may potentially mislead the analysis. Clustering the samples can confirm the classifications (i.e. hopefully different sample types will form separate clusters). It is useful to do this before studying their molecular features.

DIFFERENTIAL EXPRESSION ANALYSIS
We often want to know which genes or proteins are differentially expressed, as this should indicate the underlying processes that distinguish between the samples, for example the genes activated at a specific point in the cell cycle.
The simplest technique makes these assignments by looking at straight expression ratios. If the ratio exceeds a threshold (e.g. 2), the corresponding gene/protein is assessed to be differentially expressed.
• Differential expression over a time course and/or stimulation
• Differential expression induced by drugs/ligands
• Observe post-translational modification of proteins, such as glycosylation or phosphorylation

STUDENT'S T-TEST
In a similar way, we often want to know if two samples are significantly different. This involves comparing the expression levels of a large number of genes/proteins for the two samples. Statistical tests should be employed to quantify the sample difference. To analyse which spots are significantly different between two sets of gels, the Student's t-test can be used. Because the test takes into account the variance of the measurements in each set of experiments, the larger the variance, the larger the difference between the means needs to be for it to be significant.

$$ t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2} \left( \dfrac{1}{n_1} + \dfrac{1}{n_2} \right)}} $$

where $n_1$ and $n_2$ represent the number of independent experiments (gels, samples) in each set, and $s_1$ and $s_2$ are the standard deviations of the two distributions. However, in most expression analysis cases $n_1 = n_2 = n$, resulting in the simpler formula

$$ t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\dfrac{s_1^2 + s_2^2}{n}}} $$

The top part of the equation is the difference between the two means (or averages). The bottom part is a measure of the variability (or dispersion) of the scores. The calculated t-value is compared to a t-distribution. To do this, two further quantities must be defined: the significance level and the number of degrees of freedom. Significant t-values are found at the tail ends of the distribution. Note that a distinction is made between testing whether the measurements are different (two-tailed test) and specifying that one measurement is, say, lower than the other (one-tailed test). Here we are interested in the two-tailed test.
The further the calculated t-statistic is from zero, the more likely it is that the two measurements are statistically significantly different. A significance level is set to define the false positive rate that can be tolerated. A false positive here refers to incorrectly deducing that the two measurements are different when they are not, which is called a type I error. The rate, often called $\alpha$, is typically set at 5% (or 1%), which means that in five (or one) out of a hundred such tests a statistically significant difference between the means is reported even if there is none. The threshold t-statistic value is that for which the area of the tails is the percentage $\alpha$ of the area of the complete curve.

The number of degrees of freedom (df) for a particular calculation is defined as the sum of the number of experiments in both groups minus two ($n_1 + n_2 - 2$). Once the $\alpha$ level, df, and the t-value are available, reference to a standard t-test table will determine whether the t-value is large enough to be significant. This analysis is done automatically as part of many protein expression programs.

We will now work through a simple t-test for an experiment in which there are eight gels in two groups of four, one group the control ($C_i$) and the other the treated samples ($T_i$). The measured spot volume for the same protein feature in each gel is given in the table below. To see whether this protein feature changes significantly between the groups we have to analyse the following data:

Gel     Control          Gel     Treated
C1      0.0766           T1      0.1138
C2      0.0644           T2      0.0981
C3      0.0602           T3      0.0971
C4      0.1035           T4      0.1058
Mean    0.076175         Mean    0.1037
SD      0.019499         SD      0.007775

where the means are $\bar{X}_1 = \bar{C}_i = 0.076175$ and $\bar{X}_2 = \bar{T}_i = 0.1037$, and the standard deviations are $s_1 = 0.019499$ and $s_2 = 0.007775$.

$$ t = \frac{0.076175 - 0.1037}{\sqrt{\dfrac{0.019499^2 + 0.007775^2}{4}}} = \frac{-0.027525}{0.010496} \approx -2.622 $$

The magnitude of this value, 2.622, must be compared with a t-test table, which for df = 6 and $\alpha$ = 5% (two-tailed) gives a critical value of 2.447.
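The worked example can be checked numerically with the equal-n form of the t formula. The sketch below uses only the Python standard library; the function and variable names are hypothetical, and the critical value is simply the tabulated figure of 2.447 quoted for df = 6 at the 5% level rather than something computed here.

```python
import math
import statistics

# Spot volumes from the table: four control gels and four treated gels.
control = [0.0766, 0.0644, 0.0602, 0.1035]
treated = [0.1138, 0.0981, 0.0971, 0.1058]


def t_equal_n(a, b):
    """Student's t for two groups of equal size n, using sample variances."""
    n = len(a)
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    var_a, var_b = statistics.variance(a), statistics.variance(b)  # divides by n - 1
    return (mean_a - mean_b) / math.sqrt((var_a + var_b) / n)


t = t_equal_n(control, treated)
df = len(control) + len(treated) - 2

CRITICAL_T = 2.447  # tabulated two-tailed value for df = 6, alpha = 5%
significant = abs(t) > CRITICAL_T
print(t, df, significant)
```

The comparison uses the absolute value of t because the test is two-tailed: the sign only records which group has the larger mean.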
As the magnitude of our calculated value is greater than this critical value, the difference is significant at the 5% level.

REVIEW SLIDE
The image analysis
  Spot detection
  Quantification
  Matching
Data storage (input/output)
  Administrative
  Analytical
Data analysis
  Statistical
  Others
Data integration
Data mining

Pre-processing of data
  Transformation of data
  Reducing the amount of data
Similarity measure
  Euclidean distance
  Pearson correlation coefficient
Clustering
  Finding a predetermined number of clusters
    Hierarchical clustering
    SOM-based clustering
    Evolutionary clustering algorithms
  Determining the number of clusters as well
    Self-organizing tree algorithm, SOTA
  Two-way clustering
Statistical analysis - the Student's t-test
Differential mapping

MAIN STEPS IN PROTEOMICS
• Run 2D gel
• Dye gel
• Scan gel
• Detect spots
• Quantify
• Match spots
• Calculate differentials (& statistics)
• Choose spots for Mass Spec
• Cut chosen spots
• Digest and run Mass Spec (e.g. MALDI MS)
• Analyse and save Mass Spec data
• Identify protein based on MS data
• Data mining on identified protein

http://www-lmmb.ncifcrf.gov/flicker/
http://www.expasy.ch/melanie/melanie-top.html
http://www-lmmb.ncifcrf.gov/2dwgDB/
http://www.harefield.nthames.nhs.uk/nhli/protein/index.html
http://www.lsbc.com/
