HANDOUT NOTES FOR PROTEO-INFORMATICS C40
Marketa Zvelebil ([email protected])
PROTEOMICS stands for PROTEINS EXPRESSED BY A GENOME
PROTEO-INFORMATICS means the use of computer technology to store, handle and
analyse proteomics data.
PROTEOMICS USUALLY INVOLVES THE FOLLOWING STEPS
1. running a 2DE gel
2. analysing the gel for interesting protein-spots
3. cutting them out and subjecting them to Mass Spec for identification.
WHY DO PROTEOMICS?
1. One gene can give a number of protein products, so proteomics helps in the study
of gene expression
2. To assign function.
3. Measure many factors between types of cells such as normal versus
cancer.
4. Differential expression in different types of cells
5. Differential expression over a time course and/or stimulation
6. Differential expression induced by drugs/ligands
7. Absence or presence of proteins in different cells
8. Observe post-translational modification of proteins such as
glycosylation or phosphorylation.
9. Identification of interesting spots by Mass Spec.
WHAT’S A 2DE GEL
2-D electrophoresis is a widely used method for the analysis of complex protein mixtures
extracted from cells, tissues or other biological samples.
It sorts proteins according to 2 independent properties in 2 discrete steps:
1. IEF (isoelectric focusing) step separates proteins based on their isoelectric points
(pI/pH)
2. SDS-polyacrylamide (SDS-PAGE) separates proteins based on their molecular
weight.
AFTER RUNNING THE GEL:
• Stain the gel
1. Coomassie Blue
2. Silver Stain
3. Fluorescent dyes - different ones on one gel (Cy2, Cy3, Cy5)
4. Radioactive labelling
• Scan the gel
GEL IMAGE PREPARATION FOR ANALYSIS
1. Spot detection
2. Quantification (Quantitation)
3. Matching
DETECTION OF SPOTS
Mainly by
1. Edge detection methods such as Laplacian
2. Pixel differentiation
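As a rough illustration of the Laplacian idea, the sketch below applies the standard 4-neighbour discrete Laplacian to a tiny hypothetical image. The image values and the function name `laplacian` are assumptions made for this example only; real gel analysis packages use more elaborate detection schemes.

```python
def laplacian(image):
    """Apply the 4-neighbour discrete Laplacian to the interior pixels of a 2D list."""
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            out[r][c] = (image[r - 1][c] + image[r + 1][c]
                         + image[r][c - 1] + image[r][c + 1]
                         - 4 * image[r][c])
    return out

# Hypothetical 5 x 5 image with a single bright "spot" in the centre.
image = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 9, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
response = laplacian(image)
for row in response:
    print(row)
```

The response is strongly negative at the spot centre and positive on its immediate neighbours, which is the sign change that a Laplacian-based detector looks for.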
QUANTIFICATION
Basically this is the measurement of spot intensity (or pixel intensity).
The intensity is used to estimate the amount/volume of protein.
Problems arise because gel spots are detected by using dyes.
Gels are run with many different detection methods, ranging from silver staining and
fluorescent dyes to radioactive labelling. Different dyes give different intensities.
Using a different scanner also gives different intensities.
MATCHING
To compare gels we need to first align them and then match the spots.
Alignment of gels is usually performed by giving the program a number of spots as
landmarks from which the algorithm can base the rest of the matching process.
Areas of gels can also be used as landmarks.
In identical gels (gels from same sample run at the same time) 95% of spots have been
matched.
Problems: Warping of gels, non-reproducibility, and algorithmic errors.
DATA ANALYSIS
Pre-processing of data
  Transformation of data
  Reducing the amount of data
Similarity measure for clustering
  Euclidean distance
  Pearson correlation coefficient
Clustering
  Finding a predetermined number of clusters
    Hierarchical clustering
    SOM based clustering
    Evolutionary clustering algorithms
  Determining the number of clusters as well
    Self-organizing tree algorithm, SOTA
  Two-way clustering
Statistical analysis - Student's t-test
Differential mapping
PRE-PROCESSING OF DATA
Transformation of data
Transformations can be used to deal with problems that can occur in the analysis of
expression data, such as systematic errors from experimental procedures. Ratios of
measurements also often require a transformation before analysis (this is very common
in gene expression work). If statistical tests are to be performed, these make certain
assumptions about the data (such as a normal distribution), which are more likely to
hold after particular transformations. The main transformations are converting pixel
density to measures such as volumes, and taking logarithms of the measurements (for
ratio data and to obtain a normal-like distribution).
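A minimal sketch of the logarithmic transformation of ratio data, assuming paired, strictly positive spot volumes. The volumes and the helper name `log2_ratios` are hypothetical, and base 2 is simply a convenient choice.

```python
import math

def log2_ratios(treated, control):
    """Return log2(treated / control) for paired, strictly positive spot volumes."""
    return [math.log2(t / c) for t, c in zip(treated, control)]

# Hypothetical spot volumes for three protein spots (illustrative only).
control = [0.10, 0.20, 0.40]
treated = [0.20, 0.20, 0.10]

log_ratios = log2_ratios(treated, control)
print(log_ratios)
```

On this scale a two-fold increase maps to +1, no change to 0 and a four-fold decrease to -2, so increases and decreases become symmetric around zero.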
Reducing the amount of data
When looking at a large set of gels or gene expression data, the data can become very
complex and it may be necessary to simplify the data before clustering or differential
analysis so that significant features are revealed. For example when dealing with data
from many patients we may want to divide the patients into different groups so that
differences due to the heterogeneity of patients can be minimised.
SIMILARITY MEASURE
Suppose we have m different properties measured for each of the N objects, so that the
ith object has measurement X_ik for the kth property.
We need a way of obtaining a number (a distance) that describes the
difference between object i and object j as given by the measured properties.
There are many measures available; the two most commonly used are given here.
Simple EUCLIDEAN distance
\[ D_{ij} = \left[ \sum_{k=1}^{m} (X_{ik} - X_{jk})^2 \right]^{1/2} \]
Another measure often used for expression data is the Pearson correlation
coefficient between any two series of numbers X = {X1, X2, …, XN} and Y = {Y1,
Y2, …, YN}. This is defined as:
\[ r_{XY} = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{X_i - \bar{X}}{\sigma_X} \right) \left( \frac{Y_i - \bar{Y}}{\sigma_Y} \right) \]
where \bar{X} is the average of the values in X and \sigma_X is the standard deviation of these values,
and similarly for Y. Unlike the Euclidean distance, the Pearson correlation measures
similarity in terms of the shape of the pattern, and not its size. Therefore the Pearson
correlation will identify two protein features as similar if their expression-shape is
similar even if their absolute expression is different.
NOTE: A Tree obtained from a clustering based on different distance measures can
and often will give different results.
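The difference between the two measures can be sketched as follows. The two profiles are hypothetical, and the standard deviations are taken as population values (dividing by N), which is an assumption chosen so that the 1/N factor in the correlation formula gives values between -1 and 1.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two equally long series."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    """Pearson correlation coefficient using population standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / n)
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / n)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    return cov / (sd_x * sd_y)

# Two hypothetical expression profiles with the same shape but different size.
profile_a = [1.0, 2.0, 3.0, 4.0]
profile_b = [10.0, 20.0, 30.0, 40.0]

distance = euclidean(profile_a, profile_b)
correlation = pearson(profile_a, profile_b)
print(distance, correlation)
```

The Euclidean distance between the two profiles is large, while the correlation is essentially 1, because only the shape of the pattern matters to the Pearson measure.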
CLUSTERING
A common method of analyzing expression data is to group protein expressions by
similarity. These techniques are referred to as clustering.
The clusters obtained may be related to each other, usually in a hierarchical tree
structure.
[Figure: hierarchical tree (dendrogram) joining expressions 1, 4, 3, 2 and 5; axes: distance, expression]
When you start clustering it is not known into how many different groups the
expressions should be divided. Some methods are more capable than others of
determining an appropriate number of groups into which to put the objects. Therefore
the choice of which algorithm to use is important and non-trivial, as it can have a
profound effect on the interpretation of the results. For some methods you, as the
user, have to specify how many groups (or clusters) you desire. Other methods choose
the number of groups automatically.
METHODS OF CLUSTERING
There are a number of methods that can be chosen for the actual clustering which are
based on how distances are measured between clusters (not to be confused with the
distance measure such as the Pearson correlation). The criteria used in the clustering
methods differ and hence different classifications may be obtained for the same data,
even if the same distance measure is used.
Suppose that two clusters have been determined up to this point in the algorithm. The
distance between these clusters must be defined, but there are several possible ways to
do this. The most common definitions are single linkage clustering, which uses the
minimum, and complete linkage clustering, which uses the maximum, of the distances
between all possible pairs of objects in the two clusters.
\[ d(C_A, C_B) = \min_{i,j} d(C_{Ai}, C_{Bj}) \]   Single linkage clustering
\[ d(C_A, C_B) = \max_{i,j} d(C_{Ai}, C_{Bj}) \]   Complete linkage clustering
where d(C_A, C_B) is the distance between two clusters C_A and C_B consisting of data
points C_{Ai} and C_{Bj} respectively, and d(C_{Ai}, C_{Bj}) is the distance between data points
according to the similarity measure.
Another very common definition for cluster distances is the (average) centroid
method, where the distances or similarities are calculated between the centroids of the
clusters.
\[ d(C_A, C_B) = d(\bar{C}_A, \bar{C}_B) \]   Average clustering
where \bar{C}_A is the mean of cluster A and \bar{C}_B is the mean of cluster B. These
definitions of cluster distance are general, but they are widely used only in hierarchical
clustering methods.
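The three cluster-distance definitions can be compared on a small example. This is only a sketch: the two clusters are hypothetical, and the Euclidean distance is assumed as the underlying measure between data points.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def single_linkage(a, b, dist=euclidean):
    """Minimum distance over all pairs of points, one from each cluster."""
    return min(dist(p, q) for p in a for q in b)

def complete_linkage(a, b, dist=euclidean):
    """Maximum distance over all pairs of points, one from each cluster."""
    return max(dist(p, q) for p in a for q in b)

def centroid(cluster):
    """Coordinate-wise mean of the points in a cluster."""
    n = len(cluster)
    return [sum(col) / n for col in zip(*cluster)]

def centroid_linkage(a, b, dist=euclidean):
    """Distance between the two cluster means."""
    return dist(centroid(a), centroid(b))

# Two hypothetical clusters of two-dimensional data points.
cluster_a = [[0.0, 0.0], [1.0, 0.0]]
cluster_b = [[3.0, 0.0], [5.0, 0.0]]

d_single = single_linkage(cluster_a, cluster_b)
d_complete = complete_linkage(cluster_a, cluster_b)
d_centroid = centroid_linkage(cluster_a, cluster_b)
print(d_single, d_complete, d_centroid)
```

For the same pair of clusters the three definitions give 2, 5 and 3.5, which is why different linkage choices can produce different trees from the same data and the same distance measure.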
IDEALISED EXPRESSION PATTERNS
These patterns are reflected in cluster outputs such as obtained with K-means
clustering (not covered) and SOMs (covered).
[Figure: four panels of idealised expression patterns: Abrupt Up / Down, High / Low Constant, Up / Down Transient, Smooth Up / Down]
SOM (SELF ORGANISING MAP)
[Figure: SOM nodes (numbered 1 to 6) positioned among the data points]
The nodes are organised in a single topology (e.g. a 2 x 3 2D grid of nodes). During
the training of the SOM on the data, the individual nodes move in the m-space such
that each becomes associated with a set of similar data points. The end result is as many
clusters as there are SOM nodes, with the members of a cluster defined as those points
for which a particular node is the nearest.
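A very small SOM can be sketched as below. This is an illustration only, not the procedure of any particular package: the Gaussian neighbourhood, the linearly decaying learning rate and radius, the random initialisation and all parameter values are assumptions, and the data points are hypothetical.

```python
import math
import random

def nearest_node(nodes, x):
    """Index of the node closest to data point x (squared Euclidean distance)."""
    return min(range(len(nodes)),
               key=lambda k: sum((a - b) ** 2 for a, b in zip(nodes[k], x)))

def train_som(data, rows=2, cols=3, epochs=50, lr0=0.5, radius0=1.5, seed=0):
    """Train a rows x cols SOM and return the node vectors."""
    rng = random.Random(seed)
    dim = len(data[0])
    grid = [(r, c) for r in range(rows) for c in range(cols)]
    # Assumption: each node starts at a randomly chosen data point.
    nodes = [list(rng.choice(data)) for _ in grid]
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs
        lr = lr0 * frac                    # learning rate decays linearly
        radius = max(radius0 * frac, 0.1)  # neighbourhood shrinks over time
        order = list(data)
        rng.shuffle(order)
        for x in order:
            win = nearest_node(nodes, x)
            for k, (r, c) in enumerate(grid):
                grid_d2 = (r - grid[win][0]) ** 2 + (c - grid[win][1]) ** 2
                h = math.exp(-grid_d2 / (2 * radius ** 2))
                for d in range(dim):
                    # Move node k towards x, scaled by its grid distance to the winner.
                    nodes[k][d] += lr * h * (x[d] - nodes[k][d])
    return nodes

# Hypothetical two-dimensional data forming three loose groups.
data = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0], [9.9, 0.0], [10.0, 0.1]]
nodes = train_som(data)
labels = [nearest_node(nodes, x) for x in data]
print(labels)
```

Each data point receives the index of its nearest node as its cluster label, so there can be at most as many clusters as nodes (six for a 2 x 3 grid).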
SOTA - SELF-ORGANIZING TREE ALGORITHM
In SOM we have to define the number of groups (clusters) a priori. SOTA does not
need that.
This method is a combination of the Kohonen networks as used in SOM which allows
network nodes to move in response to the data, and a technique to selectively expand
the number of nodes. Each protein is represented by a vector of expression
measurements, and each network node has an identical structure. The distance
between a protein expression vector gi and a network node cj can be defined by any
of the measures discussed previously, and will be written d_{g_i c_j}. The SOTA network
differs from that used in SOM in being hierarchical, with each internal node being the
ancestor of two daughter nodes. The external nodes are called cells, and only cells and
their direct ancestors can be modified (adapted) in the further training of the network.
The initial SOTA network consists of three nodes whose vectors are set to the average
of all the data. The algorithm consists of a set of alternating steps, firstly adapting the
cell(s) in a similar manner to SOM, and then selecting a cell to be extended into two
(initially identical) daughter cells, as a result of which the original cell becomes an
internal node. During adaptations, proteins become associated with the nearest cell.
The average distance of associated proteins from this cell, cj, is referred to as the
resource, Rj, i.e.
\[ R_j = \frac{1}{n_j} \sum_{i=1}^{n_j} d_{g_i c_j} \]
where nj is the number of proteins associated with cell cj. This is the measure used to
determine which cell is to be used to generate two new daughter cells. By choosing a
threshold for the resource below which this process will not occur, the SOTA network
will evolve into only as many nodes (clusters) as are needed to reduce cluster
heterogeneity below this limit. If the threshold chosen is zero, the network will
continue to evolve until every node contains just one protein expression. At this point
all the cell-protein distances, and hence the resources will be zero.
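The resource calculation and the choice of the cell to split can be sketched as follows. The cells, the one-dimensional protein vectors and the function names are hypothetical, and the Euclidean distance is assumed as the distance measure.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def assign_to_cells(proteins, cells, dist=euclidean):
    """Map each cell index to the proteins for which that cell is the nearest."""
    members = {j: [] for j in range(len(cells))}
    for g in proteins:
        j = min(range(len(cells)), key=lambda k: dist(g, cells[k]))
        members[j].append(g)
    return members

def resource(cell, associated, dist=euclidean):
    """Average distance of the associated proteins from the cell (0 if none)."""
    if not associated:
        return 0.0
    return sum(dist(g, cell) for g in associated) / len(associated)

def cell_to_split(proteins, cells, threshold):
    """Index of the cell with the largest resource, or None if it is not above the threshold."""
    members = assign_to_cells(proteins, cells)
    resources = {j: resource(cells[j], members[j]) for j in members}
    j_max = max(resources, key=resources.get)
    return j_max if resources[j_max] > threshold else None

# Hypothetical cells and protein expression vectors (one measurement each).
cells = [[0.0], [10.0]]
proteins = [[0.0], [1.0], [9.0], [10.0], [13.0]]

members = assign_to_cells(proteins, cells)
r0 = resource(cells[0], members[0])
r1 = resource(cells[1], members[1])
print(r0, r1, cell_to_split(proteins, cells, threshold=1.0))
```

With a threshold of 1.0 the second cell (resource about 1.33) would be split; raising the threshold to 2.0 stops the growth, since no cell is heterogeneous enough.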
During the cell adaptation, all the proteins are compared one at a time with all the
cells. (Note: not the internal nodes!) The set of proteins is ‘presented’ several times. In
the \tau-th presentation the closest cell (winning cell) cj is moved nearer to this protein gi
by the formula
\[ c_j(\tau + 1) = c_j(\tau) + \eta \, (g_i - c_j(\tau)) \]
where \eta is a small constant, typically 0.01. If the sister cell of cj is not an internal
node, both it and its direct ancestor node are moved closer to gi, but smaller values of
\eta are used so that the effect is less than for cell cj. Typical values of \eta for the sister
and ancestor nodes are 0.005 and 0.001 respectively. The sum of the resources of all
adapting cells is monitored and used to determine when to end presentation and grow
the network by making daughter cells from the cell with the largest resource.
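A single adaptation step can be sketched as below, assuming the update moves a node a small fraction of the way towards the presented protein, with step sizes 0.01, 0.005 and 0.001 for the winning cell, its sister and their ancestor. The vectors and function names are hypothetical.

```python
ETA_WINNER = 0.01     # step size for the winning cell
ETA_SISTER = 0.005    # smaller step for the sister cell
ETA_ANCESTOR = 0.001  # smallest step for the direct ancestor node

def move_towards(node, protein, eta):
    """Return node + eta * (protein - node), applied to each coordinate."""
    return [c + eta * (g - c) for c, g in zip(node, protein)]

def adapt(winner, sister, ancestor, protein, sister_is_cell):
    """Adapt the winning cell and, if the sister is a cell, the sister and ancestor too."""
    winner = move_towards(winner, protein, ETA_WINNER)
    if sister_is_cell:
        sister = move_towards(sister, protein, ETA_SISTER)
        ancestor = move_towards(ancestor, protein, ETA_ANCESTOR)
    return winner, sister, ancestor

# Hypothetical two-dimensional vectors.
protein = [2.0, 3.0]
winner, sister, ancestor = adapt([1.0, 1.0], [0.0, 0.0], [0.5, 0.5],
                                 protein, sister_is_cell=True)
print(winner, sister, ancestor)
```

When the sister is an internal node (`sister_is_cell=False`), only the winning cell moves.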
The result of applying SOTA to a protein expression dataset is a hierarchical tree of
clusters, with each cluster having a limited degree of heterogeneity. The node values
define the averages of the cluster(s) to which they relate.
For the diagram below: Initially three nodes are used whose vectors are the average of
all the data. The nodes at the edge are called cells. The initial cells are extended into
daughter cells. Each protein becomes associated with its nearest cell (winning
cell). The winning cell is moved towards the protein by the above equation (light
arrows). If the sister of the winning cell is not an internal node then the sister and
parent move as well (dark arrows).
[Figure: SOTA network diagram]
TWO-WAY CLUSTERING.
There are many occasions when it is useful to cluster the samples, instead of or as
well as the protein expressions. One example of this is when looking at tissue
samples. In a study of tissue samples obtained from patients with a specific medical
condition, the samples are classified according to medical diagnosis based on clinical
symptoms and tests. However, there is a large heterogeneity between patients (age,
habits (e.g. smoking/non-smoking), living conditions, other disease conditions, etc.)
which affects the samples and may potentially mislead the analysis. Clustering the samples
can confirm classifications (i.e. hopefully different sample types will form separate
clusters). It is useful to do this before studying their molecular features.
DIFFERENTIAL EXPRESSION ANALYSIS
We often want to know which genes or proteins are differentially expressed, as this
should indicate the underlying processes that distinguish between the samples, for
example the genes activated at a specific point in the cell cycle. The simplest
technique makes these assignments by looking at straight expression ratios. If the
ratio exceeds a threshold (e.g. 2), the corresponding gene/protein is judged to be
differentially expressed.
• Differential expression over a time course and/or stimulation
• Differential expression induced by drugs/ligands
• Observe post-translational modification of proteins such as
glycosylation or phosphorylation.
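The ratio-threshold technique can be sketched as follows. The spot names and volumes are hypothetical, and treating a ratio below 1/threshold as a decrease of the same size is an assumption added here, not something stated in these notes.

```python
def differentially_expressed(control, treated, threshold=2.0):
    """Return (name, ratio) pairs whose treated/control ratio passes the threshold.

    Assumption: decreases are flagged symmetrically, i.e. ratio <= 1 / threshold.
    """
    flagged = []
    for name in control:
        ratio = treated[name] / control[name]
        if ratio >= threshold or ratio <= 1.0 / threshold:
            flagged.append((name, ratio))
    return flagged

# Hypothetical spot volumes in a control and a treated sample.
control = {"spot_1": 0.10, "spot_2": 0.20, "spot_3": 0.40}
treated = {"spot_1": 0.25, "spot_2": 0.22, "spot_3": 0.10}

flagged = differentially_expressed(control, treated)
print(flagged)
```

Here spot_1 (about 2.5-fold up) and spot_3 (4-fold down) are flagged, while spot_2 (about 1.1-fold) is not. A plain ratio takes no account of measurement variance.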
Student’s t-test
In a similar way, we often want to know if two samples are significantly different.
This involves comparing the expression levels of a large number of genes/proteins for
the two samples. Statistical tests should be employed to quantify the sample
difference.
To analyse which spots are significantly different between two sets of gels the
Student’s t-test can be used. Because the test takes into account the variance of the
measurements in each set of experiments, the larger the variance, the larger the
difference between the means needs to be for it to be significant.
\[ t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{ \dfrac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2} \left( \dfrac{1}{n_1} + \dfrac{1}{n_2} \right) }} \]
where n1 and n2 represent the number of independent experiments (gels, samples) in
each set, and s1 and s2 are the standard deviations of the two distributions. However,
in most expression analysis cases n1 = n2 = n, resulting in the simpler formula
\[ t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{ \dfrac{s_1^2 + s_2^2}{n} }} \]
The top part of the equation is the difference between the two means (or averages).
The bottom part is a measure of the variability (or dispersion) of the scores. The
calculated t-value is compared to a t-distribution. To do this two further quantities
must be defined – the significance level and the number of degrees of freedom.
Significant t-values are found at the tail ends of the distribution. Note that a
distinction is made between testing whether the measurements are different (two-tailed
test) and specifying that one measurement is, say, lower than the other (one-tailed
test). Here we are interested in the two-tailed test. The further the calculated t-statistic
is from zero, the more likely it is that the two measurements are statistically
significantly different. A significance level is set to define the false positive rate that
can be tolerated. A false positive here refers to incorrectly deducing the two
measurements are different when they are not, which is called a type I error. The rate,
often called \alpha, is typically set at 5% (or 1%), which means that five times (once) out
of a hundred such tests a statistically significant difference between the means is
reported even if there was none. The threshold t-statistic value is that for which the
area of the tails is the fraction \alpha of the area of the complete curve.
The number of degrees of freedom (df) for a particular calculation is defined as the
sum of the number of experiments in both groups minus two (n1+n2-2). Once the \alpha
level, df, and the t-value are available, reference to a standard t-test table will
determine whether the t-value is large enough to be significant. This analysis is
done automatically by many protein expression programs.
We will now work through a simple t-test for an experiment in which there are eight
gels in two groups of four, one group the control (Ci) and the other the treated samples
(Ti). The measured spot volume for the same protein feature in each gel is given in the
table below. To see whether this protein feature changes significantly between the
groups we have to analyse the following data:
Control   Volume      Treated   Volume
C1        0.0766      T1        0.1138
C2        0.0644      T2        0.0981
C3        0.0602      T3        0.0971
C4        0.1035      T4        0.1058
Mean      0.076175    Mean      0.1037
s1        0.019499    s2        0.007775
where the Mean row gives \bar{X}_1 and \bar{X}_2, the means of the control and treated
groups, and the last row gives the standard deviations s1 and s2.
\[ t = \frac{0.076175 - 0.1037}{\sqrt{ \dfrac{0.019499^2 + 0.007775^2}{4} }} = \frac{-0.027525}{0.010496} = -2.622 \]
The value of -2.622 must be compared with a t-test table, which for df = 6 and \alpha = 5%
gives a critical value of 2.447. As the absolute value of our calculated t is greater than
this, the difference is significant at the 5% level.
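The worked example can be checked with a few lines of code. This is a sketch using the pooled-variance form of the t-statistic and the eight spot volumes from the table; the critical value 2.447 is taken from the text rather than computed, and the function name `pooled_t` is chosen for this example.

```python
import math
import statistics

control = [0.0766, 0.0644, 0.0602, 0.1035]
treated = [0.1138, 0.0981, 0.0971, 0.1058]

def pooled_t(a, b):
    """Two-sample Student's t-statistic with pooled variance."""
    n1, n2 = len(a), len(b)
    s1, s2 = statistics.stdev(a), statistics.stdev(b)  # sample SDs (n - 1 denominator)
    pooled_var = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(
        pooled_var * (1 / n1 + 1 / n2))

t = pooled_t(control, treated)
df = len(control) + len(treated) - 2
T_CRIT = 2.447  # two-tailed critical value for df = 6 at the 5% level, from the text

print(round(t, 3), df, abs(t) > T_CRIT)
```

The statistic comes out at about -2.62 with 6 degrees of freedom, so its absolute value exceeds the critical value and the change in this spot is significant at the 5% level.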
REVIEW SLIDE
The image analysis
  Spot Detection
  Quantification
  Matching
Data storage (input/output)
  Administrative
  Analytical
Data Analysis
  Statistical
  Others
Data Integration
Data Mining
Pre-processing of data
  Transformation of data
  Reducing the amount of data
Similarity measure
  Euclidean distance
  Pearson correlation coefficient
Clustering
  Finding a predetermined number of clusters
    Hierarchical clustering
    SOM based clustering
    Evolutionary clustering algorithms
  Determining the number of clusters as well
    Self-organizing tree algorithm, SOTA
  Two-way clustering
Statistical analysis - Student's t-test
Differential mapping
MAIN STEPS IN PROTEOMICS
• Run 2D gel
• Dye Gel
• Scan Gel
• Detect Spots
• Quantify
• Match Spots
• Calculate differentials (& statistics)
• Choose spots for Mass Spec
• Cut chosen Spots
• Digest and run Mass Spec (e.g. MALDI MS)
• Analyse and save Mass Spec Data
• Identify protein based on MS data
• Data Mining on identified protein
http://www-lmmb.ncifcrf.gov/flicker/
http://www.expasy.ch/melanie/melanie-top.html
http://www-lmmb.ncifcrf.gov/2dwgDB/
http://www.harefield.nthames.nhs.uk/nhli/protein/index.html
http://www.lsbc.com/