Analyzing Expression Data: Clustering and Stats
Chapter 16

Goals
• We’ve measured the expression of genes or proteins using the technologies discussed previously.
• What can we do with that information?
– Identify significant differences in expression
– Identify similar patterns of expression (clustering)

Analysis Steps
1. Data normalization
2. Statistical analysis
3. Cluster analysis

I. Data Normalization
• Why normalize?
– It removes systematic errors.
– It makes the data easier to analyze statistically.

Sources of Error
• Measurements always contain errors.
– Systematic (oops)
– Random (noise!)
• Subtracting the background level can remove some systematic error.
– Using the ratio in two-channel experiments does this.
– Subtracting the overall average intensity can be used with one-channel data.
• Averaging over replicates of the experiment reduces the random error.
• Advanced error models are mentioned on p. 628 and covered in “Further Reading”.

Expression Data Are Usually Not Gaussian (Normal)
• Many statistical tests assume that the data are normally distributed.
• Expression microarray spot intensity data (for example) are not.
• Intensity ratio data (two-channel) are not normal either.
• Both range from 0 to infinity, whereas normal data are symmetric about the mean.

Taking the Logarithm Helps Normalize Expression Ratio Data
• The expression ratio is plotted versus the expression level (the geometric mean of the intensities in both channels).
• Plotting the log ratio vs. the log expression level gives data that are centered around y = 0 and fairly “normal looking”.
• Taking the log of the expression ratio “fixes” the left tail: ratios between 0 and 1 map to negative values, making the distribution symmetric.

LOWESS Normalization
• Sometimes there is still a bias that depends on the expression level.
• This can be removed by a type of regression called “Locally Weighted Scatterplot Smoothing” (LOWESS).
• This computes and subtracts the local mean of the log ratio at each value of the expression level (RG).

II. Statistical Analysis
• Determining which differences in expression are statistically significant
• Controlling false positives

When Are Two Measurements Significantly Different?
• We want to say that an expression ratio is significant if it is big enough (> 1) or small enough (< 1).
• A two-fold ratio (for example) is only significant if the variances of the underlying measurements are sufficiently small.
• The significance is related to the area of overlap of the underlying distributions.

The Z-test
• If the data are approximately normal, convert them to a Z-score:
Z = (X̄ − μ) / (σ / √n)
– X̄ can be the mean log expression ratio; μ is then 0.
– σ is the sample standard deviation; n is the number of repeats.
• The Z-score is distributed N(0, 1) (standard normal).
• The significance level is the area in the tail(s) of the standard normal distribution.

The t-test
• The t-test makes fewer assumptions about the data than the Z-test.
• It can be applied to compare two average measurements that have
– different variances
– different numbers of observations.
• You compute the t-statistic (see pages 654-655) and then look up the significance level in a table of Student’s t distribution.

III. Cluster Analysis
• Similar expression patterns
– Groups of genes/proteins with similar expression profiles
• Similar expression sub-patterns
– Groups of genes/proteins with similar expression profiles in a subset of conditions
• Different clustering methods
• Assessing the value of clusters

Example: Gene Expression Profiles
• The expression level of a gene is measured at different time points after treating cells.
• Many different expression profiles are possible:
– No effect
– Immediate increase or decrease
– Delayed increase or decrease
– Transient increase or decrease

Clustering by Eye
• n genes or proteins, m different samples (or conditions).
• Represent a gene as a point: X = <x1, x2, …, xm>.
• If m is 1 or 2 (or even 3), you can plot the points and look for clusters of genes with similar expression.
– But what if m is bigger than 3?
– We need to reduce the dimensionality: PCA.

Reducing the Dimensionality of Data: Principal Components Analysis
• PCA linearly maps each point to a small set of dimensions (components).
– The principal components are the directions that capture the maximum variation in the data.
• The principal components usually capture most of the important information in the data.
• Plotting each point’s values along two of the principal components lets us see clusters.
(Figure: PCA of 2-D gel data)

PCA: An Illustration (Yeast Cell Cycle Gene Expression)
• The singular value decomposition (SVD) of a matrix X is
X = U Σ V^T
• The mapped value of X is
Y = X V
• The rows of Y give the mapping of each gene: mapped gene i is Yi = <y1, y2, …, ym>.
(PNAS, 2000)

Clustering Using Statistics
• The algorithm identifies groups.
– Example: genes with similar expression profiles.
• A distance measure between pairs of points is needed.

Distance Measures Between Pairs of Points
• To cluster the points (genes or conditions), we need some notion of which points are “close” to each other.
• So we need a measure of distance (or, conversely, similarity) between two rows (or columns) of our n × m matrix.
• We can then compute all the pairwise distances between rows (or columns).
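As a concrete sketch, the two most common of these distances can be computed directly from a pair of expression profiles. The function names and the toy profiles below are illustrative, not from the chapter:

```python
import math

def euclidean(x, y):
    # Standard Euclidean distance: treats all dimensions equally,
    # so large absolute changes dominate the result.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson_distance(x, y):
    # 1 - r, where r is the Pearson correlation coefficient.
    # Small when two profiles rise and fall together, regardless
    # of the absolute size of the changes (0 = identical pattern,
    # 1 = uncorrelated, 2 = opposite pattern).
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)

# Two genes with the same pattern of change but different magnitudes:
g1 = [1.0, 2.0, 3.0, 2.0]
g2 = [10.0, 20.0, 30.0, 20.0]
print(euclidean(g1, g2))         # large (about 38.2): magnitudes differ
print(pearson_distance(g1, g2))  # about 0.0: the patterns are identical
```

This illustrates the point above: under the Euclidean distance these two genes look far apart, while under the correlation-based distance they cluster together.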
Standard Distance Measures
• Euclidean distance
• Pearson correlation coefficient
• Mahalanobis distance

Euclidean Distance
• The standard, everyday distance.
– It treats all dimensions equally.
– If some genes vary more than others (have higher variance), they influence the distance more.

Mahalanobis Distance
• A “normalized” Euclidean distance.
• It scales each dimension by the variance in that dimension.
– This is useful if the genes tend to vary much more in one sample than in the others, since it reduces the effect of that sample on the distances.

Pearson Correlation Coefficient
• Distances are small when two genes have similar patterns of change, even if the sizes of the changes are different.
• This is accomplished by scaling by the sample variance of each gene’s expression levels across conditions.

Choice of Distance Matters
• Hierarchical clustering (dendrogram) of tissues.
– This corresponds to clustering the columns of the matrix.
• Different distance measures give different branches (cancer B/C vs. A/B).

Clustering Algorithms
• Hierarchical clustering
• K-means clustering
• Self-organizing maps and trees

Hierarchical Clustering
• Algorithms progressively merge clusters or split clusters.
– The merging criterion can be single-linkage or complete-linkage.
• They produce dendrograms, which can be interpreted at different thresholds.

Types of Linkage
• A. Single linkage
• B. Complete linkage
• C. Centroid method

K-means Clustering
• Related to Expectation Maximization.
• You specify the number of clusters.
• The algorithm iteratively moves the means of the clusters to maximize the likelihood (minimize the total error).
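The K-means procedure above can be sketched in a few lines of Python using the Euclidean distance. This is a minimal illustration of Lloyd’s algorithm, not the book’s code; the toy profiles and starting means are made up, chosen so the run is deterministic:

```python
import math

def euclidean(x, y):
    # Euclidean distance between two expression profiles.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def kmeans(points, means, n_iter=20):
    """Lloyd's algorithm: alternately assign each point to its nearest
    mean, then recompute each mean as the centroid of its cluster.
    Each pass reduces (or leaves unchanged) the total squared error."""
    clusters = [[] for _ in means]
    for _ in range(n_iter):
        # Assignment step: each point goes to the closest mean.
        clusters = [[] for _ in means]
        for p in points:
            nearest = min(range(len(means)), key=lambda i: euclidean(p, means[i]))
            clusters[nearest].append(p)
        # Update step: move each mean to its cluster's centroid
        # (an empty cluster keeps its old mean).
        means = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else m
            for m, cl in zip(means, clusters)
        ]
    return means, clusters

# Two obvious groups of 2-D expression profiles; seed the means with
# one point from each group so the result is deterministic.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
means, clusters = kmeans(points, means=[(0, 0), (10, 10)])
print(clusters)  # → [[(0, 0), (0, 1), (1, 0)], [(10, 10), (10, 11), (11, 10)]]
```

In practice the initial means are usually chosen randomly and the algorithm is restarted several times, since K-means only finds a local minimum of the total error.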