Gene Expression Data Analysis
Zhang Louxin
Dept. of Mathematics, Nat. University of Singapore

cDNA Microarray
• Based on the hybridization principle.
• Uses parallelism so that one can observe the activity of thousands of genes at a time (P. Brown, Stanford).

Paradigm for Using cDNA Microarrays
[Figure: workflow with the stages Patients / Animals / Cell Lines, Appropriate Tissue, Extract RNA, Microarray Hybridization, Scan Microarray, Computer Analysis]
The data measure the relative ratio of the mRNA abundance of each gene in the test sample to that in the reference sample.

cDNA microarray schema: P. Brown's approach
Data from a single experiment measure the relative ratio of the mRNA abundance of each gene on the array in the two samples (D. Duggan et al., Nature Genetics, 1999).

Applications
• Gene function assignment by guilt-by-association: cluster genes together into groups; unknown genes are assigned a function based on the known functions of genes in the same expression cluster.
• Gene prediction.
• The regulatory network of living cells.
• Clinical diagnosis (especially for cancers).
For a given cell, arrays can produce a snapshot revealing which genes are on or off at a particular time. Cancers are caused by gene disorders. These disorders result in a deviation of the gene expression profile from that of the normal cell.

Microarray Data Analysis
Array Quantification (from digital image)
• Remove artifacts
• Subtract background
Quality control
• Normalization
• Detect outliers
Data Mining
Gene Expression Matrix

Difficulties of the Analysis
• The myriad random and systematic measurement errors.
• Random errors are caused by the time at which the arrays are processed, target accessibility, and variation in washing procedures.
• Systematic errors are biases. They result in a constant tendency to over- or underestimate true values. Biasing factors depend on the spotting, scanning, and labelling technologies.
• A small number of samples (cell lines, patients) but a large number of variables (probes or genes).

Normalization 1: Ratio and Log Transformation
Ratios of raw expression values from image quantification are usually not appropriate for statistical analysis. Log-transformed data are usually used. Why?
(1) The log transformation removes much of the proportional relationship between random error and signal intensity. Most statistical tests assume an additive error model.
(2) Distributions of replicated log-transformed expression values tend to be normal.
(3) Summary statistics of log ratios yield the same quantities, regardless of the numerator/denominator assignment.
Example: Consider the treatment:control ratios for three replicates, 2:1.1, 5:1.4, and 15:2, together with the inverted ratios. The ratios and the inverted ratios have different means and standard deviations, but their logs have means of equal magnitude (with opposite signs) and equal standard deviations.

Normalization 2: Normalizing Two Experiments
The expression levels of genes are normalized to a common standard so that they can be compared. The power of microarray analysis comes from the analysis of many experiments to identify common patterns of expression.
Techniques:
• "Housekeeping" genes
• Spiked controls
• Global normalization to the overall distribution
[Figure: intercept correction; labels: exp1, exp2, Ref, Exp. value, experiments]

Normalization 3: Outliers
Concept: Outliers are extreme values in a distribution of replicates. Their proportion can be as high as 15% in a typical microarray experiment.
Reason:
(1) They are caused by image artifacts (e.g. dust on a cDNA array, or blooming of adjoining spots on a radioisotopic array).
(2) They can also be caused by factors such as cross-hybridization or the failure of one probe to hybridize adequately.
Detection: Large sample sizes are needed to detect outliers more accurately and precisely. Estimate errors on all the probes, rather than on a probe-by-probe basis.
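The symmetry claim in the log-transformation example (treatment:control ratios 2:1.1, 5:1.4, 15:2) can be checked numerically. The following is a minimal sketch using only the Python standard library; the choice of base-2 logarithms and of the sample standard deviation are assumptions, since the slides do not specify either.

```python
import math
import statistics

# Treatment:control pairs taken from the three-replicate example.
pairs = [(2.0, 1.1), (5.0, 1.4), (15.0, 2.0)]

ratios = [t / c for t, c in pairs]    # treatment / control
inverted = [c / t for t, c in pairs]  # control / treatment


def summarize(values):
    """Return (mean, sample standard deviation) of a list of numbers."""
    return statistics.mean(values), statistics.stdev(values)


raw_mean, raw_sd = summarize(ratios)
inv_mean, inv_sd = summarize(inverted)

# Base 2 is an assumption; any base gives the same symmetry.
log_mean, log_sd = summarize([math.log2(r) for r in ratios])
log_inv_mean, log_inv_sd = summarize([math.log2(r) for r in inverted])

print(f"raw ratios:      mean={raw_mean:.3f}  sd={raw_sd:.3f}")
print(f"inverted ratios: mean={inv_mean:.3f}  sd={inv_sd:.3f}")
print(f"log2 ratios:     mean={log_mean:.3f}  sd={log_sd:.3f}")
print(f"log2 inverted:   mean={log_inv_mean:.3f}  sd={log_inv_sd:.3f}")
```

Because log(c/t) = -log(t/c), every logged inverted ratio is the negative of the corresponding logged ratio, so the two log means differ only in sign and the two log standard deviations coincide. The raw summaries show no such relationship.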
Mining Gene Expression Data
Classification: Classifying genes (or tissues, conditions) into groups, each containing genes (or tissues) with similar attributes.
Class Prediction: Given a set of known classes of genes (or tissues), determine the correct class for new genes (or tissues).

PART 1: Molecular Classification
• Traditional clustering algorithms: K-means, Self-Organising Maps, Hierarchical Clustering
• Graph-theoretic clustering algorithms (Ben-Dor et al. '99, Hartuv et al. '99)

K-means, Self-Organising Maps
Input: A gene expression matrix and an integer k.
Output: k disjoint groups of genes with similar expression.
[Figure: clustering genes with K=3 over experiments exp1 to exp4]
Each gene g_i is represented by its expression vector (a_i1, a_i2, a_i3, a_i4).

K-means Algorithm
Arbitrarily partition the input points into K clusters; each cluster is represented by its geometrical center. Repeatedly adjust the K clusters by assigning each point to the nearest cluster.
[Figure: input points and initial partition, K=3]

Hierarchical Clustering Algorithm
Input: Some data points.
Output: A set of clusters arranged in a tree (a hierarchical structure).
What is the distance between clusters? The average pairwise distance.
Each internal node corresponds to a cluster.

Identify Subtypes of Diffuse Large B-Cell Lymphoma (DLBCL) (Alizadeh et al., Nature, 2000)
• A special cDNA microarray, the "Lymphochip", was designed:
  12,069 cDNA clones from a germinal centre B-cell library;
  2,338 cDNA clones from libraries derived from DLBCL, follicular lymphoma (FL), mantle cell lymphoma, and chronic lymphocytic leukaemia (CLL);
  3,349 other cDNA clones.
• Study gene expression patterns in three lymphoid malignancies: DLBCL, FL and CLL.
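Returning to the K-means procedure described earlier (arbitrary initial partition, clusters represented by their geometrical centers, repeated reassignment to the nearest center), the steps can be sketched as follows. This is an illustrative sketch, not the implementation used in any cited study: the function names, the round-robin initial partition, the squared Euclidean distance, the empty-cluster handling, and the toy expression matrix are all assumptions.

```python
import random


def centroid(points):
    """Geometrical center (coordinate-wise mean) of a non-empty list of points."""
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))


def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iter=100, seed=0):
    """Return one cluster label in range(k) per point (requires k <= len(points))."""
    rng = random.Random(seed)
    # Arbitrary initial partition: shuffle the indices and deal them
    # round-robin, so every cluster starts non-empty.
    order = list(range(len(points)))
    rng.shuffle(order)
    labels = [0] * len(points)
    for rank, idx in enumerate(order):
        labels[idx] = rank % k

    for _ in range(max_iter):
        # Represent each cluster by its geometrical center.
        centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # Assumption: an emptied cluster is re-seeded with a random point.
            centers.append(centroid(members) if members else rng.choice(points))
        # Reassign every point to the nearest center.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:  # no point moved: converged
            break
        labels = new_labels
    return labels


# Hypothetical expression matrix: 6 genes (rows) x 4 experiments (columns).
genes = [
    (0.1, 0.2, 0.1, 0.0), (0.0, 0.1, 0.2, 0.1), (0.2, 0.0, 0.1, 0.1),
    (2.0, 2.1, 1.9, 2.0), (2.1, 2.0, 2.0, 1.9), (1.9, 2.0, 2.1, 2.0),
]
labels = kmeans(genes, k=2)
print(labels)
```

K-means can converge to different partitions from different starting partitions, so in practice it is common to run it several times with different seeds and compare the results.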
[Figure: 96 normal and malignant lymphocyte samples]
[Figures: Germinal centre B-like DLBCL vs Activated B-like DLBCL; International Prognostic Indicator. Courtesy Alizadeh]

Remarks
• Programmes designed to cluster data generally re-order the rows, or columns, or both, such that patterns of expression become visually apparent when presented in this fashion.
• There might never be a "best" approach for clustering data. Different approaches allow different aspects of the data to be explored. They are subjective. Different distance metrics will place different objects in different clusters.
• Understanding the underlying biology, particularly of gene regulation, is important.

Research Problem
Bi-clustering: cluster genes and experiments at the same time.
Why? Some genes are only co-regulated in a subset of conditions (experiments).
References:
Y. Kluger et al. Spectral Biclustering of Microarray Data: Coclustering Genes and Conditions. Genome Res. 13, 703-716.
L. Zhang and S. Zhu. A New Clustering Method for Microarray Data Analysis. Proc. IEEE CSB 2002.