Mine Microarray Gene Expression Data, Predict Cancers

Gene Expression Data Analysis
Zhang Louxin
Dept. of Mathematics
Nat. University of Singapore
cDNA Microarray
Based on the hybridization principle;
Uses parallelism so that one can observe the activity
of thousands of genes at a time;
P. Brown/Stanford
Paradigm for Using cDNA Microarrays
[Figure: workflow from patients, animals, or cell lines, through appropriate tissue, RNA extraction, microarray hybridization, and microarray scanning, to computer analysis]
The data measure the ratio of mRNA abundance of each gene
in the test sample relative to the reference.
cDNA microarray schema -- P. Brown’s approach
Data from a single experiment measures the relative ratio
of mRNA abundance of each gene on the array in the two
samples (D. Duggan et al., Nature Genetics, 1999)
Applications
• Gene function assignment: guilt by association;
Cluster genes together into groups;
unknown genes are assigned a function based on the known
functions of genes in the same expression cluster.
• Gene prediction;
• The regulatory network of living cells;
• Clinical diagnosis (especially for cancers).
For a given cell, arrays can produce a snapshot revealing
which genes are on or off at a particular time.
Cancers are caused by gene disorders. These disorders
result in a deviation of the gene expression profile from
that of the normal cell.
Microarray Data Analysis
Array Quantification
(from digital image)
• Remove artifacts
• Subtract background
Quality control
• Normalization
• Detect outliers
Data Mining
Gene Expression Matrix
Difficulties of the Analysis
• Myriad random and systematic measurement errors
• Random errors are caused by factors such as the time at which
the arrays are processed, target accessibility, and
variation in washing procedures.
• Systematic errors are biases. They result in a constant
tendency to over- or underestimate true values.
Biasing factors depend on spotting, scanning, and
labelling technologies.
• Small numbers of samples (cell lines, patients)
but large numbers of variables (probes or genes)
Normalization 1
- ratio and log transformation
Ratios of raw expression values from image quantification are usually
not appropriate for statistical analysis. Log-transformed data
are usually used instead.
Why?
(1). The log transformation removes much of the proportional
relationship between random error and signal intensity.
Most statistical tests assume an additive error model.
(2). Distributions of replicated logged expression values tend
to be normal.
(3). Summary statistics of log ratios have the same magnitude, regardless of
the numerator/denominator assignment.
Example: Consider treatment:control ratios for three replicates,
2:1.1, 5:1.4, 15:2,
and the inverted ratios. They have different means and standard deviations,
but their logs have the same mean (with opposite signs) and the same standard deviation.
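The example above can be checked numerically. The following is a minimal sketch using only the Python standard library; the choice of base-2 logarithms is an assumption (the slide does not name a base, and any base shows the same symmetry).

```python
import math
import statistics

# Treatment:control pairs for the three replicates in the example.
pairs = [(2, 1.1), (5, 1.4), (15, 2)]

ratios = [t / c for t, c in pairs]
inverted = [c / t for t, c in pairs]

# Raw ratios: mean and standard deviation depend on which sample
# is placed in the numerator.
raw_stats = (statistics.mean(ratios), statistics.stdev(ratios))
raw_stats_inverted = (statistics.mean(inverted), statistics.stdev(inverted))

# Log ratios (base 2 is an assumption): inverting a ratio only flips
# the sign of its log, so the spread is unchanged.
log_ratios = [math.log2(r) for r in ratios]
log_inverted = [math.log2(r) for r in inverted]
log_stats = (statistics.mean(log_ratios), statistics.stdev(log_ratios))
log_stats_inverted = (statistics.mean(log_inverted), statistics.stdev(log_inverted))

print("raw :", raw_stats, raw_stats_inverted)
print("log2:", log_stats, log_stats_inverted)
```

The raw summaries differ between the two assignments, while the log summaries differ only in the sign of the mean.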
Normalization 2
- normalize two experiments
The expression levels of genes are normalized to a common
standard so that they can be compared.
The power of microarray analysis comes from the analysis of
many experiments to identify common patterns of expression.
Techniques:
• “Housekeeping” genes
• Spiked controls
• Global normalization to overall distribution
[Figure: expression values across experiments (exp1, exp2, Ref), illustrating intercept correction]
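Global normalization can take several forms, and the slide does not specify one. As a minimal sketch of one common variant, the code below median-centres the log ratios of each experiment, which on the log scale amounts to an intercept correction. It assumes that most genes are unchanged between samples; the function name `median_center` and the data values are hypothetical.

```python
import statistics

def median_center(log_ratios):
    """Shift one experiment's log ratios so that their median is zero."""
    shift = statistics.median(log_ratios)
    return [value - shift for value in log_ratios]

# Hypothetical log2 ratios for the same five genes in two experiments.
# exp2 carries a constant offset of +1.0, standing in for a systematic
# bias such as a labelling or scanning difference.
exp1 = [-1.0, -0.5, 0.0, 0.5, 1.0]
exp2 = [0.0, 0.5, 1.0, 1.5, 2.0]

exp1_normalized = median_center(exp1)
exp2_normalized = median_center(exp2)
print(exp1_normalized)
print(exp2_normalized)
```

After centring, the two experiments share a common reference point and their values can be compared gene by gene.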
Normalization 3
- Outliers
Concept: Outliers are extreme values in a distribution
of replicates. Their proportion can be as high as
15% in a typical microarray experiment.
Reason: (1). They can be caused by image artifacts (e.g. dust on a cDNA
array, or blooming of adjoining spots on a radioisotopic array).
(2). They can also be caused by factors such as cross-hybridization
or failure of a probe to hybridize adequately.
Detection: Large sample sizes are needed to detect outliers
accurately and precisely.
Estimate errors across all the probes, rather than on a
probe-by-probe basis.
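One way to estimate errors across all probes is to pool the replicate residuals of every probe into a single error scale. The sketch below illustrates that idea only; it is not a method given in the slides. The function `flag_outliers`, the scaled median absolute deviation as the error estimate, and the threshold of 3 are all assumptions.

```python
import statistics

def flag_outliers(replicates, threshold=3.0):
    """Flag replicate values using one error estimate pooled over all probes.

    `replicates` maps a probe name to its list of replicate log values.
    Returns (probe, replicate_index) pairs. The threshold is illustrative.
    """
    # Residual of each replicate from the median of its own probe.
    residuals = {
        probe: [v - statistics.median(values) for v in values]
        for probe, values in replicates.items()
    }
    # Pool absolute residuals across probes. The factor 1.4826 scales the
    # median absolute deviation to a standard deviation under normality.
    pooled = [abs(r) for res in residuals.values() for r in res]
    scale = 1.4826 * statistics.median(pooled)
    return [
        (probe, i)
        for probe, res in residuals.items()
        for i, r in enumerate(res)
        if abs(r) > threshold * scale
    ]

# Hypothetical data: three replicates for each of four probes.
data = {
    "g1": [1.0, 1.1, 0.9],
    "g2": [2.0, 2.1, 5.0],
    "g3": [-0.5, -0.4, -0.6],
    "g4": [0.0, 0.1, -0.1],
}
flagged = flag_outliers(data)
print(flagged)
```

With only three replicates per probe, a probe-by-probe standard deviation would be dominated by the outlier itself; pooling gives a steadier error estimate. If the pooled scale is zero, every nonzero residual is flagged, so real data would need a guard for that case.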
Mining Gene Expression Data
Classification:
Classifying genes (or tissues, conditions)
into groups, each containing genes (or tissues)
with similar attributes.
Class Prediction:
Given a set of known classes of genes (or tissues),
determine the correct class for new genes
(or tissues).
PART 1:
Molecular Classification
Traditional Clustering Algorithms:
K-means, Self-Organising Maps,
Hierarchical Clustering
Graph Theoretic-based Clustering Algorithms
(Ben-Dor et al. ’99, Hartuv et al. ’99)
K-means, Self-Organising Maps:
Input: Gene expression matrix, and an integer k;
Output: k disjoint groups of genes with similar expression.
[Figure: clustering genes with K=3; each gene is a point $g_i = (a_{i1}, a_{i2}, a_{i3}, a_{i4})$ of expression values over experiments exp1 to exp4]
K-means Algorithm:
Arbitrarily partition the input points into K clusters;
Each cluster is represented by its geometrical center.
Repeatedly adjust the K clusters by assigning each point to
the cluster with the nearest center and recomputing the centers,
until the assignments no longer change.
[Figure: K-means example with K=3, showing the input points and the initial partition]
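The K-means steps can be sketched in plain Python as follows. This is a minimal illustration, not a tuned implementation: the round-robin initial partition, the `max_iter` cap, and the rule that an emptied cluster keeps its previous center are assumptions made here.

```python
import math

def centroid(points):
    """Geometrical center of a non-empty list of equal-length tuples."""
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))

def k_means(points, k, max_iter=100):
    """Return (labels, centers) for `points`, assuming k <= len(points)."""
    # Arbitrary initial partition: point i goes to cluster i mod k.
    labels = [i % k for i in range(len(points))]
    centers = [None] * k
    for _ in range(max_iter):
        # Represent each cluster by its geometrical center.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an emptied cluster keeps its previous center
                centers[c] = centroid(members)
        # Assign every point to the cluster with the nearest center.
        new_labels = [
            min(
                (c for c in range(k) if centers[c] is not None),
                key=lambda c: math.dist(p, centers[c]),
            )
            for p in points
        ]
        if new_labels == labels:  # assignments are stable: converged
            break
        labels = new_labels
    return labels, centers
```

For gene clustering, each point would be one gene's expression profile across the experiments. The result depends on the initial partition, so in practice the procedure is often run from several starting partitions.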
Hierarchical Clustering Algorithm:
Input: Some data points;
Output: A set of clusters arranged in a tree (a hierarchical structure).
What is the distance between clusters?
Average pairwise distance.
Each internal node corresponds to a cluster.
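A bottom-up (agglomerative) version of this idea, using the average pairwise distance between clusters, might look like the sketch below. The nested-tuple tree representation and the function names are assumptions chosen for illustration; the quadratic search for the closest pair is kept simple rather than efficient.

```python
import math
from itertools import product

def average_distance(a, b, points):
    """Average pairwise Euclidean distance between two clusters of indices."""
    total = sum(math.dist(points[i], points[j]) for i, j in product(a, b))
    return total / (len(a) * len(b))

def hierarchical_clustering(points):
    """Return a tree as nested tuples; leaves are indices into `points`.

    Assumes `points` is non-empty.
    """
    # Start with one cluster per point: (tree, member indices).
    clusters = [(i, [i]) for i in range(len(points))]
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest average distance.
        x, y = min(
            ((x, y) for x in range(len(clusters)) for y in range(x + 1, len(clusters))),
            key=lambda xy: average_distance(clusters[xy[0]][1], clusters[xy[1]][1], points),
        )
        tree_x, members_x = clusters[x]
        tree_y, members_y = clusters[y]
        # The merged pair becomes a new internal node of the tree.
        merged = ((tree_x, tree_y), members_x + members_y)
        clusters = [c for idx, c in enumerate(clusters) if idx not in (x, y)] + [merged]
    return clusters[0][0]
```

Each nested tuple in the returned tree is an internal node, and the leaves below it form the corresponding cluster.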
Identify Subtypes of
Diffuse Large B-Cell Lymphoma (DLBCL)
(Alizadeh et al. Nature, 2000)
• A special cDNA microarray, the “Lymphochip”, was designed:
12,069 cDNA clones from a germinal centre B-cell library;
2,338 cDNA clones from libraries derived from
DLBCL, follicular lymphoma (FL), mantle cell lymphoma,
and chronic lymphocytic leukaemia (CLL);
3,349 other cDNA clones.
• Study gene expression patterns in three lymphoid
malignancies: DLBCL, FL and CLL.
96 normal and malignant lymphocyte samples
[Figure: germinal centre B-like DLBCL vs activated B-like DLBCL. Courtesy Alizadeh]
International Prognostic Indicator
Remarks
• Programmes designed to cluster data generally re-order
the rows, or columns, or both, such that patterns of expression
become visually apparent when presented in this fashion.
• There might never be a ‘best’ approach for clustering data.
Different approaches allow different aspects of the data to
be explored.
Clustering results are subjective. Different distance metrics will place
different objects in different clusters.
• Understanding the underlying biology, particularly of
gene regulation, is important.
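The remark about distance metrics can be made concrete with a small sketch. The three expression profiles below are hypothetical, and the two metrics (Euclidean distance and one minus the Pearson correlation) are chosen only as familiar examples.

```python
import math

def euclidean_distance(x, y):
    return math.dist(x, y)

def correlation_distance(x, y):
    """One minus the Pearson correlation of two expression profiles."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    cov = sum(a * b for a, b in zip(dx, dy))
    norm = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return 1.0 - cov / norm

def nearest(target, candidates, distance):
    """Name of the candidate profile closest to `target` under `distance`."""
    return min(candidates, key=lambda name: distance(target, candidates[name]))

# Hypothetical profiles: "b" has the same shape as profile_a at ten times
# the magnitude; "c" has similar magnitude but a different shape.
profile_a = (1, 2, 3, 4)
candidates = {"b": (10, 20, 30, 40), "c": (2, 1, 4, 3)}

nearest_euclidean = nearest(profile_a, candidates, euclidean_distance)
nearest_correlation = nearest(profile_a, candidates, correlation_distance)
print(nearest_euclidean, nearest_correlation)
```

Euclidean distance groups the profile with the one of similar magnitude, while correlation distance groups it with the one of similar shape, so the same data yield different clusters.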
Research Problem
Bi-clustering: cluster genes and experiments at the same
time
Why? Some genes are only co-regulated in a subset of
conditions (experiments).
References:
Y. Kluger et al. Spectral Biclustering of Microarray
Data: Coclustering Genes and Conditions.
Genome Res. 13, 703-716.
L. Zhang and S. Zhu. A New Clustering Method for
Microarray Data Analysis. Proc. IEEE CSB 2002.