Microarray Data Analysis
(MDA)
CS 491 (Individual Project)
Summer Fall 2009
CalState Los Angeles
Department of Computer Science
Prof. Chengyu Sun
Prepared by
Modi, Hardik
December 11th, 2009
Abstract
Microarray technology is widely used to monitor the expression profiles of tens of thousands of
genes simultaneously. The protocol usually produces massive amounts of raw gene expression
data. Although software to analyze these data is freely available from academic institutions,
open source software is not. In the present project, I have developed a Microarray Data
Analysis (MDA) application that analyzes raw gene expression data and provides a basic
framework for the development of a comprehensive MDA application.
This MDA application was used to analyze the leukemia data published by Golub et al. (Science
1999). Primary analysis with the application identified genes whose expression is associated
with the leukemic phenotype. The statistical significance of these genes was assessed using
Student's t-test, and the result is represented as a bar graph. Next, the application was used
to predict the class membership of new leukemia samples using the k-nearest neighbor (kNN)
algorithm. The results obtained with MDA broadly replicate the conclusions of Golub et al.
Finally, hierarchical and k-means clustering analysis showed that the different classes of
leukemia samples cluster well with their own classes. Overall, this application provides 1) a
tool for primary data analysis and visualization, 2) a successful implementation of a
statistical method, 3) prediction of class membership, and 4) clustering of gene expression
data.
1. Introduction
Microarray technology (Appendix b) is widely used to monitor the expression profiles of tens of
thousands of genes in parallel, across different cells and different experimental conditions. A
typical experiment monitors gene expression levels in different groups of patients. At the end
of a typical microarray experiment, an image (TIFF) file of the gene expression data is
generated, and commercial software is used to report the primary data, which are then further
analyzed. The application described here, MDA, analyzes the raw primary data generated by an
experiment. Briefly, it first preprocesses the data, i.e., performs data normalization and
transformation. This step eliminates control (housekeeping) genes, replaces expression values
below a predetermined threshold, eliminates genes with no change in fold expression level, and
represents the data in different formats, for example, as absolute measurements, as relative
measurements (expression ratios), or as log2(expression ratio). The data are then represented
in graph format for easy visualization. A major goal of this application is to make meaningful
biological inferences about a set of genes or samples using class prediction. Class prediction
is based on supervised data analysis methods that impose known groups on data sets, such as the
k-nearest neighbor (kNN) classification method. Finally, the application employs hierarchical
clustering to visualize the data and to explore relationships among distance metric, variable
selection, and classification. In conclusion, this application provides a basic framework for
complete analysis of raw microarray data.
1.1. Rationale
Microarray technology is commercially available from multiple vendors. However, each vendor
requires the user to analyze the data using its proprietary software. The first step after a
microarray experiment is data acquisition, which generates the raw data, followed by data
analysis. A major issue with microarray data analysis is cost, which may be very significant
for laboratories that do not use microarrays routinely. Although free software is available
from academic institutions, open source software is not. Moreover, there are many different
methods for data analysis, both statistical and algorithmic. In addition, a laboratory usually
needs to tailor its own software to analyze its data in multiple ways that may not be offered
by the vendor. My objective is to develop open source microarray data analysis software that
can be used to analyze raw gene expression data.
1.2. Implementation
Perl (http://www.perl.org) is the programming language of choice for MDA because it has a
number of features, such as built-in support for text processing, lists, and hash tables, that
make it possible to express algorithms very concisely. It has simple-to-use database support
with advanced features. It is also already widely used for server-side scripting (CGI) in
web-based applications, and a large library of code (the modules of the BioPerl effort,
described at www.bioperl.org) is freely available to assist bioinformatics programmers. Last
but not least, Perl is portable, i.e., it runs on all major operating systems, and it is freely
available at www.perl.org.
2. System Overview of MDA
Raw Data
   |
   v
Normalization of the Data
   1. Extraction of fluorescent value
   2. Removal of control genes
   3. Calculation of expression change
   4. Elimination of genes with less than two fold changes
   |
   v
Visual Representation of data in bar graph and scatter plot
   |
   v
Significance
   - t-test
   |
   v
Classification
   - K nearest neighbor (kNN)
   |
   v
Clustering
   - Hierarchical clustering
   - K-means clustering
The schematic workflow above outlines the basic steps MDA uses to analyze raw gene expression
data, given as a fluorescent intensity table. This table is generated at the end of a
microarray experiment and is the input to this application. In the table, rows represent genes,
columns represent samples such as tissues or experimental conditions, and the number in each
cell characterizes the expression level of a particular gene in a particular sample. The first
step in this application is primary data analysis: it retrieves the value for a particular gene
in a particular sample and then performs preprocessing, i.e., data normalization and
transformation. This step eliminates housekeeping (control) genes and genes with less than a
two-fold change in expression level, to reduce noise and to identify genes with a specific
change in expression level. Next, it identifies differentially expressed genes (i.e., up- or
down-regulated genes) using the intensity ratio, which is the raw expression value; log2(ratio);
or fold change. These primary data are also represented as bar graphs and scatter plots for
easy visualization and further analysis. If the samples are replicated, differentially
expressed genes can be found more effectively by statistical analysis. This is done with
Student's t-test, which gives the probability (p value) associated with the observed
difference. Student's t-test is used to determine whether two samples are likely to have come
from two underlying populations that have the same mean. The next step is exploratory analysis
and class prediction, which aims to accurately predict classes based on patterns of expression
across multiple genes from different samples. Class prediction is done with k-nearest neighbor
(kNN), an instance-based learning method that belongs to the supervised data analysis methods,
which impose known groups on datasets. Finally, hierarchical and k-means analyses are employed
for clustering the gene expression data. This allows the user to visualize the data by looking
at changes in expression pattern rather than following the actual numerical or absolute
changes.
3. Details of the MDA System
3.1. Preprocessing
The first step after data acquisition is preprocessing, which includes retrieving the
fluorescent intensity value for a particular gene in a particular sample, followed by the
primary data analysis that needs to be applied before the data are suitable for detailed
analysis.
3.2. Removal of endogenous control genes
The expression ratio of a control (housekeeping) gene should not change between two conditions,
but one often finds that it deviates from 1. This may be due to various reasons, for example,
variation caused by differential labeling efficiency of the two fluorescent dyes, or different
amounts of starting mRNA material in the two samples. Preprocessing, or normalization, is the
term used to describe the process of eliminating such variation to allow appropriate comparison
of data obtained from two samples. Here, endogenous control genes and miscellaneous control
genes are identified by scanning through the data file, and their expression values are removed
from further analysis.
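The scan described above might look roughly like the following sketch. It is written in Python purely for illustration (MDA itself is a Perl application), and the report does not say how control genes are recognized in the data file. The sketch assumes they can be recognized by an identifier prefix; `remove_control_genes`, its `control_prefixes` parameter, the choice of "AFFX" as the marker, and the intensity values are all hypothetical.

```python
def remove_control_genes(rows, control_prefixes=("AFFX",)):
    """Drop rows whose gene identifier starts with a control prefix.

    rows: list of (gene_id, values) tuples read from the intensity table.
    control_prefixes: assumed markers for control genes (hypothetical).
    """
    prefixes = tuple(control_prefixes)
    return [(gene_id, values) for gene_id, values in rows
            if not gene_id.startswith(prefixes)]


# Made-up intensity values, for illustration only.
table = [
    ("AFFX-BioB-5_at", [-214.0, -139.0]),
    ("M55150_at", [1023.0, 88.0]),
    ("U50136_rna1_at", [76.0, 912.0]),
]
filtered = remove_control_genes(table)
print([gene_id for gene_id, _ in filtered])
```

A real data file may mark control genes differently (for example in a description column), in which case the matching rule would need to change accordingly.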
3.3. Calculation of expression change
Four measures of expression are used here: 1) the absolute value; 2) the intensity ratio, which
is the raw expression value; and 3) the log ratio and 4) the fold change, which are both
derived from the intensity ratio by transformation.
3.3.1. Absolute value
The expression values retrieved in the first step can be viewed as a gene expression matrix in
which rows represent genes and columns represent particular conditions. Each cell contains a
value, given in arbitrary units, that reflects the expression level of a gene under the
corresponding condition. Negative expression levels are replaced by a minimum value, e.g., a
value of 20, so that the data can be represented in log2 form.
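As a minimal sketch of this replacement step (in Python, for illustration only; not MDA's Perl code), the function below raises every value under the minimum to that minimum. The report only states that negative values are replaced, so applying the floor to small positive values as well is an assumption, and the name `floor_expression` and the sample values are hypothetical.

```python
def floor_expression(matrix, minimum=20.0):
    """Return a copy of the matrix with every value below `minimum` raised to it.

    Assumption: values between 0 and `minimum` are floored too, not only
    negative ones; this keeps every entry positive so log2 is defined.
    """
    return [[max(value, minimum) for value in row] for row in matrix]


# Rows are genes, columns are samples (made-up values).
raw = [[-12.0, 150.0, 19.0],
       [300.0, -5.0, 42.0]]
floored = floor_expression(raw)
print(floored)  # [[20.0, 150.0, 20.0], [300.0, 20.0, 42.0]]
```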
3.3.2. Intensity Ratio
This is the simplest approach: the intensity ratio is calculated by dividing the intensity of a
gene in the sample by the intensity of the same gene in the control. The formula for two-color
data is
$$T_K = \frac{R_K}{G_K}$$
where $T_K$ is the expression ratio, $R_K$ is the spot intensity metric for the test sample,
and $G_K$ is the spot intensity metric for the control sample. The intensity ratio is one for
unchanged expression, less than one for down-regulated genes, and greater than one for
up-regulated genes.
3.3.3. Log Ratio
A log transformation can be applied to the absolute value or the intensity ratio to make the
data more symmetric (normal-like). The most commonly used log transformation is base 2 (log2).
The log ratio is calculated simply by taking log2 of the intensity ratio:
$$\text{log ratio} = \log_2(\text{intensity ratio})$$
After the log transformation, unchanged expression is zero, up-regulated genes take positive
values, and down-regulated genes take negative values; in both directions the magnitude ranges
from zero to infinity.
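The symmetry introduced by the log transformation can be checked with a short sketch (Python, for illustration only; the function names and intensities are hypothetical). A four-fold increase and a four-fold decrease give ratios of 4 and 0.25, but log ratios of +2 and -2.

```python
import math


def intensity_ratio(test, control):
    """Intensity ratio: test-sample intensity over control-sample intensity."""
    return test / control


def log_ratio(test, control):
    """log2 of the intensity ratio; zero means unchanged expression."""
    return math.log2(intensity_ratio(test, control))


# (test, control) intensity pairs: up-regulated, unchanged, down-regulated.
for test, control in [(400.0, 100.0), (100.0, 100.0), (100.0, 400.0)]:
    print(test, control, intensity_ratio(test, control), log_ratio(test, control))
```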
3.3.4. Fold change
Fold change is another way to make the intensity ratio more symmetric. The fold change is equal
to the intensity ratio when the ratio is greater than one. When the ratio is below one, the
fold change is equal to the inverse of the intensity ratio.
$$\text{fold change} = \begin{cases} T_K = R_K / G_K & \text{for values} > 1 \\ T_K' = \dfrac{1}{R_K / G_K} & \text{for values} < 1 \end{cases}$$
As with the log transformation, the fold change makes the distribution more symmetric. Both up-
and down-regulated genes take values from one to infinity, so the direction of the change has
to be recorded separately.
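A minimal sketch of the fold change as defined above (Python, for illustration only; `fold_change` is a hypothetical name). Because the inverse is taken for ratios below one, a four-fold increase and a four-fold decrease both give a fold change of 4.

```python
def fold_change(test, control):
    """Intensity ratio if it is at least one, otherwise its inverse."""
    ratio = test / control
    return ratio if ratio >= 1.0 else 1.0 / ratio


print(fold_change(400.0, 100.0))  # 4.0 (up-regulated)
print(fold_change(100.0, 400.0))  # 4.0 (down-regulated)
print(fold_change(100.0, 100.0))  # 1.0 (unchanged)
```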
3.4. Removal of genes with less than two fold change
Generally, the goal of a microarray experiment is to find a few genes for further study of a
biologically interesting phenomenon. It is therefore practical to remove the uninteresting part
of the data, i.e., genes that do not show any expression change during the experiment, before
classification or clustering analysis. The intensity ratio cut-offs for uninteresting
(unchanged) genes are usually set at 0.5 and 2.0. One way is to remove genes with less than a
two-fold change as just calculated. The other way is to find the minimum and maximum expression
values of a particular gene across the different experiments and to remove that gene from all
experiments if the fold change is less than two.
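The second filter could be sketched as follows (Python, for illustration only). The sketch assumes that the fold change of a gene is its maximum value divided by its minimum value across experiments, and that all values are already positive (for example after replacing negative values by a minimum); the function name and the data are hypothetical.

```python
def filter_by_fold_change(expression, min_fold=2.0):
    """Keep genes whose max/min expression across experiments is >= min_fold.

    expression: dict mapping gene id -> list of positive expression values.
    """
    kept = {}
    for gene, values in expression.items():
        if max(values) / min(values) >= min_fold:
            kept[gene] = values
    return kept


# Made-up expression values for three genes across three experiments.
expression = {
    "geneA": [100.0, 450.0, 120.0],   # 4.5-fold: kept
    "geneB": [200.0, 210.0, 260.0],   # 1.3-fold: removed
    "geneC": [20.0, 40.0, 25.0],      # exactly 2-fold: kept
}
kept = filter_by_fold_change(expression)
print(sorted(kept))
```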
3.5. Significance/Finding predictor gene
Up- and down-regulated genes can be found more effectively if the chips/samples are replicated.
To find such genes, and to construct a classifier, this application uses the absolute value of
the two-sample t-statistic with unequal variances. This gives the statistical significance of
each gene and makes it easy to identify differentially expressed genes as well as the best
predictor genes. The t statistic is calculated as
$$t = \frac{\bar{X}_1 - \bar{X}_2}{s_{\bar{X}_1 - \bar{X}_2}}$$
where
$$s_{\bar{X}_1 - \bar{X}_2} = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$
Here $\bar{X}$ is the sample mean, $s^2$ is the unbiased estimator of the variance of each of
the two samples, $n$ is the number of participants, and the subscripts 1 and 2 denote group one
and group two. For use in significance testing, the distribution of the test statistic is
approximated as an ordinary Student's t distribution with the degrees of freedom calculated
using
$$\mathrm{d.f.} = \frac{\left(s_1^2/n_1 + s_2^2/n_2\right)^2}{\left(s_1^2/n_1\right)^2/(n_1 - 1) + \left(s_2^2/n_2\right)^2/(n_2 - 1)}$$
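The unequal-variance t statistic and its approximate degrees of freedom can be computed with the standard library alone, as in the sketch below (Python, for illustration only; `welch_t` and the sample values are hypothetical). The sketch stops at the statistic and the degrees of freedom; obtaining a p value additionally requires the Student's t distribution, which is not part of this sketch.

```python
import math
from statistics import mean, variance


def welch_t(group1, group2):
    """Two-sample t statistic with unequal variances and its degrees of freedom."""
    n1, n2 = len(group1), len(group2)
    # statistics.variance is the unbiased (n - 1) estimator.
    v1 = variance(group1) / n1
    v2 = variance(group2) / n2
    t = (mean(group1) - mean(group2)) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df


# Made-up expression values of one gene in two groups of samples.
group_one = [1.0, 2.0, 3.0, 4.0, 5.0]
group_two = [2.0, 4.0, 6.0, 8.0, 10.0]
t, df = welch_t(group_one, group_two)
print(round(t, 4), round(df, 4))
```

Ranking genes by the absolute value of `t` then gives a candidate list of predictor genes.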
3.6. Classification
The kNN algorithm uses the genes with the highest prediction strength, determined from the
training set in the previous step, and then applies the kNN rule to classify each test sample.
The number of nearest neighbors, K, is user-defined; the application prompts the user for K.
Each sample from the test set is classified by finding its k nearest training samples based on
the Euclidean distance of normalized expression intensity. The class membership of the k
nearest neighbors is enumerated, with each neighbor casting a vote for its class. The class
with the higher number of votes wins. With two classes, an odd K always produces a winner; if K
is even, an equal number of votes leaves the sample unclassified.
3.6.1. K-nearest Neighbor algorithm:
a) Finds the k nearest samples (in Euclidean distance) in the training set to the new sample
to be classified. At this step the user is also prompted to choose whether the Euclidean
distance is calculated between samples or between genes.
b) Determines the proportion of neighbor samples from each class and casts a 'vote' for
each class.
c) Applies the majority rule to the votes.
d) Allows a "no prediction" result if K is even and the votes are equal.
3.6.2. Calculation of Euclidean distance
Euclidean distance is one of the common distance measures used to calculate similarity between
expression profiles. For example, the Euclidean distance between two points in two dimensions,
say A = [a1, a2] and B = [b1, b2], is calculated as:
$$D_{euc}(A, B) = \sqrt{(a_1 - b_1)^2 + (a_2 - b_2)^2}$$
Thus, for genes with expression data available for n conditions, represented as A =
[a1, a2, a3, ..., an] and B = [b1, b2, b3, ..., bn], the Euclidean distance is calculated as:
$$D_{euc}(A, B) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}$$
In other words, the Euclidean distance between two genes is the square root of the sum of the
squared differences between their values in each condition.
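The kNN rule with Euclidean distance and tie handling can be sketched as follows (Python, for illustration only; not MDA's Perl code). The function names, the two-gene profiles, and the use of `None` for "no prediction" are assumptions of the sketch.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Square root of the sum of squared differences between two profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_classify(sample, training, k):
    """Classify `sample` by majority vote of its k nearest training samples.

    training: list of (profile, label) pairs.
    Returns the winning label, or None when the top two classes tie.
    """
    neighbors = sorted(training, key=lambda item: euclidean(sample, item[0]))[:k]
    votes = Counter(label for _, label in neighbors).most_common()
    if len(votes) > 1 and votes[0][1] == votes[1][1]:
        return None  # equal votes: no prediction
    return votes[0][0]


# Made-up normalized expression profiles (two predictor genes per sample).
training = [
    ([1.0, 1.0], "ALL"), ([1.2, 0.8], "ALL"), ([0.9, 1.1], "ALL"),
    ([5.0, 5.0], "AML"), ([5.2, 4.9], "AML"), ([4.8, 5.1], "AML"),
]
print(knn_classify([1.1, 1.0], training, k=3))
print(knn_classify([5.0, 5.1], training, k=3))
print(knn_classify([3.0, 3.0], training, k=6))
```

With k = 6 every training sample votes, the two classes tie at three votes each, and the last call yields no prediction.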
3.7. Cross Validation
Cross validation tests how well the predictor genes and the prediction rule discriminate
between classes. The cross-validation method used here, also known as the drop-one-out or
leave-one-out approach, removes one sample from the training set and uses it as a test sample.
The remaining training samples are used to predict the class of the removed sample. The best
predictor genes obtained in the training phase are also selected from the test samples for
further analysis.
Example of cross-validation in the AML/ALL data set (38 training samples total):
a) Remove one leukemia sample.
b) Predict the membership of the removed leukemia sample using the prediction rule and
data from the remaining 37 training samples.
c) Return the removed sample to the training set and remove another sample.
d) Repeat steps b and c until all samples have been predicted.
This leave-one-out approach detects samples whose expression differs from that of other samples
in the same group. Thus, potential outlier samples can be detected during cross validation.
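The leave-one-out loop can be sketched independently of the prediction rule (Python, for illustration only). Here a simple 1-nearest-neighbor rule stands in for the report's kNN classifier, and the names and sample values are hypothetical; the last sample is deliberately placed away from its own class to show how an outlier is flagged.

```python
import math


def nearest_neighbor(sample, training):
    """Stand-in prediction rule: label of the single closest training sample."""
    return min(training, key=lambda item: math.dist(sample, item[0]))[1]


def leave_one_out(samples, classify):
    """Predict each sample from all the others; return (index, true, predicted)."""
    results = []
    for i, (profile, label) in enumerate(samples):
        remaining = samples[:i] + samples[i + 1:]   # training set minus sample i
        results.append((i, label, classify(profile, remaining)))
    return results


# Made-up profiles; sample 6 is labelled ALL but lies closer to the AML group.
samples = [
    ([1.0, 1.0], "ALL"), ([1.2, 0.8], "ALL"), ([0.9, 1.1], "ALL"),
    ([5.0, 5.0], "AML"), ([5.2, 4.9], "AML"), ([4.8, 5.1], "AML"),
    ([3.5, 3.5], "ALL"),
]
results = leave_one_out(samples, nearest_neighbor)
misclassified = [i for i, true, predicted in results if true != predicted]
print(misclassified)  # [6]
```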
3.8. Clustering
Clustering organizes the data into a relatively small number of homogeneous groups. In
analyzing microarray data we are interested in changes in expression patterns rather than the
actual numeric changes, so we use normalized expression values. These methods are therefore
used to find similar expression motifs irrespective of the expression level. If the expression
profiles are correlated by shape, both low and high expression level genes can end up in the
same cluster. Clustering methods can be classified in many different ways, for example, into
supervised and unsupervised learning. Supervised learning methods assign predefined classes to
a data set, whereas unsupervised methods apply no prior assumptions. The most commonly used
clustering methods are hierarchical clustering, k-means, self-organizing maps (SOMs), and
principal component analysis (PCA).
3.8.1. Hierarchical Clustering
Hierarchical clustering is a statistical method for finding relatively similar clusters. It is
divided into two phases. First, a distance matrix containing all the pairwise distances between
the genes is calculated. Then the hierarchical algorithm iteratively joins the two closest
clusters, starting from single-gene clusters (a bottom-up approach). After each step, the
distances between the newly formed cluster and the other clusters are recalculated. The best
predictor genes found in the class prediction phase were used for clustering. Given a set of N
genes to be clustered and an N x N distance (or similarity) matrix, hierarchical clustering is
performed as follows:
a) Assign each gene to a cluster of its own.
b) Find the closest pair of clusters and merge them into a single cluster.
c) Compute the distances (similarities) between the new cluster and each of the old
clusters using the single linkage method.
d) Repeat steps b and c until all genes are clustered.
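The steps above can be sketched as a small single-linkage routine (Python, for illustration only; not MDA's Perl code). For brevity the sketch recomputes cluster distances from the profiles at every stage instead of updating a stored distance matrix, which gives the same merges; the function name and the four made-up profiles are hypothetical.

```python
import math


def single_linkage(profiles):
    """Agglomerative clustering with single linkage.

    profiles: dict mapping gene id -> expression profile.
    Returns one (stage, distance, cluster_1, cluster_2) tuple per merge.
    """
    clusters = [[name] for name in profiles]   # a) each gene starts alone
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # c) single linkage: distance between the closest members
                d = min(math.dist(profiles[a], profiles[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best                          # b) closest pair of clusters
        merges.append((len(merges), d, clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)                 # d) repeat until one cluster remains
    return merges


profiles = {"g1": [0.0, 0.0], "g2": [0.0, 1.0], "g3": [5.0, 5.0], "g4": [5.0, 7.0]}
merges = single_linkage(profiles)
for stage, distance, cluster_1, cluster_2 in merges:
    print(stage, round(distance, 4), cluster_1, cluster_2)
```

Each printed row has the same shape as the stage, distance, cluster 1, cluster 2 listing used for the clustering results in this report.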
3.8.2. K-Means Clustering
K-means, a non-hierarchical clustering method, is a least-squares partitioning method for which
the number of groups, K, has to be provided or predetermined. The algorithm computes the
centroid of each cluster and assigns each object to the nearest centroid. However, it is also
possible to estimate K (the number of groups) from the data, treating this as a mixture density
estimation problem; for example, one can use the results of the hierarchical clustering.
Briefly, the k-means algorithm is:
a) The number of clusters can be chosen randomly or estimated by first performing a
hierarchical clustering of the data. (Here the data are divided randomly into two
clusters.)
b) Initialization is performed by calculating the average expression profile (centroid)
of each cluster.
c) Individual objects are reassigned from one cluster to the other depending on
which centroid is closer to the gene or sample.
d) This procedure of calculating the centroid of each cluster and regrouping objects
around the nearest centroid is repeated for a fixed number of iterations, or until
convergence (the state in which the composition of the clusters is unaltered by
further iterations).
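A minimal sketch of this procedure follows (Python, for illustration only; not MDA's Perl code). The initial split is random unless an explicit `initial` assignment is passed, which the example does to keep the run reproducible. The function names, the `initial` and `seed` parameters, the re-seeding of an empty cluster with a random point, and the data are all assumptions of the sketch.

```python
import math
import random


def centroid(points):
    """Average profile of a non-empty list of points."""
    return [sum(column) / len(points) for column in zip(*points)]


def k_means(points, k=2, initial=None, max_rounds=100, seed=0):
    """Return the cluster index of each point after k-means converges."""
    rng = random.Random(seed)
    # a) random initial split, unless one is supplied
    assignment = list(initial) if initial is not None else [rng.randrange(k) for _ in points]
    for _ in range(max_rounds):
        centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            # b) centroid of each cluster; an empty cluster is re-seeded (assumption)
            centroids.append(centroid(members) if members else list(rng.choice(points)))
        # c) reassign every point to its nearest centroid
        new_assignment = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                          for p in points]
        if new_assignment == assignment:   # d) converged: composition unchanged
            break
        assignment = new_assignment
    return assignment


points = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
print(k_means(points, initial=[0, 1, 0, 1, 0, 1]))  # [0, 0, 0, 1, 1, 1]
```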
4. Results and Discussion
4.1. Samples:
To test this application, the data described by Golub et al. (1999) were used. There were 72
samples in total, each with 7129 genes, obtained from microarray experiments using Affymetrix
high-density oligonucleotide arrays. Each sample belongs to one of two leukemia types, acute
lymphoblastic leukemia (ALL) or acute myeloid leukemia (AML). The data were collected in two
separate phases, which allows a natural definition of training and test sets: a training set of
38 samples (27 ALL vs. 11 AML) and a test set of 34 samples (20 ALL vs. 14 AML). Golub et al.
originally applied clustering, neighborhood analysis, and a voting scheme to address
visualization, correlation, and predictive classification, respectively. Their class prediction
correctly classified 29 of 34 samples into the ALL/AML subtypes; the remaining five were not
classified because they were considered too close to call.
4.2. Results:
4.2.1. Primary analysis.
To verify the Microarray Data Analysis (MDA) application developed here, previously published
data sets were used. As mentioned, the application first retrieves the gene expression values
and then performs data normalization and transformation. After removal of the housekeeping and
miscellaneous control genes, this step leaves 6454 genes. Applying the t-test to the training
set, assuming unequal variances, yields 1640 genes (23.29%) that are significant at p value ≤
0.05. These are also the best predictor genes; of these 1640 genes, the top 50-100 with the
lowest p values are used for classification.
4.2.2. kNN classification
The best-performing classifier on the training set uses 100 genes and k = 6 neighbors. Applied
to the test set, this classifier correctly classifies 25/35 samples (71%): 17 of 21 ALL samples
and 8 of 14 AML samples. Four samples could not be assigned to either group. Applied to the
training set itself, it correctly classifies 21 of 27 ALL samples and 7 of 11 AML samples; 4
samples could not be assigned to either group and 6 were misclassified (a summary is shown
below and detailed results are given in Appendix c, Table 1).
Results for Test Sample
                        Classification       True              Classification
                        (~100 Genes)         Classification    (~1500 Genes)
                        (pVal = 0.00012)                       (pVal = 0.05)
ALL Total               17                   21                16
AML Total               8                    14                11
Unclassified            4                                      4
Wrong Classification    6

Results for Training Sample
                        Classification       True              Classification
                        (~100 Genes)         Classification    (~1500 Genes)
                        (pVal = 0.00012)                       (pVal = 0.05)
ALL Total               21                   27                21
AML Total               7                    11                6
Unclassified            4                                      4
Wrong Classification    6                                      7
Summary table 1: Classification results using kNN.
The robustness of classification performance was also explored under different choices of the
number of genes selected and the number of neighbors used. The ALL samples are classified
relatively well under any set of parameters, except for a large number of genes with K = 1. The
AML samples are more difficult to classify, with 4 to 7 classified correctly. Analysis of the
AML samples reveals that the mistakes are concentrated on two of the AML samples.
4.2.3. Hierarchical Clustering
The samples from both the training and test data sets were clustered using the 100 genes with
the highest absolute t-statistics in the training set. A snapshot of the results is shown in
summary table 2 below, and detailed results are given in Appendix d, Table 2. In general, the
genes from one group cluster well together. Analysis of these results revealed that ALL samples
have higher expression than AML samples.
STAGE   DISTANCE       CLUSTER 1           CLUSTER 2
0       3.90238609     M55150_at           U50136_rna1_at
1       4.736685142    HG1612-HT1612_at    M31303_rna1_at
2       4.756178629    M31523_at           U29175_at
3       4.882651946    M55150_at           M81695_s_at
...
46      16.74044878    M80254_at           U82759_at
47      17.26896308    M80254_at           X66533_at
48      17.9501259     M27891_at           M28130_rna1_s_at
49      21.20353106    M19507_at           M28130_rna1_s_at
50      22.7866839     M89957_at           X66533_at
51      28.170148      M12959_s_at         M28130_rna1_s_at
Summary table 2: Hierarchical Clustering results of best predictor genes.
4.2.4. K-means Clustering
The samples from both the training and test data sets were clustered using the 100 genes with
the highest absolute t-statistics in the training set. A snapshot of the results is shown in
summary table 3 below. For example, to cluster genes, the genes are first divided randomly into
two clusters and the k-means algorithm is then performed. More than 90% of the genes stabilize
into clusters and stop moving after 8-9 rounds of clustering, although one or two genes keep
moving from one cluster to the other, probably because they are similarly distant from both
centroids. Further visual analysis of these clusters shows that 70-72% of the genes are best
predictor genes of their own class.
STAGE   Cluster1   Cluster2
1       53         52
2       92         13
3       82         23
4       87         18
5       79         26
6       72         33
7       71         34
8       70         35
9       71         34
10      70         35
11      71         34
12      70         35
13      71         34
100     70         34
Summary table 3: K-means Clustering results of best predictor genes.
Conclusion
Here, I have developed an application, MDA, to analyze the raw gene expression data generated
by microarray experiments using Affymetrix chips. I first determined empirically which methods
to choose for data normalization and transformation, classification, and clustering, and then
implemented them. Overall, this application provides a basic framework for analyzing gene
expression data: it performs primary data analysis, graphical representation of the data,
statistical analysis, classification, and clustering. The implementation of the approach used
in MDA was verified by analyzing the data published by Golub et al. It should be noted that
Golub et al. were able to correctly classify 29/34 samples, while this application correctly
classified only 24 samples. Clustering by the hierarchical and k-means methods shows that
70-75% of the genes cluster very well with their own class. This also suggests that the
remaining genes are still background noise and that data normalization and transformation
require further improvement. Nevertheless, these results validate the correct implementation of
the algorithms. At present, file input and output in this application are sequential in nature,
so future work is to develop a user-friendly application offering a wider choice of statistics
and algorithms for analyzing the data.
Bibliography
a) Genomic Perl. Rex A. Dwyer.
b) An Introduction to Microarray Data Analysis. M. Madan Babu.
c) Molecular classification of cancer: Class discovery and class prediction by gene expression
monitoring. T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov,
H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, E. S. Lander.
Science, vol. 286, 15 October 1999.
Appendices
a) Background: central dogma of molecular biology
[Figure: central dogma diagram with the labels "Replication", "Transcription", "Translation",
and "Measured by Microarray".]
The central dogma of molecular biology relates DNA, RNA, and proteins. Put simply, DNA is
transcribed into RNA, which is then translated into protein. Briefly, the central dogma makes
the following claims (1):
• The amino acid sequence of a protein provides an adequate "blueprint" for the protein's
production.
• Protein blueprints are encoded in DNA in the chromosomes. The encoded blueprint for a
single protein is called a gene.
• A dividing cell passes on the blueprints to its daughter cells by making copies of its DNA in
a process called replication.
• The blueprints are transmitted from the chromosomes to the protein factories in the cell in
the form of RNA. The process of copying DNA into RNA is called transcription.
• The RNA blueprints are read and used to assemble proteins from amino acids in a process
known as translation.
b) What Is a Microarray
Microarray technology takes advantage of the human genome sequencing project and compares the
expression of genes in a sample against known genes. A microarray is typically a glass slide
(Figure 1) onto which known DNA molecules are fixed in an orderly manner at specific locations
called spots or features. A single slide may contain thousands of spots, and each spot may
contain a few million copies of identical DNA molecules that uniquely correspond to a gene. The
DNA in a spot may be either genomic DNA or a short stretch of oligonucleotide strands that
correspond to a gene. To accommodate more and more unique DNA, the size of the DNA is getting
smaller and smaller. The spots are fixed onto the glass slide by the process of
photolithography. (2)
Microarrays have many applications in both basic and clinical research; e.g., they are used to
compare gene expression levels in normal vs. cancerous tissue or in drug-treated vs. untreated
cells, or for time-course studies of particular cells or tissues. They have been widely used to
compare the expression levels of drug-treated cells (condition A) with those of untreated cells
(condition B).
The steps involved are shown schematically in Figure 1B. First, mRNA, a type of nucleic acid,
is extracted from the cells and reverse transcribed into cDNA using the enzyme reverse
transcriptase (normally DNA is transcribed into RNA, but here RNA is used to make DNA, hence
"reverse transcribed"). Next, the cDNAs are labeled with different fluorescent dyes, e.g., cDNA
from condition A with a red dye and cDNA from condition B with a green dye. The differentially
labeled cDNAs are then allowed to hybridize (bind) to the DNA on the glass slide. A cDNA binds
to the DNA in a spot if and only if the two sequences are complementary. The amount of cDNA
bound to a spot, and the fluorescence emitted, is directly proportional to the initial number
of RNA molecules present for that gene in both samples. Fluorescence is excited by a laser at a
wavelength suitable for detecting the red and green fluorescent dyes. For instance, if cDNA
from condition A for a particular gene is in greater abundance than that from condition B, the
spot appears red. If it is the other way around, the spot appears green. If the gene is
expressed to the same extent in both conditions, the spot appears yellow, and if the gene is
not expressed in either condition, the spot appears black. Thus, what is seen at the end of the
experimental stage is an image of the microarray in which each spot that corresponds to a gene
has an associated fluorescence value representing the relative expression level of that gene.
Figure 1. A) Microarray representation and B) schematic of the experimental protocol to study
differential expression of genes. (Adapted from "An Introduction to Microarray Data
Analysis", M. Madan Babu.)
c) Table 1: Detailed results of kNN classification
SAMPLE
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
ALL
AML
VOTES VOTES
6
0
2
4
6
0
3
3
2
4
2
4
6
0
2
4
4
2
2
4
6
0
2
4
6
0
2
4
6
0
2
4
6
0
2
4
6
0
3
3
6
0
3
3
6
0
2
4
4
2
3
3
6
0
2
4
6
0
2
4
6
0
2
4
6
0
3
3
2
4
ALL Total
AML Total
Unclassified
Wrong Classification
Results for Test Sample
Classification
Classification
(~100 Genes)
(~1500
True
(pVal =
Genes)
0.00012)
(pVal = 0.05)
Classification
ALL
ALL
ALL
AML
ALL
AML
ALL
ALL
ALL
ALL
ALL
ALL
AML
ALL
AML
ALL
ALL
ALL
ALL
ALL
AML
AML
ALL
ALL
ALL
ALL
ALL
ALL
ALL
ALL
ALL
ALL
ALL
AML
ALL
AML
ALL
ALL
ALL
ALL
ALL
ALL
ALL
ALL
ALL
ALL
ALL
ALL
ALL
ALL
ALL
AML
ALL
AML
ALL
ALL
ALL
ALL
ALL
ALL
ALL
ALL
ALL
AML
AML
AML
AML
AML
AML
AML
ALL
AML
ALL
AML
ALL
AML
ALL
AML
AML
AML
AML
AML
AML
AML
AML
AML
AML
AML
ALL
AML
AML
AML
AML
AML
AML
AML
AML
17 21
8 14
4
6
16
11
4
Results for Training
Classification
(~100 Genes)
True
(pVal =
0.00012)
Classification
ALL
ALL
ALL
ALL
ALL
ALL
ALL
ALL
ALL
ALL
ALL
ALL
ALL
ALL
ALL
ALL
ALL
ALL
ALL
ALL
ALL
ALL
ALL
ALL
ALL
ALL
ALL
ALL
ALL
ALL
AML
ALL
ALL
ALL
AML
ALL
ALL
ALL
ALL
ALL
ALL
ALL
ALL
AML
ALL
AML
ALL
ALL
ALL
ALL
ALL
ALL
AML
AML
AML
AML
AML
AML
AML
AML
AML
AML
AML
AML
AML
AML
ALL
AML
AML
AML
ALL
AML
21
7
4
6
Sample
Classification
27
11
(~1500
Genes)
(pVal = 0.05)
ALL
ALL
ALL
ALL
ALL
ALL
ALL
ALL
ALL
ALL
ALL
AML
ALL
ALL
ALL
ALL
ALL
AML
ALL
ALL
ALL
AML
AML
ALL
ALL
AML
AML
AML
ALL
AML
AML
ALL
AML
ALL
21
6
4
7
d) Table 2. Clustering results of predictor genes from the training sets.
STAGE   DISTANCE       CLUSTER 1           CLUSTER 2
0       3.90238609     M55150_at           U50136_rna1_at
1       4.736685142    HG1612-HT1612_at    M31303_rna1_at
2       4.756178629    M31523_at           U29175_at
3       4.882651946    M55150_at           M81695_s_at
4       4.908484268    HG1612-HT1612_at    M92287_at
5       5.182083996    X74801_at           U29175_at
6       5.233545688    M23197_at           X70297_at
7       5.602330349    Z15115_at           U22376_cds2_s_at
8       5.935037805    M91432_at           X74262_at
9       6.033568191    M91432_at           X74801_at
10      6.038776923    L13278_at           X52142_at
11      6.083551181    U20998_at           M12959_s_at
12      6.383679177    Y12670_at           M81695_s_at
13      6.556810674    M23197_at           U12471_cds1_at
14      6.658057996    M91432_at           U32944_at
15      6.714705754    M77142_at           X15949_at
16      6.9358182      M91432_at           U62136_at
17      7.307777142    L08246_at           X04085_rna1_at
18      7.465346691    L13278_at           M83233_at
19      7.47753534     M92287_at           U22376_cds2_s_at
20      7.865546755    L13278_at           M77142_at
21      7.899660363    L08246_at           X17042_at
22      8.147036047    L08246_at           M63138_at
23      8.169115014    M37435_at           U12471_cds1_at
24      8.538012669    L13278_at           X66533_at
25      8.61081586     M16038_at           Y12670_at
26      9.031163475    M81933_at           U12471_cds1_at
27      9.815361402    M16038_at           L09209_s_at
28      10.0079682     U22376_cds2_s_at    M12959_s_at
29      10.21339449    U90546_at           M28170_at
30      10.47210596    D10495_at           X07743_at
31      10.76822899    M80254_at           M81933_at
32      10.82831893    M16038_at           X95735_at
33      10.83775045    X82240_rna1_at      L33930_s_at
34      11.02406968    D10495_at           M80254_at
35      11.24009344    D88422_at           U46499_at
36      11.42818776    X66533_at           U90546_at
37      11.61143783    M89957_at           L33930_s_at
38      11.68816622    L08246_at           M12959_s_at
39      12.02556804    M16038_at           X85116_rna1_s_at
40      12.02844307    U46499_at           M27783_s_at
41      12.89215862    U82759_at           X85116_rna1_s_at
42      13.22957654    U62136_at           U82759_at
43      13.36849708    M28130_rna1_s_at    Y00787_s_at
44      14.21324986    M80254_at           M27783_s_at
45      14.57555099    M80254_at           X61587_at
46      16.74044878    M80254_at           U82759_at
47      17.26896308    M80254_at           X66533_at
48      17.9501259     M27891_at           M28130_rna1_s_at
49      21.20353106    M19507_at           M28130_rna1_s_at
50      22.7866839     M89957_at           X66533_at
51      28.170148      M12959_s_at         M28130_rna1_s_at
52      30.04587055    X66533_at           M28130_rna1_s_at