Analysis of Gene Expression Data Using MATLAB Software
R. Priscilla#1, C.N Prashantha*2, S. Swamynathan#3
# Department of Computer Science and Engineering, Anna University Chennai, Guindy, Chennai, India
1 [email protected]
3 [email protected]
* Center for Bioinformatics Research Institute, Chennai, 203/1, Arcot Road, Vadapalani, Chennai, India
2 [email protected]
Abstract- In recent years, there have been various efforts to overcome the limitations of standard clustering approaches for the analysis of gene expression data by grouping genes and samples simultaneously. The underlying concept, often referred to as biclustering, identifies sets of genes that share compatible expression patterns across subsets of samples, and its usefulness has been demonstrated for different organisms and datasets. Several biclustering methods have been proposed in the literature; however, it is not clear how the different techniques compare with each other with respect to the biological relevance of the clusters or to other characteristics such as robustness and sensitivity to noise. Accordingly, no guidelines concerning the choice of biclustering method are currently available. MATLAB software offers several options for clustering microarray data, including data import, normalization and standardization, but there is no information on biclustering with visualization of plots based on parallel co-ordination methods.
Results- First, this paper provides information on clustering and biclustering algorithms, which are compared and validated using simple binary reference models. The bi-clustering analysis is based on subsets of groups: the data are first classified with a clustering method, and the clustering and bi-clustering groups are then compared in MATLAB software on normalized and standardized data. The software is implemented to automate data import, normalization, standardization and visualization with parallel co-ordination using bi-clustering programs and algorithms.
Conclusion- Once microarray data are imported into MATLAB software, cluster groups and bi-cluster groups can be calculated easily, and parallel co-ordination plots make the resulting gene expressions easy to understand.
Key words- Clustering, Bi-Clustering, Parallel co-ordination, MATLAB implementation.
I. INTRODUCTION
Clustering has rapidly become one of the most widely used and longest-established methods for analysing microarray gene expression data. The literature describes a large number of clustering algorithms, but little attention has been paid to uncertainty in the results obtained. In clustering, the patterns of expression of different genes across time, treatments, tissues and color intensities are grouped into distinct clusters (for example by hierarchical or k-means clustering), in which genes in the same cluster are assumed to be potentially functionally related or to be influenced by a common upstream factor. Such cluster structure is often used to aid the elucidation of regulatory networks.
Agglomerative hierarchical clustering [1] is
one of the most frequently used methods for
clustering gene expression profiles. However,
commonly used methods for agglomerative
hierarchical clustering rely on the setting of some
score threshold to distinguish members of a particular
cluster from non-members, making the determination
of the number of clusters arbitrary and subjective.
The algorithm provides no guide to choosing the
"correct" number of clusters or the level at which to
prune the tree. It is often difficult to know which
distance metric to choose, especially for structured
data such as gene expression profiles. Moreover,
these approaches do not provide a measure of
uncertainty about the clustering, making it difficult to
compute the predictive quality of the clustering and
to make comparisons between clusterings based on
different model assumptions (e.g. numbers of
clusters, shapes of clusters, etc.). Attempts to address
these problems in a classical statistical framework
have focused on the use of bootstrapping [4,5] or the use of permutation procedures to calculate local p-values for the significance of branching in a dendrogram produced by agglomerative hierarchical clustering [6,7].
This paper surveys the clustering and bi-clustering algorithms already present in the literature (Table I); these algorithms can be implemented in MATLAB software. We listed seven different types of clustering algorithms: single linkage (SL), complete linkage (CL), average linkage (AL), k-means (KM), mixture of multivariate Gaussians (FMG), spectral clustering (SPC) and shared nearest neighbor-based clustering (SNN). When applicable, we use four proximity measures together with these methods: Pearson's correlation coefficient (P), cosine (C), Spearman's correlation coefficient (SP) and Euclidean distance (E). Regarding Euclidean distance, we employ the data in four different versions: original (Z0), standardized (Z1), scaled (Z2) and ranked (Z3).

Table I
List of clustering and Bi-clustering Algorithms
Clustering Algorithms                          Bi-Clustering Algorithms
k-Center                                       Block clustering
k-Median/k-Median squared/Facility Location    CTWC
Hierarchical Clustering                        ITWC
Clustering Large Data Sets                     δ-bicluster
Clustering Data Streams                        δ-pCluster
Spectral Clustering                            δ-pattern
Conceptual Clustering                          FLOC
Bi-clustering                                  OPC
Correlation Clustering                         Plaid Model
Clustering with Outliers                       OPSMs
Clustering Moving Points                       Gibbs
SVM Clustering                                 SAMBA
Catalog Segmentation                           Robust Biclustering Algorithm (RoBA)
Community Discovery                            Crossing Minimization, cMonkey
Axioms of Clustering                           PRMs
Cluster Evaluation                             DCC
Model-based Clustering                         LEB (Localize and Extract Biclusters)
Categorical Clustering                         QUBIC (QUalitative BIClustering)
Projective Clustering                          BCCA (BiCorrelation Clustering Algorithm)
Dimension Reduction                            ZBDD
Scatter/Gather
Text Clustering
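The four proximity measures named above can be illustrated with a minimal sketch. It is written in Python rather than MATLAB because the paper does not list its MATLAB code, and the function names (`pearson`, `cosine`, `spearman`, `euclidean`, `ranks`) are hypothetical helpers introduced only for illustration. Spearman's coefficient is computed here as Pearson's coefficient on average ranks, which is the usual definition.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (undefined for constant input)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

def cosine(x, y):
    """Cosine of the angle between two expression profiles."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y)))

def ranks(x):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    result = [0.0] * len(x)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and x[order[end + 1]] == x[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for i in order[pos:end + 1]:
            result[i] = average_rank
        pos = end + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed profiles."""
    return pearson(ranks(x), ranks(y))

def euclidean(x, y):
    """Euclidean distance between two expression profiles."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
```

The correlation-based measures compare the shape of two profiles, whereas Euclidean distance is sensitive to their absolute level, which is presumably why the data are transformed before it is applied.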
Many clustering and bi-clustering algorithms have been developed by different research groups, and implementing them in MATLAB software supports a more advanced and automated understanding of gene expression through microarray data analysis [8]. Cheng and Church introduced a measure called the mean squared residue score to evaluate the quality of a bicluster, and it has become one of the most popular measures used to search for biclusters. The authors of [9] reviewed the basic concepts of the metaheuristic Greedy Randomized Adaptive Search Procedure (GRASP), namely its construction and local search phases, and proposed a variant called Reactive Greedy Randomized Adaptive Search Procedure (Reactive GRASP) to detect significant biclusters in large microarray datasets. The method has two major steps. First, high-quality bicluster seeds are generated by means of k-means clustering [10]. In the second step, these seeds are grown using Reactive GRASP, in which the basic parameter that defines the restrictiveness of the candidate list is self-adjusted depending on the quality of the solutions found previously.
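The mean squared residue score mentioned above can be sketched as follows. This is an illustrative Python version, not the paper's MATLAB code; the function name `mean_squared_residue` and the list-of-lists matrix layout are assumptions. The residue of an entry is its value minus its row mean and column mean plus the overall mean of the submatrix, and the score is the average squared residue.

```python
def mean_squared_residue(matrix, rows, cols):
    """Mean squared residue of the bicluster given by row and column indices."""
    sub = [[matrix[i][j] for j in cols] for i in rows]
    n_rows, n_cols = len(sub), len(sub[0])
    row_means = [sum(r) / n_cols for r in sub]
    col_means = [sum(sub[i][j] for i in range(n_rows)) / n_rows
                 for j in range(n_cols)]
    overall_mean = sum(row_means) / n_rows
    total = 0.0
    for i in range(n_rows):
        for j in range(n_cols):
            residue = sub[i][j] - row_means[i] - col_means[j] + overall_mean
            total += residue ** 2
    return total / (n_rows * n_cols)
```

A perfectly additive bicluster, in which every row is a shifted copy of every other row, scores zero; larger scores indicate less coherent submatrices.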
All these bi-clustering algorithms belong to a distinct class of clustering algorithms that perform simultaneous clustering of both rows and columns of the gene expression matrix, and they can be a very useful analysis tool when some genes have multiple functions and experimental conditions are diverse [11]. In MATLAB software, k-means clustering and hierarchical clustering, together with statistical calculations such as mean, standard deviation, correlation and regression, can be used as the basis for bi-clustering, and the results can be visualized as graphs based on parallel co-ordination methods.
Following this literature survey, the following methodologies are proposed in MATLAB software [13]: mainly a bi-clustering algorithm and visualization of the data based on parallel, antiparallel and neural network analysis and coordinate analysis.
The microarray data are preprocessed by taking the logarithm of the gene expression values and applying k-means clustering. Detected biclusters are then filtered according to specified requirements such as the minimum number of rows, minimum number of columns, maximum number of biclusters and maximum overlap [14,15] to obtain the differential gene expression values. These values are compared using regression and correlation calculations. Results based on gene expression difference and ratio matrices can be defined in MATLAB software. Other common functions include writing the biclustering results to text files.
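The filtering step described above can be sketched in a few lines. This is a Python illustration under stated assumptions, not the paper's MATLAB implementation: a bicluster is represented as a dictionary with `rows` and `cols` sets, overlap is measured as the shared cells divided by the size of the smaller bicluster, and larger biclusters are preferred. All names and the overlap definition are hypothetical choices.

```python
def overlap_fraction(a, b):
    """Share of the smaller bicluster's cells that also lie in the other one."""
    shared = len(a["rows"] & b["rows"]) * len(a["cols"] & b["cols"])
    smaller = min(len(a["rows"]) * len(a["cols"]),
                  len(b["rows"]) * len(b["cols"]))
    return shared / smaller if smaller else 0.0

def filter_biclusters(biclusters, min_rows, min_cols, max_count, max_overlap):
    """Keep at most max_count biclusters that are large enough and not too overlapping."""
    kept = []
    by_size = sorted(biclusters,
                     key=lambda b: len(b["rows"]) * len(b["cols"]),
                     reverse=True)
    for candidate in by_size:
        if len(candidate["rows"]) < min_rows or len(candidate["cols"]) < min_cols:
            continue  # fails the minimum-size requirements
        if any(overlap_fraction(candidate, k) > max_overlap for k in kept):
            continue  # overlaps too much with an already accepted bicluster
        kept.append(candidate)
        if len(kept) == max_count:
            break
    return kept
```

Processing the candidates from largest to smallest means that, when two biclusters overlap heavily, the larger one is the one retained.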
II. MATERIALS AND METHODS
A. Data Collection
The microarray data were collected from GEO (Gene Expression Omnibus) and SMD (Stanford Microarray Database). The example data set is Diabetes Nephropathy, with GEO entry GDS961, parent platform ID GPL91 and reference series GSE1009. The data set values were downloaded and imported into MATLAB software for further analysis.
B. Data import
The selected data from GEO can be imported into MATLAB software by using Microsoft Excel and an image analysis process. The data can be updated in the command prompt and workspace window. The analysis of this large amount of data, which uses a number of algorithms and calculations, is explained in the following steps.
C. Normalization
The samples to be normalized come from Affymetrix GeneChip microarrays, which use a single-label scheme and consist of several tens of thousands of probe sets. A probe set is a collection of probe pairs that interrogates the same sequence, or set of sequences, and contains 11-20 probe pairs of 25-mer oligonucleotides. Each pair contains the complementary sequence to the gene of interest, the so-called perfect match (PM), and a specificity control, called the mismatch (MM). MM probes are designed to discriminate non-specific hybridization. In order to analyze GeneChip data with multiple arrays, data preprocessing at the probe level is a critical step.
Global background correction uses a signal-and-noise (background) convolution model in which the PM intensity distribution is modeled by an exponentially distributed signal component S with parameter λ and a normally distributed background component B with mean μ and standard deviation σ:
$$ PM = S + B, \qquad S \sim \mathrm{Exp}(\lambda), \qquad B \sim N(\mu, \sigma^{2}), $$
$$ E(S \mid PM) = a + b\,\frac{\varphi\left(\frac{a}{b}\right) - \varphi\left(\frac{PM - a}{b}\right)}{\Phi\left(\frac{a}{b}\right) + \Phi\left(\frac{PM - a}{b}\right) - 1}, $$
written here in the standard RMA form with the abbreviations $a = PM - \mu - \sigma^{2}\lambda$ and $b = \sigma$ (taking λ as the rate of the exponential). E(S|PM) represents the background-corrected value of each PM, and φ and Φ are the normal density and cumulative distribution function, respectively. Positive signal components are estimated after adjustment of the background components. This background correction is implemented in the robust multi-array average (RMA).
D. Clustering
1) Hierarchical clustering
The hierarchical clustering of these data can be evaluated with three scores: the node score, the level score and the tree score. For the node score, each node specifies a cluster, and enrichment p-values are calculated to associate the node with one of the classes in the data. The significance p-value of observing k instances assigned by the algorithm to a given category in a set of n instances is given by the hypergeometric tail probability
$$ p = \sum_{x=k}^{n} \frac{\binom{K}{x}\binom{N-K}{n-x}}{\binom{N}{n}}, $$
where K is the total number of instances assigned to the class (the category) and N is the number of instances in the dataset. The p-values for all nodes and all classes may be viewed as dependent set estimations.
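As a worked illustration of the node score, the sketch below computes an enrichment p-value in Python, assuming it is the usual hypergeometric tail probability of seeing at least k instances of a class among the n instances of a node, when the class has K of the N instances in the dataset. The function name `enrichment_p_value` is hypothetical, and the paper's MATLAB implementation may differ.

```python
from math import comb

def enrichment_p_value(k, n, K, N):
    """P(at least k class members among n drawn), hypergeometric model."""
    upper = min(n, K)  # a node cannot hold more class members than exist
    tail = sum(comb(K, x) * comb(N - K, n - x) for x in range(k, upper + 1))
    return tail / comb(N, n)
```

For a dataset of six samples with three in the class of interest, a node that contains exactly those three samples gets `enrichment_p_value(3, 3, 3, 6)`, which is 1/20 = 0.05.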
In the level score, a level l of the tree contains all nodes that are separated by l edges from the root. Each level specifies a partition of the data into clusters. For each node, the class for which it has a significant node score is chosen, and the J-score is computed as
$$ J = \frac{tp}{tp + fn + fp}, $$
where tp is the number of true positive cases, fn the number of false negative cases and fp the number of false positive cases. If the node in question has been judged non-significant by the enrichment criterion, its J-score is set to null. The level score is defined as the average of all J-scores at the given level.
The tree score is defined as the weighted best-J-score
$$ J = \frac{1}{N}\sum_{i=1}^{c} n_i J_i^{*}, $$
where $J_i^{*}$ is the best J-score for class i in the tree, $n_i$ is the number of instances in class i, c is the number of classes and N is the number of instances in the dataset.
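The level and tree scores can be sketched as below. This is an illustrative Python version with hypothetical function names; it assumes that a null J-score (a non-significant node, represented here as `None`) counts as zero in the level average, and that the tree score is the class-size-weighted mean of the best J-scores. Neither assumption is confirmed by the text.

```python
def j_score(tp, fn, fp):
    """J = tp / (tp + fn + fp); returns 0.0 when there are no cases at all."""
    denominator = tp + fn + fp
    return tp / denominator if denominator else 0.0

def level_score(node_j_scores):
    """Average J-score over the nodes of one level; None (null) counts as 0."""
    if not node_j_scores:
        return 0.0
    return sum(score or 0.0 for score in node_j_scores) / len(node_j_scores)

def tree_score(best_j_per_class, class_sizes):
    """Best J-score of each class, weighted by the number of instances in the class."""
    total_instances = sum(class_sizes)
    weighted = sum(n * j for n, j in zip(class_sizes, best_j_per_class))
    return weighted / total_instances
```

With two classes of three instances each and best J-scores of 1.0 and 0.5, `tree_score([1.0, 0.5], [3, 3])` evaluates to 0.75.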
2) K-means Clustering
K-means clustering can be used to find cluster means (centroids) in noisy data. In its fuzzy form it minimizes the objective function
$$ J(K, m) = \sum_{k=1}^{K}\sum_{i=1}^{N} (u_{ki})^{m}\, d^{2}(x_i, c_k), $$
where K and N are the numbers of clusters and genes in the data set, m is a parameter that relates to the `fuzziness' of the resulting clusters, $u_{ki}$ is the degree of membership of gene $x_i$ in cluster k, and $d^{2}(x_i, c_k)$ is the squared distance from gene $x_i$ to centroid $c_k$.
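The quantities defined above combine into the standard fuzzy k-means objective, the sum over clusters k and genes i of the membership raised to the power m times the squared distance to the centroid. The sketch below evaluates that objective in Python; the use of squared Euclidean distance for d2 and the names `squared_distance` and `fuzzy_objective` are assumptions made for illustration.

```python
def squared_distance(x, c):
    """Squared Euclidean distance between a gene profile and a centroid."""
    return sum((a - b) ** 2 for a, b in zip(x, c))

def fuzzy_objective(genes, centroids, memberships, m=2.0):
    """Sum over clusters k and genes i of u_ki**m * d2(x_i, c_k).

    memberships[k][i] is the degree of membership of gene i in cluster k.
    """
    total = 0.0
    for k, centroid in enumerate(centroids):
        for i, gene in enumerate(genes):
            total += memberships[k][i] ** m * squared_distance(gene, centroid)
    return total
```

With crisp memberships (each gene fully in one cluster) this reduces to the ordinary k-means within-cluster sum of squares.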
E. Bi-Clustering
The ZBDD algorithm is used to identify biclusters in binary data, whose rows and columns contain only 0s and 1s. Zero-suppressed BDDs (ZBDDs) are a variant of ROBDDs that represent a set of combinations. A combination of n elements is an n-bit vector $(x_1, x_2, \ldots, x_n) \in B^{n}$, where $B = \{0, 1\}$. The i-th bit reports whether the i-th element is contained in the combination. Thus, a set of combinations can be represented by a Boolean function $f : B^{n} \to B$. A combination given by the input vector $(x_1, x_2, \ldots, x_n)$ is contained in the set if and only if $f(x_1, x_2, \ldots, x_n) = 1$.
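The correspondence between a set of combinations and a Boolean function can be made concrete with a short Python sketch. It uses a plain hash set rather than an actual ZBDD data structure, so it illustrates only the encoding, not the compression that ZBDDs provide; the helper names are hypothetical.

```python
def to_bit_vector(elements, n):
    """Encode a combination (set of element indices 0..n-1) as an n-bit tuple."""
    return tuple(1 if i in elements else 0 for i in range(n))

def make_characteristic_function(combinations, n):
    """Return f: B^n -> B with f(x) = 1 exactly when x encodes a stored combination."""
    members = {to_bit_vector(c, n) for c in combinations}

    def f(*bits):
        return 1 if tuple(bits) in members else 0

    return f
```

For example, storing the combinations {0, 2} and {1} over three elements gives a function with f(1, 0, 1) = 1 and f(0, 1, 0) = 1, and 0 for every other input.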
F. Parallel Co-ordination
The data calculated in MATLAB software are high dimensional, and one way to visualize them is the parallel coordinate (PC) plot. All axes are arranged parallel to each other on a 2D plane. An additive-related bicluster appears as a number of lines with the same slope across the conditions. Thus a bicluster formed by the column differences {C2-C1, C3-C1} and rows R1, R3, R5, R9 and R11 can be visualized by this type of arrangement in PC plots.
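The statement that an additive bicluster appears as lines with the same slope can be checked numerically. The following Python sketch, with hypothetical function names, computes the slope of each line segment between adjacent axes (assuming unit spacing between axes) and tests whether all rows share the same slopes.

```python
def segment_slopes(row):
    """Slopes between adjacent parallel axes, assuming unit axis spacing."""
    return [b - a for a, b in zip(row, row[1:])]

def is_additive_bicluster(rows, tol=1e-9):
    """True when every row has the same segment slopes as the first row."""
    reference = segment_slopes(rows[0])
    return all(
        all(abs(s - r) <= tol for s, r in zip(segment_slopes(row), reference))
        for row in rows[1:]
    )
```

Rows such as [1, 3, 2], [4, 6, 5] and [0, 2, 1] differ only by a constant offset, so their polylines in a PC plot are parallel and the check returns True.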
III. RESULTS
Various algorithms are used in bi-clustering methods to identify gene expression in Diabetes Nephropathy. The data set contains six samples, with log-normalized data from 6 experiments on 5040 genes. Many online resources are available for gene expression data; important ones include the Stanford Microarray Database website [10] and the Gene Expression Omnibus website. The input data for this work were obtained from the GEO website (Figure 1.1a and 1.1b).
Of the six diabetes nephropathy samples, 3 are controls and 3 are diabetes nephropathy disease samples (Figure 1.2). The data samples were normalized and analysed with the hierarchical clustering method, and the resulting data are represented by the node score, level score and tree score methods (Figure 1.3 and Figure 1.4). The raw data were summarized using statistical formulations, namely the mean and standard deviation, in k-means clustering (Figure 1.5a and 1.5b).
The two clustering results were compared using the ZBDD bi-clustering algorithm to observe the gene expression based on rigidity of sample. Down-regulated genes are selected as level 0 and up-regulated genes as level 1. The results for these genes show (a) the response time spent by each method to find all the embedded biclusters in synthetic data sets of various sizes and (b) the number of biclusters found by each method within the same time spent as our method (Figure 1.6a, 1.6b, 1.6c and 1.6d). The parallel co-ordination plot is used to visualize the different clustering results (Figure 1.7).
Figure 1.1a Diabetes Nephropathy sample data
Figure 1.1b Sample subsets
Figure 1.2 Import data into MATLAB software
Figure 1.3 Normalized data
Figure 1.4 Hierarchical clustering
Figure 1.5a k-means clustering
Figure 1.5b k-means clustering of subset data
Figure 1.6a Bi-clustering
Figure 1.6b Bi-clustering complete
Figure 1.6c Bi-clustering up regulation of genes
Figure 1.6d Bi-clustering down regulation of genes
Figure 1.7 Parallel co-ordination plot
IV. CONCLUSION
Different gene expression levels in diabetes nephropathy were identified using clustering techniques followed by bi-clustering methods. Clustering is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense. It is a method of unsupervised learning and a common technique for statistical data analysis. The clustering results show that the up-regulated and down-regulated genes are highly overlapped, whereas the bi-clustering ZBDD algorithm gives a clear interpretation of neural network clusters. Reviewing these results provides a comparative study between the two methods [12]. The work implemented in MATLAB software is visualized using parallel co-ordination plots.
ACKNOWLEDGEMENT
The authors express their sincere thanks to the Department of Computer Science and Engineering, Anna University Chennai, and the Department of Bioinformatics, Centre for Bioinformatics Research Institute, Chennai, for providing the necessary facilities to conduct the research work.
REFERENCES
[1] Eisen M, Spellman P, Brown P, Botstein D: Cluster Analysis and Display of Genome-wide Expression. PNAS 1998, 95:14863-14868.
[2] Alon U, Barkai N, Notterman D, Gish K, Ybarra S, Mack D, Levine A: Broad Patterns of Gene Expression Revealed by Clustering Analysis of Tumor and Normal Colon Tissues Probed by Oligonucleotide Arrays. Proc Natl Acad Sci 1999, 96:6745-6750.
[3] McLachlan G, Bean R, Peel D: A mixture model-based approach to the clustering of microarray expression data. Bioinformatics 2002, 18(3):413-422.
[4] Kerr M, Churchill G: Bootstrapping cluster analysis: assessing the reliability of conclusions from microarray experiments. Proceedings of the National Academy of Sciences 2001, 98(16):8961.
[5] Zhang K, Zhao H: Assessing reliability of gene clusters from gene expression data. Funct Integr Genomics 2000, 1:156-173.
[6] Hughes T, Marton M, Jones A, Roberts C, Stoughton R, Armour C, Bennett H, Coffey E, Dai H, He Y, Kidd M, King A, Meyer M, Slade D, Lum P, Stepaniants S, Shoemaker D, Gachotte D, Chakraburtty K, Simon J, Bard M, Friend S: Functional Discovery via a Compendium of Expression Profiles. Cell 2000, 102:109-126.
[7] Levenstien M, Yang Y, Ott J: Statistical significance for hierarchical clustering in genetic association and microarray expression studies. BMC Bioinformatics 2003, 4:62.
[8] Richard S Savage, Katherine Heller, Yang Xu, Zoubin Ghahramani, William M Truman, Murray Grant, Katherine J Denby and David L Wild: R/BHC: fast Bayesian hierarchical clustering for microarray data. Systems Biology Centre, University of Warwick, published 6 August 2009.
[9] Michael B. Eisen, Paul T. Spellman, Patrick O. Brown and David Botstein: Cluster analysis and display of genome-wide expression patterns. Department of Genetics and Department of Biochemistry and Howard Hughes Medical Institute, Stanford University School of Medicine, 300 Pasteur Avenue, December 1998.
[10] Alon U, Barkai N, Notterman DA, Gish K, Ybarra S, Mack D, Levine AJ: Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Department of Molecular Biology, Princeton University, Princeton, NJ 08540, USA, 1999.
[11] Johannes M Freudenberg, Vineet K Joshi, Zhen Hu and Mario Medvedovic: CLEAN: CLustering Enrichment ANalysis. Laboratory for Statistical Genomics and Systems Biology, Department of Environmental Health, University of Cincinnati College of Medicine, 2009.
[12] Marcilio CP de Souto, Ivan G Costa, Daniel SA de Araujo, Teresa B Ludermir and Alexander Schliep: Clustering cancer gene expression data: a comparative study. Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Berlin, Germany, Brazil, 2008.
[13] Fernando A. Beltrán, José R. Beltrán, Nicolas Holzem, Adrian Gogu: Matlab Implementation of Reverberation Algorithms. Department of Electronic Engineering and Communications, University of Zaragoza (Spain), 2009.
[14] Smitha Dharan and Achuthsankar S Nair: Biclustering of gene expression data using reactive greedy randomized adaptive search procedure. Centre for Bioinformatics, University of Kerala, Thiruvananthapuram, Kerala, 695 581, India.
[15] Afolabi Olomola and Sumeet Dua: Biclustering of Gene Expression Data Using Conditional Entropy. Data Mining Research Laboratory (DMRL), Department of Computer Science, Louisiana Tech University, Ruston, LA, U.S.A.; School of Medicine, Louisiana State University Health Sciences, 2009.