BIOINFORMATICS
Lecture 8
Analyzing Microarray Data
Dr. Aladdin Hamwieh
Khalid Al-shamaa
Abdulqader Jighly
Aleppo University
Faculty of technical engineering
Department of Biotechnology
2010-2011
MICROARRAY
A DNA microarray can monitor many genes at once. It is an inert, solid,
flat and transparent surface (e.g. a microscope slide) onto which 20,000 to
60,000 short DNA probes of specified sequences are tethered in an ordered grid.
Each probe corresponds to a particular short section of a gene, so a single
gene is covered by several probes that span different parts of the gene sequence.
REPOSITORIES OF MICROARRAY STUDIES
• Due to the widespread use of microarrays, data repositories have flourished
world-wide. Three of the largest resources for gene expression data are:
1. The Gene Expression Omnibus (GEO)
2. National Center for Biotechnology Information (NCBI), which hosts GEO
3. Stanford Microarray Database (SMD)
And for plants: the Plant Expression Database (PLEXdb)
• DNA microarrays measure RNA abundance with either 1 channel (one color) or
2 channels (two colors).
• Affymetrix GeneChip arrays have 1 channel: each array is hybridized with a
single labelled sample.
• Stanford microarrays have 2 channels: they measure, by competitive
hybridization, the relative expression under a given condition (labelled with
the red fluorescent dye Cy5) compared to its control (labelled with the green
fluorescent dye Cy3).
Biological question (differentially expressed genes, sample class prediction, etc.)
-> Experimental design
-> Microarray experiment
-> 16-bit TIFF files
-> Image analysis
-> (Rfg, Rbg), (Gfg, Gbg)
-> Normalization
-> R, G
-> Estimation, Testing, Clustering, Discrimination
-> Biological verification and interpretation
MICROARRAY EXPERIMENT
1. Isolate mRNA
2. Make labelled cDNA library
3. Apply your labelled cDNA to the slide
4. Scan the slide
5. Clean up the image
6. Extract the data
7. Analyse your data
RESULTS
The colors denote the degree of expression in the experimental versus the
control cells. The color legend covers the following cases:
• Gene not expressed in control or in experimental cells
• Only in control cells
• Mostly in control cells
• Same in both cells
• Mostly in experimental cells
• Only in experimental cells
Let us talk about the analysis and the mathematical problems.
Now we have many images that contain a huge amount of information, so:
1- we have to clean up the image
2- we have to extract our data.
IMAGE ANALYSIS
• The raw data from a cDNA microarray experiment consist of pairs of image
files, 16-bit TIFFs, one for each of the dyes.
• Image analysis is required to extract measures of the red and green
fluorescence intensities for each spot on the array.
STEPS IN IMAGE ANALYSIS
1. Addressing. Estimate location of spot centers.
2. Segmentation. Classify pixels as foreground (signal) or background.
3. Information extraction. For each spot on the array and each dye, compute:
• foreground intensities;
• background intensities;
• quality measures.
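The three steps above can be sketched for a single spot in one dye channel. This is a minimal illustration, not the method of any particular image-analysis package: it assumes the spot center is already known (addressing), segments with a fixed circle of assumed radius, and summarizes each pixel class by its median. The function name `quantify_spot` and the toy image are hypothetical.

```python
from statistics import median

def quantify_spot(image, center, radius):
    """Return (foreground, background) median intensities for one spot.

    image  : 2-D list of pixel intensities for one dye channel
    center : (row, col) of the spot center, i.e. the result of addressing
    radius : assumed spot radius in pixels (fixed-circle segmentation)
    """
    cr, cc = center
    foreground, background = [], []
    # Look only at a square window of half-width 2 * radius around the center.
    for r in range(max(0, cr - 2 * radius), min(len(image), cr + 2 * radius + 1)):
        for c in range(max(0, cc - 2 * radius), min(len(image[0]), cc + 2 * radius + 1)):
            # Segmentation: pixels inside the circle are foreground.
            inside = (r - cr) ** 2 + (c - cc) ** 2 <= radius ** 2
            (foreground if inside else background).append(image[r][c])
    # Information extraction: one summary value per pixel class.
    return median(foreground), median(background)

# Toy 7x7 channel: a bright 3x3 spot (900) on a dim background (100).
toy = [[100] * 7 for _ in range(7)]
for r in range(2, 5):
    for c in range(2, 5):
        toy[r][c] = 900

fg, bg = quantify_spot(toy, center=(3, 3), radius=1)
print(fg, bg)
```

The median is used here because it is less sensitive than the mean to a few stray bright pixels; that choice is an assumption of this sketch.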
WHY DO WE CALCULATE THE BACKGROUND INTENSITIES?
• Motivation behind background adjustment: a spot's measured fluorescence
intensity includes a contribution that is not specifically due to the
hybridization of the target to the probe, but to something else, e.g. the
chemical treatment of the slide or autofluorescence. We want to estimate and
remove this unwanted contribution.
QUANTIFICATION OF EXPRESSION
For each spot on the slide we calculate
Red intensity = Rfg - Rbg
Green intensity = Gfg - Gbg
where fg = foreground and bg = background.
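As a small worked example of the two subtractions, the sketch below uses made-up intensities for one spot; the function name `background_corrected` and the numbers are hypothetical.

```python
def background_corrected(rfg, rbg, gfg, gbg):
    """Return (R, G): foreground minus background for the red and green channels."""
    return rfg - rbg, gfg - gbg

# Hypothetical intensities for one spot (16-bit scanner values).
R, G = background_corrected(rfg=5200, rbg=400, gfg=1500, gbg=300)
print(R, G)  # 4800 1200
```

If the background estimate exceeds the foreground, the corrected intensity is zero or negative and cannot be log-transformed, so such spots need separate handling.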
CDNA GENE EXPRESSION DATA
Data on p genes for n samples (rows: genes; columns: mRNA samples)

Genes   sample1  sample2  sample3  sample4  sample5  …
1          0.46     0.30     0.80     1.51     0.00  ...
2         -0.10     0.49     0.24     0.06     0.46  ...
3          0.15     0.74     0.04     0.10     0.20  ...
4         -0.45    -1.03    -0.79    -0.56    -0.32  ...
5         -0.06     1.06     1.35     1.09    -1.09  ...

Annotations on the slide mark a down-regulated gene, an up-regulated gene and
unchanged expression.
Gene expression level of gene 5 in mRNA sample 4
= log2(Red intensity / Green intensity)
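The log ratio can be illustrated with a few hypothetical background-corrected intensities. The sketch assumes the red channel holds the experimental sample and the green channel the control; the function name `log_ratio` is made up for this example.

```python
import math

def log_ratio(red, green):
    """Return log2(red / green) for background-corrected intensities."""
    if red <= 0 or green <= 0:
        raise ValueError("log ratio needs positive intensities")
    return math.log2(red / green)

print(log_ratio(4800, 1200))  # 2.0  (four-fold higher in red: up-regulated)
print(log_ratio(1000, 1000))  # 0.0  (unchanged)
print(log_ratio(500, 1000))   # -1.0 (two-fold lower in red: down-regulated)
```

Taking log2 makes up- and down-regulation symmetric around 0: a two-fold increase gives +1 and a two-fold decrease gives -1.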
NORMALIZATION
Why?
To correct for systematic differences between samples on the same slide, or
between slides, that do not represent true biological variation between
samples. Sources include:
1. Dye activity
2. Dye quantity
3. Scanning parameters
4. Location on the array
5. Air bubbles
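SELF-SELF HYBRIDIZATIONS
How do we know it is necessary?
• By examining self-self hybridizations: we label a single sample from one
tissue with both dyes, Cy3 and Cy5, and hybridize the two to the same slide.
Any difference between the two channels is then dye bias rather than biology.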
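The slides do not name a specific normalization method. As one simple possibility, the sketch below applies global median normalization, which shifts all log ratios on a slide so that their median is 0. The function name `median_normalize` and the data are hypothetical, and this method only removes a constant dye bias, not intensity- or location-dependent effects.

```python
from statistics import median

def median_normalize(log_ratios):
    """Shift the log ratios of one slide so that their median is 0."""
    shift = median(log_ratios)
    return [m - shift for m in log_ratios]

# Hypothetical self-self hybridization: the true log ratios are all 0,
# but a dye bias has pushed every value up by roughly 0.4.
self_self = [0.35, 0.42, 0.40, 0.47, 0.38]
normalized = median_normalize(self_self)
print(normalized)
```

After the shift the values scatter around 0, which is what a self-self hybridization should show when there is no dye bias.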
HOMOGENEITY AND SEPARATION PRINCIPLES
• Homogeneity: Elements within a cluster are close to each other
• Separation: Elements in different clusters are further apart from each other
…clustering is not an easy task!
Given these points, a clustering algorithm might make two distinct clusters
as follows.
BAD CLUSTERING
This clustering violates both the Homogeneity and Separation principles:
• Close distances between points in separate clusters
• Far distances between points in the same cluster
GOOD CLUSTERING
This clustering satisfies both the Homogeneity and Separation principles.
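The two principles can be checked numerically. The sketch below is one possible way to do so, under the assumption that homogeneity is scored by the largest distance inside any cluster and separation by the smallest distance between points of different clusters. The function name and the points are hypothetical.

```python
import math
from itertools import combinations

def homogeneity_and_separation(clusters):
    """Return (largest within-cluster distance, smallest between-cluster distance)."""
    within = max(
        (math.dist(p, q) for c in clusters for p, q in combinations(c, 2)),
        default=0.0,
    )
    between = min(
        math.dist(p, q)
        for a, b in combinations(clusters, 2)
        for p in a
        for q in b
    )
    return within, between

points_left = [(0, 0), (0, 1), (1, 0)]
points_right = [(5, 5), (5, 6), (6, 5)]

good = [points_left, points_right]
# Hypothetical bad split that mixes the two groups.
bad = [[(0, 0), (5, 5), (1, 0)], [(0, 1), (5, 6), (6, 5)]]

print(homogeneity_and_separation(good))  # within is smaller than between
print(homogeneity_and_separation(bad))   # within is larger than between
```

For the good clustering every within-cluster distance is smaller than every between-cluster distance; for the bad one the relation is reversed.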
CLUSTERING TECHNIQUES
• Agglomerative: Start with every element in its own cluster, and iteratively
join clusters together
• Divisive: Start with one cluster and iteratively divide it into smaller
clusters
• Hierarchical: Organize elements into a tree. Leaves represent genes and the
length of the paths between leaves represents the distances between genes.
Similar genes lie within the same subtrees
HIERARCHICAL CLUSTERING
[Figure: nine elements, labelled 1 to 9, and the tree built from them by
hierarchical clustering]
HIERARCHICAL CLUSTERING ALGORITHM
Hierarchical Clustering (d, n)
    Form n clusters each with one element
    Construct a graph T by assigning one vertex to each cluster
    while there is more than one cluster
        Find the two closest clusters C1 and C2
        Merge C1 and C2 into new cluster C with |C1| + |C2| elements
        Compute distance from C to all other clusters
        Add a new vertex C to T and connect to vertices C1 and C2
        Remove rows and columns of d corresponding to C1 and C2
        Add a row and column to d corresponding to the new cluster C
    return T
The algorithm takes an n x n matrix d of pairwise distances between points
as input.
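A runnable sketch of the pseudocode is given below. It makes two assumptions that the slide leaves open: the distance between two clusters is taken as the smallest pairwise distance between their elements (single linkage), and instead of rewriting rows and columns of d, cluster distances are recomputed from the original matrix. The tree T is returned as nested tuples.

```python
def hierarchical_clustering(d, n):
    """Agglomerative clustering on an n x n distance matrix d.

    Returns the tree T as nested tuples; leaves are the element indices 0..n-1.
    Assumption: cluster distance = smallest pairwise distance (single linkage).
    """
    # Map each subtree to the elements it contains; start with n singletons.
    clusters = {i: [i] for i in range(n)}
    while len(clusters) > 1:
        # Find the two closest clusters C1 and C2.
        keys = list(clusters)
        best = None
        for a in range(len(keys)):
            for b in range(a + 1, len(keys)):
                dist = min(d[i][j]
                           for i in clusters[keys[a]]
                           for j in clusters[keys[b]])
                if best is None or dist < best[0]:
                    best = (dist, keys[a], keys[b])
        _, c1, c2 = best
        # Merge C1 and C2 into a new cluster C, a new vertex joining both subtrees.
        members = clusters.pop(c1) + clusters.pop(c2)
        clusters[(c1, c2)] = members
    (tree,) = clusters  # the single remaining key is the root of T
    return tree

# Hypothetical distances: elements 0 and 1 are close, as are 2 and 3.
d = [
    [0, 1, 6, 7],
    [1, 0, 5, 8],
    [6, 5, 0, 2],
    [7, 8, 2, 0],
]
print(hierarchical_clustering(d, 4))  # ((0, 1), (2, 3))
```

Other linkage rules (largest or average pairwise distance) would only change the line that computes `dist`.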
K-MEANS CLUSTERING PROBLEM: FORMULATION
• Input: A set, V, consisting of n points and a parameter k
• Output: A set X consisting of k points (cluster centers) that minimizes the
squared error distortion d(V,X) over all possible choices of X
1-MEANS CLUSTERING PROBLEM: AN EASY CASE
• Input: A set, V, consisting of n points
• Output: A single point x (cluster center) that minimizes the squared error
distortion d(V,x) over all possible choices of x
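The slides use the squared error distortion without writing it out. Assuming the usual definition (the mean squared Euclidean distance from each point to its nearest center), a short derivation shows what the optimal single center is:

```latex
% Assumed definition: mean squared distance from each point to its nearest center.
\[
d(V, X) = \frac{1}{n} \sum_{i=1}^{n} \min_{x \in X} \lVert v_i - x \rVert^2
\]

% With a single center x, set the gradient with respect to x to zero.
\[
d(V, x) = \frac{1}{n} \sum_{i=1}^{n} \lVert v_i - x \rVert^2,
\qquad
\nabla_x \, d(V, x) = -\frac{2}{n} \sum_{i=1}^{n} (v_i - x) = 0
\;\Longrightarrow\;
x = \frac{1}{n} \sum_{i=1}^{n} v_i
\]
```

Under this assumed definition the optimal single center is the center of gravity (mean) of the points.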
The 1-Means Clustering problem is easy.
However, it becomes very difficult (NP-hard) for more than one center.
An efficient heuristic method for K-Means clustering is the Lloyd algorithm.
K-MEANS CLUSTERING: LLOYD ALGORITHM
[Figure: scatter plot of expression in condition 1 (x-axis, 0 to 5) against
expression in condition 2 (y-axis, 0 to 5), with three cluster centers x1, x2
and x3]
Lloyd Algorithm
    Arbitrarily assign the k cluster centers
    while the cluster centers keep changing
        Assign each data point to the cluster Ci corresponding to the closest
        cluster representative (center) (1 ≤ i ≤ k)
        After the assignment of all data points, compute new cluster
        representatives according to the center of gravity of each cluster,
        that is, the new cluster representative is (∑ v) / |C| over all v in C,
        for every cluster C
*This may lead to merely a locally optimal clustering.
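The Lloyd algorithm can be sketched in a few lines. In this illustration the arbitrary initial centers are passed in by the caller so that the run is reproducible. Keeping the old center when a cluster ends up empty is an assumption of this sketch, and the data points are hypothetical.

```python
import math

def lloyd(points, centers):
    """Lloyd algorithm for k-means; returns (centers, clusters).

    points  : list of coordinate tuples
    centers : list of k initial cluster centers chosen by the caller
    """
    centers = [tuple(c) for c in centers]
    k = len(centers)
    while True:
        # Assign each data point to the cluster of its closest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: math.dist(p, centers[j]))
            clusters[i].append(p)
        # New representative = center of gravity of each cluster
        # (an empty cluster keeps its previous center).
        new_centers = [
            tuple(sum(coord) / len(c) for coord in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
        if new_centers == centers:  # centers stopped changing
            return centers, clusters
        centers = new_centers

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
# Deliberately poor start: both initial centers lie in the lower-left group.
centers, clusters = lloyd(points, centers=[(1, 1), (1, 2)])
print(clusters)
```

With these points the run recovers the two groups even from a poor start, but in general the result depends on the initial centers, which is why the clustering may be only locally optimal.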
THANK YOU