Download Gene Expression Profiles and Microarray Data Analysis - BIDD

LSM3241: Bioinformatics and Biocomputing
Lecture 8: Gene Expression Profiles and
Microarray Data Analysis
Prof. Chen Yu Zong
Tel: 6874-6877
Email: [email protected]
http://xin.cz3.nus.edu.sg
Room 07-24, level 7, SOC1, NUS
Biology and Cells
• All living organisms consist of cells (trillions of
cells in human, yeast has one cell).
• Cells are of many different types (blood, skin,
nerve), but all arose from a single cell (the
fertilized egg)
• Each* cell contains a complete copy of the
genome (the program for making the organism),
encoded in DNA.
2
Gene Expression
• Cells are different because of differential gene
expression.
• About 40% of human genes are expressed at one
time.
• A gene is expressed by transcribing DNA into single-stranded mRNA
• mRNA is later translated into a protein
• Microarrays measure the level of mRNA expression
3
Overview of Molecular Biology
Cell
Nucleus
Chromosome
Protein
cDNA
Gene (mRNA),
single strand
Gene (DNA)
4
Gene Expression
• Genes control cell behavior by
controlling which proteins are
made by a cell
• Housekeeping genes vs.
cell/tissue-specific genes
• Regulation:
• Transcriptional (promoters and
enhancers)
• Post-transcriptional (RNA
splicing, stability, localization, small non-coding RNAs)
5
Gene Expression
Regulation:
• Translational (3’ UTR repressors,
poly-A tail)
• Post-transcriptional (RNA
splicing, stability, localization, small non-coding RNAs)
• Post-translational (protein
modification: carbohydrates,
lipids, phosphorylation,
hydroxylation, methylation,
precursor protein)
6
Gene Expression Measurement
• mRNA expression represents dynamic aspects of
cell
• mRNA expression can be measured by latest
technology
• mRNA is isolated and labeled with a fluorescent
dye
• Labeled mRNA is hybridized to the probes on the array; the level of
hybridization corresponds to the light emitted when the
array is scanned with a laser
7
Traditional Methods
• Northern Blotting
– Single RNA isolated
– Probed with labeled cDNA
• RT-PCR
– Primers amplify specific cDNA transcripts
8
Microarray Technology
• Microarray:
– New Technology (first paper: 1995)
• Allows study of thousands of genes at same time
– Glass slide of DNA molecules
• Molecule: string of bases (25 bp – 500 bp)
• uniquely identifies gene or unit to be studied
9
Gene Expression Microarrays
The main types of gene expression microarrays:
• Short oligonucleotide arrays (Affymetrix)
• cDNA or spotted arrays (Brown/Botstein)
• Long oligonucleotide arrays (Agilent Inkjet)
• Fiber-optic arrays
• ...
10
Fabrications of Microarrays
• Size of a microscope slide
Images: http://www.affymetrix.com/
11
Differing Conditions
• Ultimate Goal:
– Understand expression level of genes under
different conditions
• Helps to:
– Determine genes involved in a disease
– Pathways to a disease
– Used as a screening tool
12
Gene Conditions
• Cell types (brain vs. liver)
• Developmental (fetal vs. adult)
• Response to stimulus
• Gene activity (wild vs. mutant)
• Disease states (healthy vs. diseased)
13
Expressed Genes
• Genes under a given condition
– mRNA extracted from cells
– mRNA labeled
– Labeled mRNA is mRNA present in a given
condition
– Labeled mRNA will hybridize (base pair) with
corresponding sequence on slide
14
Two Different Types of Microarrays
• Custom spotted arrays (up to 20,000 sequences)
– cDNA
– Oligonucleotide
• High-density (up to 100,000 sequences)
synthetic oligonucleotide arrays
– Affymetrix (25 bases)
– SHOW AFFYMETRIX LAYOUT
15
Custom Arrays
• Mostly cDNA arrays
• 2-dye (2-channel)
– RNA from two sources (cDNA created)
• Source 1: labeled with red dye
• Source 2: labeled with green dye
16
Two Channel Microarrays
• Microarrays measure gene expression
• Two different samples:
– Control (green label)
– Sample (red label)
• Both are washed over the microarray
– Hybridization occurs
– Each spot is one of 4 colors
17
Microarray Technology
18
Microarray Image Analysis
• Microarrays detect gene
interactions: 4 colors:
– Green: high control
– Red: high sample
– Yellow: equal
– Black: none
• Problem is to quantify
image signals
19
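The four spot colors on the image-analysis slide can be sketched as a simple rule on the two channel intensities. This is only an illustration: the function name `spot_color` and the `background` and `ratio` cut-offs are assumptions, not values given in the lecture.

```python
def spot_color(red, green, background=100.0, ratio=2.0):
    """Classify a two-channel spot from its red (sample) and green (control) intensities.

    background and ratio are hypothetical cut-offs chosen for illustration only.
    """
    if red < background and green < background:
        return "black"   # no signal in either channel
    if red >= ratio * green:
        return "red"     # higher in the sample
    if green >= ratio * red:
        return "green"   # higher in the control
    return "yellow"      # roughly equal in both


# Made-up (red, green) intensities for four spots
spots = {"g1": (5000, 400), "g2": (300, 4200), "g3": (2500, 2300), "g4": (20, 35)}
colors = {gene: spot_color(r, g) for gene, (r, g) in spots.items()}
print(colors)
```

Real image analysis also has to locate the spots and subtract local background before any such rule applies, which is the quantification problem the slide refers to.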
Single Color Microarrays
• Prefabricated
– Affymetrix (25mers)
• Custom
– cDNA (500 bases or so)
– Spotted oligos (70-80
bases)
20
Microarray Animations
• Davidson College:
• http://www.bio.davidson.edu/courses/genomics/chip/chip.html
• Imagecyte:
• http://www.imagecyte.com/array2.html
21
Basic idea of Microarray
• Construction
– Place array of probes on microchip
• Probe (for example) is oligonucleotide ~25 bases long
that characterizes gene or genome
• Each probe has many, many clones
• Chip is about 2cm by 2cm
• Application principle
– Put (liquid) sample containing genes on microarray
and allow probe and gene sequences to hybridize
and wash away the rest
– Analyze hybridization pattern
22
Operation Principle:
Samples are tagged with fluorescent material to show the pattern of
sample-probe interaction (hybridization)
Microarray analysis
A microarray may have 60K probes
Microarray Processing sequence
24
Gene Expression Data
Gene expression data on p genes for n samples

                 mRNA samples
Genes   sample1  sample2  sample3  sample4  sample5  …
1          0.46     0.30     0.80     1.51     0.90  ...
2         -0.10     0.49     0.24     0.06     0.46  ...
3          0.15     0.74     0.04     0.10     0.20  ...
4         -0.45    -1.03    -0.79    -0.56    -0.32  ...
5         -0.06     1.06     1.35     1.09    -1.09  ...

Gene expression level of gene i in mRNA sample j =
  Log(Red intensity / Green intensity) for two-channel arrays, or
  Log(Avg. PM - Avg. MM) for Affymetrix arrays (PM: perfect match, MM: mismatch)
25
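The two expression measures on the gene expression data slide can be sketched as follows. The slide does not state the base of the logarithm; base 2 is assumed here, and the function names and intensity values are made up for illustration.

```python
import math


def two_channel_log_ratio(red, green):
    """log2(red / green): positive when the red-labeled sample is higher."""
    return math.log2(red / green)


def pm_mm_log_signal(pm, mm):
    """log2(average PM - average MM) for one probe set (base 2 is an assumption)."""
    diff = sum(pm) / len(pm) - sum(mm) / len(mm)
    if diff <= 0:
        raise ValueError("average PM must exceed average MM to take a logarithm")
    return math.log2(diff)


print(two_channel_log_ratio(800, 200))   # 2.0: four-fold higher in the red channel
print(two_channel_log_ratio(100, 400))   # -2.0: four-fold higher in the green channel
print(round(pm_mm_log_signal([1200, 1100, 1300], [200, 150, 250]), 2))
```

Taking the logarithm makes up- and down-regulation symmetric around zero, which is why the table entries are both positive and negative.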
Some possible applications
• Sample from specific organ to show which genes
are expressed and responsible for a functionality
• Compare samples from healthy and sick host to
find gene-disease connection
• Analyze samples to differentiate sick and healthy,
disease subtypes, drug response groups
• Probe samples, including human pathogens, for
disease detection
26
Huge amount of data from single microarray
• If just two color, then amount of data on
array with N probes is 2N
• Cannot analyze pixel by pixel
• Analyze by pattern – cluster analysis
27
Major Data Mining Techniques
• Link Analysis
– Associations Discovery
– Sequential Pattern Discovery
– Similar Time Series Discovery
• Predictive Modeling
– Classification (assigns genes into known classes)
– Clustering (groups genes into unknown clusters)
28
Supervised vs. Unsupervised Learning
• Supervised: there is a teacher, class labels
are known
• Support vector machines
• Backpropagation neural networks
• Unsupervised: No teacher, class labels are
unknown
• Clustering
• Self-organizing maps
29
Cluster Analysis:
Grouping Similarly Expressed Genes,
Cell Samples, or Both
• Strengthens signal when averages are taken
within clusters of genes (Eisen)
• Useful (essential?) when seeking new
subclasses of cells, diseases, drug
responses etc.
• Leads to readily interpreted figures
30
Some clustering methods and software
• Partitioning: K-Means, K-Medoids, PAM,
CLARA …
• Hierarchical: Cluster, HAC, BIRCH, CURE,
ROCK
• Density-based: CAST, DBSCAN, OPTICS,
CLIQUE …
• Grid-based: STING, CLIQUE, WaveCluster …
• Model-based: SOM (self-organizing map),
COBWEB, CLASSIT, AutoClass …
• Two-way Clustering
• Block clustering
31
Partitioning
32
Density-based clustering
33
Hierarchical (used most often)
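[Figure: agglomerative clustering (steps 0 to 4): a and b merge into {a,b}; d and e merge into {d,e}; c joins {d,e} to form {c,d,e}; finally {a,b} and {c,d,e} merge into {a,b,c,d,e}. Divisive clustering traverses the same hierarchy in the opposite direction, with its steps 0 to 4 corresponding to agglomerative steps 4 to 0.]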
34
Expression Vectors
Gene Expression Vectors encapsulate the
expression of a gene over a set of
experimental conditions or sample types.
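Numeric Vector: [-0.8, 1.5, 1.8, 0.5, -0.4, -1.3, 0.8, 1.5]
Line Graph: [Figure: the same eight values plotted in order, vertical axis from -2 to 2]
Heat map: [Figure: the same values shown as colored cells on a scale from -2 to 2]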
36
Expression Vectors As Points in ‘Expression Space’
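       t1     t2     t3
G1   -0.8   -0.3   -0.7
G2   -0.4   -0.8   -0.7
G3   -0.6   -0.8   -0.4
G4    0.9    1.2    1.3
G5    1.3    0.9   -0.6
[Figure: the genes plotted as points in a space with axes Experiment 1, Experiment 2, and Experiment 3; points that lie close together have similar expression]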
37
Cluster Analysis
• Group a collection of objects into subsets
or “clusters” such that objects within a
cluster are more closely related to one another
than to objects assigned to different clusters.
38
How can we do this?
• What is closely related?
• Distance or similarity metric
• What is close?
• Clustering algorithm
• How do we minimize distance between objects in a
group while maximizing distances between groups?
39
Distance Metrics
[Figure: two points, (3.5, 4) and (5.5, 6), plotted on axes Gene Expression 1 and Gene Expression 2]
• Euclidean distance
measures the straight-line
distance between two points
• Manhattan (city block) distance
sums the absolute differences in
each dimension
• Correlation measures
difference with
respect to linear
trends
40
Clustering Time Series Data
• Measure gene expression
on consecutive days
• Gene Measurement
matrix
• G1= [1.2 4.0 5.0 1.0]
• G2= [2.0 2.5 5.5 6.0]
• G3= [4.5 3.0 2.5 1.0]
• G4= [3.5 1.5 1.2 1.5]
41
Euclidean Distance
        G1     G2     G3     G4
G1       0    5.3    4.3    5.1
G2     5.3      0    6.4    6.5
G3     4.3    6.4      0    2.3
G4     5.1    6.5    2.3      0
• Distance is the square root of the sum of the
squared differences between coordinates
$$d_{ij} = \sqrt{(x_{i1} - x_{j1})^2 + (x_{i2} - x_{j2})^2 + \cdots + (x_{in} - x_{jn})^2}$$
$$d_{12} = \sqrt{(1.2 - 2)^2 + (4 - 2.5)^2 + (5 - 5.5)^2 + (1 - 6)^2} \approx 5.3$$
42
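As a check on the Euclidean distance slide, the matrix can be recomputed from the four time series G1 to G4 given earlier. This is a minimal sketch; the helper name `euclidean` and the dictionary layout are illustrative choices.

```python
import math

# Expression time series from the "Clustering Time Series Data" slide
profiles = {
    "G1": [1.2, 4.0, 5.0, 1.0],
    "G2": [2.0, 2.5, 5.5, 6.0],
    "G3": [4.5, 3.0, 2.5, 1.0],
    "G4": [3.5, 1.5, 1.2, 1.5],
}


def euclidean(x, y):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


names = list(profiles)
for i in names:
    row = [round(euclidean(profiles[i], profiles[j]), 1) for j in names]
    print(i, row)
```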
City Block or Manhattan Distance
• G1= [1.2 4.0 5.0 1.0]
• G2= [2.0 2.5 5.5 6.0]
• G3= [4.5 3.0 2.5 1.0]
• G4= [3.5 1.5 1.2 1.5]
        G1     G2     G3     G4
G1       0    7.8    6.8    9.1
G2     7.8      0     11   11.3
G3     6.8     11      0    4.3
G4     9.1   11.3    4.3      0
• Distance is the sum of the absolute differences
between coordinates
$$d_{ij} = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{in} - x_{jn}|$$
$$d_{12} = |1.2 - 2| + |4 - 2.5| + |5 - 5.5| + |1 - 6| = 7.8$$
43
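The city block distances can be verified in the same way. A short sketch, with `manhattan` as an illustrative helper name:

```python
# Expression time series G1..G4 from the lecture
G1 = [1.2, 4.0, 5.0, 1.0]
G2 = [2.0, 2.5, 5.5, 6.0]
G3 = [4.5, 3.0, 2.5, 1.0]
G4 = [3.5, 1.5, 1.2, 1.5]


def manhattan(x, y):
    """Sum of absolute coordinate differences (city block distance)."""
    return sum(abs(a - b) for a, b in zip(x, y))


genes = [G1, G2, G3, G4]
matrix = [[round(manhattan(x, y), 1) for y in genes] for x in genes]
for row in matrix:
    print(row)
```

Compared with the Euclidean matrix, the ranking of pairs is similar here, but Manhattan distance gives less weight to a single large coordinate difference because differences are not squared.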
Correlation Distance
• Pearson correlation
measures the degree of
linear relationship
between variables, [-1,1]
• Distance is 1-(pearson
correlation), range of [0,2]
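        G1     G2     G3     G4
G1       0    .91    .98    1.6
G2     .91      0    1.9    1.7
G3     .98    1.9      0    .22
G4     1.6    1.7    .22      0
$$d_{ij} = 1 - \rho_{ij} = 1 - \frac{\sum_{n=1}^{N} x_{in} x_{jn} - \frac{1}{N} \sum_{n=1}^{N} x_{in} \sum_{n=1}^{N} x_{jn}}{\sqrt{\left[\sum_{n=1}^{N} x_{in}^2 - \frac{1}{N}\left(\sum_{n=1}^{N} x_{in}\right)^2\right] \left[\sum_{n=1}^{N} x_{jn}^2 - \frac{1}{N}\left(\sum_{n=1}^{N} x_{jn}\right)^2\right]}}$$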
44
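The correlation distance can be sketched directly from sums of products, as in the formula on the correlation distance slide. The function names are illustrative; the computed values agree with the slide's matrix up to rounding or truncation in the last digit.

```python
import math


def pearson_from_sums(x, y):
    """Pearson correlation computed from raw sums (the computational form)."""
    n = len(x)
    sx, sy = sum(x), sum(y)
    sxy = sum(a * b for a, b in zip(x, y))
    sxx = sum(a * a for a in x)
    syy = sum(b * b for b in y)
    numerator = sxy - sx * sy / n
    denominator = math.sqrt((sxx - sx * sx / n) * (syy - sy * sy / n))
    return numerator / denominator


def correlation_distance(x, y):
    """1 - Pearson correlation, so the result lies in [0, 2]."""
    return 1 - pearson_from_sums(x, y)


G1 = [1.2, 4.0, 5.0, 1.0]
G2 = [2.0, 2.5, 5.5, 6.0]
G3 = [4.5, 3.0, 2.5, 1.0]
G4 = [3.5, 1.5, 1.2, 1.5]

print(round(correlation_distance(G1, G2), 2))
print(round(correlation_distance(G3, G4), 2))
print(round(correlation_distance(G2, G3), 2))
```

G3 and G4 both decline over time, so their correlation distance is small even though their absolute levels differ, while G2 rises as G3 falls, giving a distance close to 2.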
Similarity Measurements
• Pearson Correlation
Two profiles (vectors) $\vec{x} = (x_1, \ldots, x_N)^T$ and $\vec{y} = (y_1, \ldots, y_N)^T$
$$C_{pearson}(\vec{x}, \vec{y}) = \frac{\sum_{i=1}^{N} (x_i - m_x)(y_i - m_y)}{\sqrt{\left[\sum_{i=1}^{N} (x_i - m_x)^2\right] \left[\sum_{i=1}^{N} (y_i - m_y)^2\right]}}$$
$$m_x = \frac{1}{N} \sum_{n=1}^{N} x_n, \qquad m_y = \frac{1}{N} \sum_{n=1}^{N} y_n$$
$$+1 \ge \text{Pearson Correlation} \ge -1$$
45
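The mean-centered form of the Pearson correlation makes its bounds easy to see: a profile that is a positive linear rescaling of another scores +1, and a sign-flipped one scores -1. A minimal sketch, with made-up rescaling factors:

```python
import math


def pearson(x, y):
    """Pearson correlation using deviations from the means m_x and m_y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    numerator = sum((a - mx) * (b - my) for a, b in zip(x, y))
    denominator = math.sqrt(
        sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)
    )
    return numerator / denominator


x = [1.2, 4.0, 5.0, 1.0]
scaled = [2 * v + 1 for v in x]    # same shape, different magnitude and offset
flipped = [-3 * v for v in x]      # mirrored shape

print(pearson(x, scaled))    # close to +1
print(pearson(x, flipped))   # close to -1
```

This is why correlation is often preferred when the shape of an expression profile matters more than its absolute level.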
Hierarchical Clustering
• IDEA: Iteratively combines genes into groups
based on similar patterns of observed expression
• By combining genes with genes OR genes with
groups, the algorithm produces a dendrogram of the
hierarchy of relationships.
• Display the data as a heat map and dendrogram
• Cluster genes, samples or both
46
(HCL-1)
Hierarchical Clustering
Dendrogram
Venn Diagram of
Clustered Data
47
Hierarchical clustering
• Merging (agglomerative): start with every
measurement as a separate cluster then
combine
• Splitting: make one large cluster, then split up
into smaller pieces
• What is the distance between two clusters?
48
Distance between clusters
• Single-link: distance is the shortest distance
from any member of one cluster to any member
of the other cluster
• Complete link: distance is the longest distance
from any member of one cluster to any member
of the other cluster
• Average: distance is the average of the distances
between all pairs of points, one from each cluster
• Ward: merges the two clusters that give the smallest
increase in the within-cluster sum of squares
49
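The first three linkage rules can be compared on the four-item distance matrix (A, B, C, D) used in the single-linkage walkthrough later in the lecture. This sketch takes average linkage to mean the average over all cross-cluster pairs, which is an assumption about the intended definition; the function names are illustrative.

```python
from itertools import product

# Pairwise distances between the four items of the lecture's worked example
PAIR = {
    frozenset("AB"): 20, frozenset("AC"): 7, frozenset("AD"): 2,
    frozenset("BC"): 10, frozenset("BD"): 25, frozenset("CD"): 3,
}


def d(a, b):
    return PAIR[frozenset((a, b))]


def single_link(c1, c2):
    """Shortest distance between any member of c1 and any member of c2."""
    return min(d(a, b) for a, b in product(c1, c2))


def complete_link(c1, c2):
    """Longest distance between any member of c1 and any member of c2."""
    return max(d(a, b) for a, b in product(c1, c2))


def average_link(c1, c2):
    """Mean of all cross-cluster pairwise distances (assumed definition)."""
    pairs = [d(a, b) for a, b in product(c1, c2)]
    return sum(pairs) / len(pairs)


left, right = ["A", "D"], ["B", "C"]
print(single_link(left, right), complete_link(left, right), average_link(left, right))
```

The same two clusters are 3, 25, or 13.75 apart depending on the rule, which is why the choice of linkage can change the resulting dendrogram.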
Hierarchical Clustering-Merging
• Euclidean distance
• Average linking
[Figure: dendrogram of the gene expression time series; the vertical axis shows the distance between clusters when combined]
50
Manhattan Distance
• Average linking
[Figure: dendrogram of the gene expression time series; the vertical axis shows the distance between clusters when combined]
51
Correlation Distance
52
Data Standardization
• Data points are normalized with respect to mean
and variance, “sphering” the data
$$x \leftarrow \frac{x - \hat{\mu}}{\hat{\sigma}}$$
• After sphering, Euclidean and correlation
distance are equivalent
• Standardization makes sense if you are not
interested in the size of the effects, but in the
effect itself
• Results are misleading for noisy data
53
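The equivalence claimed on the standardization slide can be checked numerically. For two profiles standardized with the population standard deviation (dividing by N, an assumption made here), the squared Euclidean distance equals 2N(1 - r), where r is their Pearson correlation, so both distances rank pairs identically.

```python
import math


def standardize(x):
    """Subtract the mean and divide by the population standard deviation."""
    n = len(x)
    mean = sum(x) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in x) / n)
    return [(v - mean) / sd for v in x]


g3 = [4.5, 3.0, 2.5, 1.0]
g4 = [3.5, 1.5, 1.2, 1.5]
z3, z4 = standardize(g3), standardize(g4)

n = len(g3)
# For standardized data, the Pearson correlation is the mean product of z-scores
r = sum(a * b for a, b in zip(z3, z4)) / n
squared_euclidean = sum((a - b) ** 2 for a, b in zip(z3, z4))

print(round(squared_euclidean, 6), round(2 * n * (1 - r), 6))
```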
Hierarchical Clustering
Initial Data Items: A, B, C, D
Distance Matrix
Dist     A     B     C     D
A             20     7     2
B                   10    25
C                          3
D
54
Hierarchical Clustering
Single Linkage
Current Clusters: A and D joined at distance 2; B; C
Distance Matrix
Dist     A     B     C     D
A             20     7     2
B                   10    25
C                          3
D
56
Hierarchical Clustering
Single Linkage
Current Clusters: AD; B; C
Distance Matrix
Dist    AD     B     C
AD            20     3
B                   10
C
57
Hierarchical Clustering
Single Linkage
Current Clusters: AD and C joined at distance 3; B
Distance Matrix
Dist    AD     B     C
AD            20     3
B                   10
C
59
Hierarchical Clustering
Single Linkage
Current Clusters: ADC; B
Distance Matrix
Dist    ADC     B
ADC            10
B
60
Hierarchical Clustering
Single Linkage
Current Clusters: ADC and B joined at distance 10
Distance Matrix
Dist    ADC     B
ADC            10
B
62
Hierarchical Clustering
Single Linkage
Final Result: dendrogram with leaf order A, D, C, B
Distance Matrix
Dist    ADCB
ADCB
63
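The single-linkage walkthrough on the preceding slides can be reproduced with a few lines of code. This is a minimal sketch rather than an efficient implementation: it rescans all cluster pairs at every step, and the function name `single_linkage` is an illustrative choice. Merged clusters are labeled with their members in alphabetical order, so the slides' "ADC" appears as "ACD".

```python
# Pairwise distances from the initial distance matrix (items A, B, C, D)
DIST = {
    frozenset("AB"): 20, frozenset("AC"): 7, frozenset("AD"): 2,
    frozenset("BC"): 10, frozenset("BD"): 25, frozenset("CD"): 3,
}


def single_linkage(dist):
    """Return the merge history as a list of (merged_label, distance)."""
    clusters = {frozenset([item]) for pair in dist for item in pair}

    def link(c1, c2):
        # Single link: shortest distance between any two members
        return min(dist[frozenset((a, b))] for a in c1 for b in c2)

    merges = []
    while len(clusters) > 1:
        # Pick the closest pair of current clusters (each unordered pair once)
        c1, c2 = min(
            ((p, q) for p in clusters for q in clusters if sorted(p) < sorted(q)),
            key=lambda pq: link(*pq),
        )
        merges.append(("".join(sorted(c1 | c2)), link(c1, c2)))
        clusters = (clusters - {c1, c2}) | {c1 | c2}
    return merges


print(single_linkage(DIST))
```

The merge distances 2, 3, and 10 are the heights at which the branches join in the final dendrogram.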
Hierarchical Clustering
Gene 1
Gene 2
Gene 3
Gene 4
Gene 5
Gene 6
Gene 7
Gene 8
64
Hierarchical Clustering
Gene 1
Gene 2
Gene 4
Gene 5
Gene 3
Gene 8
Gene 6
Gene 7
66
Hierarchical Clustering
Gene 7
Gene 1
Gene 2
Gene 4
Gene 5
Gene 3
Gene 8
Gene 6
67
Hierarchical Clustering
H
L
72
Hierarchical Clustering
[Figure: clustered heat map with axes labeled Genes and Samples]
The Leaf Ordering Problem:
• Find ‘optimal’ layout of branches for a given dendrogram
architecture
• 2^(N-1) possible orderings of the branches
• For a small microarray dataset of 500 genes, there are
1.6 × 10^150 branch configurations
73
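The count on the leaf ordering slide follows because a dendrogram with N leaves has N - 1 internal nodes, and the two branches under each node can be swapped independently. A quick check of the 500-gene figure, using an illustrative helper name:

```python
def leaf_orderings(n_leaves):
    """Number of branch layouts: each of the n - 1 internal nodes can be flipped."""
    return 2 ** (n_leaves - 1)


print(leaf_orderings(4))              # 8 layouts for a four-leaf dendrogram
print(f"{leaf_orderings(500):.1e}")   # order of magnitude for 500 genes
```

With this many layouts, exhaustive search is impossible, so leaf ordering needs a heuristic or a dedicated optimization algorithm.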
Hierarchical Clustering
The Leaf Ordering Problem:
74
Hierarchical Clustering
• Pros:
– Commonly used algorithm
– Simple and quick to calculate
• Cons:
– Real genes probably do not have a
hierarchical organization
75
Using Hierarchical Clustering
1. Choose what samples and genes to use in your
analysis
2. Choose similarity/distance metric
3. Choose clustering direction
4. Choose linkage method
5. Calculate the dendrogram
6. Choose height/number of clusters for interpretation
7. Assess results
8. Interpret cluster structure
76
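Step 6 of the workflow above, choosing a height, amounts to applying only the merges that happen at or below that height. The sketch below cuts the A, B, C, D single-linkage tree from the earlier walkthrough; the merge-list format and the function name `cut_tree` are assumptions made for illustration.

```python
def cut_tree(items, merges, height):
    """Group items by applying only the merges at or below the given height.

    merges is a list of (left_members, right_members, distance) tuples.
    """
    parent = {item: item for item in items}

    def find(x):
        # Follow parent links up to the representative of x's group
        while parent[x] != x:
            x = parent[x]
        return x

    for left, right, distance in merges:
        if distance <= height:
            parent[find(left[0])] = find(right[0])

    groups = {}
    for item in items:
        groups.setdefault(find(item), []).append(item)
    return sorted(sorted(g) for g in groups.values())


items = ["A", "B", "C", "D"]
merges = [(["A"], ["D"], 2), (["A", "D"], ["C"], 3), (["A", "D", "C"], ["B"], 10)]

print(cut_tree(items, merges, height=1))    # every item on its own
print(cut_tree(items, merges, height=5))    # B is left out of the main cluster
print(cut_tree(items, merges, height=10))   # a single cluster
```

Different heights give different numbers of clusters from the same dendrogram, so the choice should be checked against the assessment and interpretation steps that follow.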
Limitations
• Cluster analyses:
– Usually outside the normal framework of statistical
inference
– Less appropriate when only a few genes are likely to
change
– Needs lots of experiments
• Single gene tests:
– May be too noisy in general to show much
– May not reveal coordinated effects of positively
correlated genes.
– Hard to relate to pathways
77
Useful Links
• Affymetrix
www.affymetrix.com
• Michael Eisen Lab at LBL (hierarchical clustering software “Cluster” and
“Tree View” (Windows)) rana.lbl.gov/
• Review of Currently Available Microarray Software
www.the-scientist.com/yr2001/apr/profile1_010430.html
• ArrayExpress at the EBI http://www.ebi.ac.uk/arrayexpress/
• Stanford MicroArray Database http://genome-www5.stanford.edu/
• Yale Microarray Database http://info.med.yale.edu/microarray/
• Microarray DB www.biologie.ens.fr/en/genetiqu/puces/bddeng.html
78