Download Introduction of Microarray - genomics-lab

Microarray
Introduction
Content
• Biology background of microarray
• Design of microarray
• The workflow of microarray
• Image analysis of microarray
• Data analysis of microarray
• Discussion
The Biology Background of Microarray
• The central dogma of life forms
• DNA
• RNA
• Monitoring the expression of genes
Central Dogma
• DNA Replication
--ACGCGA---TGCGCT--
• RNA Transcription
--UGCGCU--
• Protein Translation
--CYSALA--
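The three steps on this slide can be traced in a few lines of Python. This is a minimal sketch, assuming the slide's `ACGCGA` is the template strand and `TGCGCT` its complement; the function names are illustrative, and the codon table holds only the two codons the example needs (a real table has 64).

```python
# Toy walk-through of the slide's example sequences.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

# Only the two codons used by the slide's example (assumption: a full
# genetic-code table would replace this in real use).
CODON_TABLE = {"UGC": "Cys", "GCU": "Ala"}


def replicate(strand):
    """Return the base-by-base complement of a DNA strand."""
    return "".join(COMPLEMENT[base] for base in strand)


def transcribe(coding_strand):
    """The mRNA carries the coding-strand sequence with U in place of T."""
    return coding_strand.replace("T", "U")


def translate(mrna):
    """Read the mRNA three bases at a time and look up each codon."""
    usable = len(mrna) - len(mrna) % 3
    codons = [mrna[i:i + 3] for i in range(0, usable, 3)]
    return "-".join(CODON_TABLE[codon] for codon in codons)


template = "ACGCGA"
coding = replicate(template)   # TGCGCT
mrna = transcribe(coding)      # UGCGCU
protein = translate(mrna)      # Cys-Ala
print(coding, mrna, protein)
```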
[Diagram: DNA (replication) → transcription → RNA → translation → Protein]
DNA
• The double helix
– stable
• Nucleotide
– A, T, G, C
• Base pair
– A–T
– G–C
• Oligonucleotide
– short DNA (tens of
nucleotides, or bps)
DNA Strand
• DNA has canonical orientation
– read from 5’ to 3’
– antiparallel: one strand has direction opposite
to its complement’s
5’ … TACTGAA … 3’
3’ … ATGACTT … 5’
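Because the two strands are antiparallel, the partner of a strand written 5’ to 3’ is its reverse complement. A minimal sketch using the slide's sequence (the function name is illustrative):

```python
def reverse_complement(seq):
    """Return the complementary strand, also written 5' to 3'."""
    pairs = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(pairs[base] for base in reversed(seq))


top = "TACTGAA"                       # 5' ... 3'
bottom = reverse_complement(top)      # complementary strand, 5' ... 3'
print(bottom)                         # TTCAGTA
print(bottom[::-1])                   # ATGACTT, the same strand read 3' to 5'
```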
Hydrogen Bonds Make DNA Binding Specific
[Figure: hydrogen bonds between two antiparallel strands, each labeled 5’ and 3’]
• The force between the bases of a pair is the hydrogen bond. This force lets A pair specifically with T (or U), and C with G.
RNA
• Types
– messenger RNA (mRNA)
– ribosomal RNA (rRNA)
– transfer RNA (tRNA)
A gene is expressed by transcribing DNA into single-stranded mRNA
RNA (Detailed)
(http://www.nhgri.nih.gov/)
Reverse Transcription
Reverse transcriptase converts RNA into complementary DNA (cDNA).
The Southern Blot
• A basic DNA detection technique, used for over 30 years, known as the Southern blot:
– A “known” strand of DNA is deposited on a solid support (e.g. nitrocellulose paper)
– An “unknown” mixed bag of DNA is labeled (radioactive or fluorescent)
– The “unknown” DNA solution is allowed to mix with the known DNA (attached to the nitrocellulose paper), then excess solution is washed off
– If a copy of the “known” DNA occurs in the “unknown” sample, it will stick (hybridize), and the labeled DNA will be detected on photographic film
mRNA Represents Gene Function
• When we measure the level of an mRNA, we are monitoring the activity of a gene.
• Thus, if we can measure the levels of all mRNAs, we can study the expression of the whole genome.
• A microarray yields the equivalent of over 10,000 blotting results in a single experiment, which makes monitoring genome-wide activity possible.
Design of Microarray
• Microarray in different context
• The idea of microarray
• Main types of array chips
mRNA Levels Compared in Many Different Contexts
• Different tissues, same organism (brain v. liver)
• Same tissue, same organism (tumor v. non-tumor)
• Same tissue, different organisms (wt v. mutant)
• Time course experiments (development)
• Other special designs (e.g. to detect spatial patterns).
Idea of Microarray
[Figure labels: Cell A; Cell B; labeled cDNA from gene X; hybridization to chip; spot of gene X with sequence complementary to the colored cDNA]
This spot shows red after scanning.
Over 10,000 Hybridizations Can Be Done at One Time
Several Types of Arrays
• Spotted DNA arrays
– Developed by Pat Brown’s lab at Stanford
– PCR products of full-length genes (>100nt)
• Affymetrix gene chips
– Photolithography technology from the computer industry allows building many 25-mers
• Ink-jet microarrays from Agilent
– 25- to 60-mers printed directly on glass slides
– Flexible, rapid, but expensive
Array Fabrication Spotting
• Use PCR to amplify DNA
• Robotic "pen" deposits DNA at defined
coordinates
• approximately 1-10 ng per spot
• Experimentation with oligos (40, 70 bp)
This machine can make 48 microarrays
simultaneously.
Array Fabrication Photo-lithography
• Light activated synthesis
• synthesize oligonucleotides on glass slides
• 10^7 copies per oligo in a 24 x 24 µm square
• Use 20 pairs of different 25-mers per gene
• Perfect match and mismatch
Array Fabrication
Photolithography
Affymetrix Microarrays
[Figure: raw image; scale labels 1.28 cm and 50 µm]
~10^7 oligonucleotides: half perfectly match the mRNA (PM), half have one mismatch (MM)
Raw gene expression is the intensity difference: PM - MM
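With several PM/MM probe pairs per gene, the PM - MM differences have to be combined into one number. The sketch below simply averages them, which is one common summary and an assumption here, not necessarily the vendor's algorithm. The intensities are made-up illustration values.

```python
def average_difference(pm, mm):
    """Mean of PM - MM over the probe pairs of one gene (illustrative summary)."""
    if len(pm) != len(mm) or not pm:
        raise ValueError("need the same non-zero number of PM and MM probes")
    diffs = [p - m for p, m in zip(pm, mm)]
    return sum(diffs) / len(diffs)


# Hypothetical intensities for four probe pairs of one gene.
pm = [1200.0, 950.0, 1430.0, 800.0]
mm = [300.0, 410.0, 380.0, 350.0]
signal = average_difference(pm, mm)
print(signal)  # 735.0
```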
Agilent cDNA Microarray and Oligonucleotide Microarray
• Agilent delivers printed 60-mer microarrays in addition to 25-mer formats.
• The inkjet process uses standard phosphoramidite chemistry to deliver extremely small volumes (picoliters) of the chemicals to be spotted.
The Workflow of Microarray
[Flowchart: sample → RNA extraction → cDNA synthesis and labeling → labeled cDNA; plate → plate preparation → array fabrication → array; labeled cDNA + array → hybridization → hybridized array → scanning]
cDNA Synthesis and Direct Labeling
Cyanine [Cy3 and Cy5] cDNA
Hybridization onto the Chip
e.g. treatment / control, normal / tumor tissue
Sample loading
1. Load from the corner of the cover slip.
It is time-consuming and easily produces bubbles.
2. Load the sample at the center of the array, then lower the slip smoothly.
Faster, with a lower chance of producing bubbles than method 1.
3. Load the sample at the side of the array, then put the slip on.
The solution attaches to the slip as soon as the slip touches it, and spreads with the slip as it is slowly lowered.
Scan
Green: down-regulated
Red: up-regulated
Yellow: equal level
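The color call can be expressed as a rule on the red/green intensity ratio. This is a sketch under two assumptions not stated on the slide: the sample of interest is in the red channel, and a spot counts as changed at a 2-fold cutoff (the `fold` parameter is hypothetical).

```python
import math


def spot_color(red, green, fold=2.0):
    """Classify a spot from its red and green intensities (assumed cutoff)."""
    log_ratio = math.log2(red / green)
    cutoff = math.log2(fold)
    if log_ratio >= cutoff:
        return "red"      # up-regulated
    if log_ratio <= -cutoff:
        return "green"    # down-regulated
    return "yellow"       # roughly equal level


print(spot_color(4000, 1000), spot_color(1000, 4000), spot_color(1000, 1100))
```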
Image analysis
• To find a spot
• Convert feature into numeric data
• Image normalization
The Algorithms
1. Find spots: find the location of each spot on the microarray.
2. Cookie cutter algorithm:
(1) Assume the pixel intensities of a spot follow a Gaussian distribution
(2) Use the SD or IQR to identify the feature and background of each spot
(3) Calculate statistics for the pixel population
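Steps (2) and (3) can be sketched as follows. The sketch assumes the IQR variant with rejection boundaries 1.42 x IQR beyond the quartiles, which is one reading of the "1.42 IQR" label in the slide's figure. The function names and pixel values are illustrative.

```python
import statistics


def reject_outliers_iqr(pixels, k=1.42):
    """Keep pixels within k * IQR of the quartiles (k is an assumed setting)."""
    q1, _median, q3 = statistics.quantiles(pixels, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [p for p in pixels if low <= p <= high]


def spot_statistics(feature_pixels, background_pixels):
    """Summarize one spot after outlier rejection in feature and background."""
    feature = reject_outliers_iqr(feature_pixels)
    background = reject_outliers_iqr(background_pixels)
    feature_mean = statistics.mean(feature)
    background_median = statistics.median(background)
    return {
        "feature_mean": feature_mean,
        "feature_sd": statistics.stdev(feature),
        "background_median": background_median,
        "net_signal": feature_mean - background_median,
    }


# Hypothetical pixel intensities; 5000 and 400 mimic dust or scratches.
feature_pixels = [980, 1010, 1000, 995, 1020, 990, 1005, 5000]
background_pixels = [100, 105, 98, 102, 101, 99, 400, 103]
stats = spot_statistics(feature_pixels, background_pixels)
print(stats)
```

Using the median for the background and the mean for the feature is a design choice of this sketch, not something the slide prescribes.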
Interquartile Range (IQR)
[Figure: pixel intensity distribution. Labels: 25%, 50%, 75%; IQR; K = IQR/2; 1.42 IQR; boundary for rejection on each side]
[Figure: spot layout. Labels: feature or cookie, diameter D; exclusion zone; local background]
Data Quality
• Irregular size or shape
• Irregular placement
• Low intensity
• Saturation
• Spot variance
• Background variance
[Figure: example spots labeled indistinguishable, saturated, bad print, misalignment, artifact]
Convert Feature Into Numeric Value
Columns: spot; green (Ctrl) intensity "D x A - PSL", background "Bkgd" and background-corrected "sDxA"; red (Data) intensity, background and background-corrected; ratio (sDxA) Data / Ctrl = (Red b.g.-corrected)/(Green b.g.-corrected); systematic name.

Spot    G int.    G b.g.  G corr.   R int.    R b.g.    R corr.   Ratio     Name
A_1_1   59358.75  512.92  58845.83  50953.13  1779.913  49173.22  0.835628  YAL003W
A_1_2   1209.19   512.92  696.271   2522.345  1779.913  742.4323  1.066298  YAR053W
A_1_3   1948.2    512.92  1435.28   3100.152  1779.913  1320.239  0.919848  YBL078C
A_1_4   4940.806  512.92  4427.886  6670.604  1779.913  4890.691  1.104521  YAL008W
A_1_5   1485.59   512.92  972.671   2916.086  1779.913  1136.173  1.168096  YAR062W
A_1_6   32642.03  512.92  32129.11  42304.13  1779.913  40524.22  1.261293  YBL087C
A_1_7   6919.441  512.92  6406.521  8540.246  1779.913  6760.333  1.055227  YAL014C
A_1_8   2698.301  512.92  2185.382  4314.47   1779.913  2534.557  1.159778  YAR068W
A_1_9   7167.958  512.92  6655.038  7379.286  1779.913  5599.373  0.841374  YBL100C
A_1_10  5470.062  512.92  4957.142  6953.799  1779.913  5173.886  1.043724  YAL025C
A_1_11  27879.49  512.92  27366.57  33746.9   1779.913  31966.99  1.168103  YBL002W
A_1_12  2589.613  512.92  2076.693  4385.568  1779.913  2605.655  1.254713  YBL107C
A_1_13  6196.245  512.92  5683.326  8840.475  1779.913  7060.562  1.242329  YDR044W
A_1_14  34737.1   512.92  34224.18  36129.62  1779.913  34349.7   1.003668  YDR134C
A_1_15  34035.35  512.92  33522.43  27128.53  1779.913  25348.62  0.756169  YDR233C
A_1_16  1638.381  512.92  1125.461  2988.042  1779.913  1208.129  1.073453  YDR048C
A_1_17  3873.718  512.92  3360.799  4955.141  1779.913  3175.228  0.944784  YDR139C
A_1_18  2433.625  512.92  1920.706  3502.406  1779.913  1722.493  0.896802  YDR252W
A_1_19  1800.736  512.92  1287.816  3011.855  1779.913  1231.942  0.956613  YDR053W
A_1_20  1296.689  512.92  783.77    2636.549  1779.913  856.6356  1.092968  YDR149C
A_1_21  3453.24   512.92  2940.32   4968.026  1779.913  3188.113  1.084274  YDR260C
A_1_22  10731.55  512.92  10218.63  9307.246  1779.913  7527.333  0.736629  YDR056C
A_1_23  6191.309  512.92  5678.39   8808.398  1779.913  7028.485  1.23776   YDR152W
A_1_24  3589.998  512.92  3077.078  4420.744  1779.913  2640.831  0.858227  YDR269C
A_1_25  27568.34  512.92  27055.42  20856.2   1779.913  19076.29  0.705082  YGL189C
A_1_26  1956.182  512.92  1443.262  3150.716  1779.913  1370.803  0.949795  YGL261C

Gene function
translation elongation factor eef1beta
hypothetical protein
essential for autophagy
protein of unknown function
putative pseudogene
60s large subunit ribosomal protein l23.e
strong similarity to hypothetical protein yhr214w
questionable orf
nuclear viral propagation protein
histone h2b.2
hypothetical protein
coproporphyrinogen iii oxidase
strong similarity to flo1p, flo5p, flo9p and ylr110
similarity to hypothetical protein ydl204w
questionable orf
ubiquitin-like protein
strong similarity to egd1p and to human btf3 pro
questionable orf
questionable orf
hypothetical protein
hypothetical protein
weak similarity to c.elegans hypothetical protein
questionable orf
40s small subunit ribosomal protein s26e.c7
strong similarity to members of the srp1/tip1 fa
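The ratio column of the table above can be reproduced from the raw intensities: subtract each channel's background, then divide the Data (red) signal by the Ctrl (green) signal. A minimal sketch using the first row's numbers (the function name is illustrative):

```python
def background_corrected_ratio(ctrl_raw, ctrl_bkgd, data_raw, data_bkgd):
    """Return (Data - background) / (Ctrl - background) for one spot."""
    ctrl = ctrl_raw - ctrl_bkgd
    data = data_raw - data_bkgd
    if ctrl <= 0:
        raise ValueError("control signal is not above background")
    return data / ctrl


# Values of the first data row (YAL003W) as given in the table.
ratio = background_corrected_ratio(59358.75, 512.92, 50953.13, 1779.913)
print(round(ratio, 6))
```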
Data Normalization
• Normalize data to correct for variances
– Dye bias
– Location bias
– Intensity bias
– Pin bias
– Slide bias
• Control vs. non-control spots
Data Normalization
[Figure: uncalibrated scan, red light under-detected; calibrated scan, red and green equally detected]
Data Normalization
• Assumptions
– The overall mean ratio should be 1
• Most genes are not differentially expressed
– The total intensities of the two dyes are equivalent
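Under these assumptions, the simplest correction shifts all log ratios by one constant so that their center sits at 0 (ratio 1). The sketch below centers on the median rather than the mean, a swapped-in choice that is less sensitive to a few strongly changed genes. The intensities are made up.

```python
import math
import statistics


def global_normalize(red, green):
    """Shift log2(R/G) so the median log ratio becomes 0 (median is an assumption)."""
    log_ratios = [math.log2(r / g) for r, g in zip(red, green)]
    shift = statistics.median(log_ratios)
    return [m - shift for m in log_ratios]


# Hypothetical slide where red is under-detected by a factor of 2,
# plus one gene that is truly up-regulated.
red = [500.0, 1000.0, 2000.0, 4000.0, 800.0]
green = [1000.0, 2000.0, 4000.0, 8000.0, 400.0]
normalized = global_normalize(red, green)
print(normalized)  # [0.0, 0.0, 0.0, 0.0, 2.0]
```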
Intensity Dependent
Normalization
After Normalization
Additional Normalization
• Pin dependent
– Similar to intensity dependent fit.
– Compute individual lowess fits for each pin
group
• Within slide normalization
– After pin dependent normalization, log ratios
for each pin are centered around 0
– Scale variance for each pin
• Uses MAD (median absolute deviation)
Additional Normalization
• Dye swap
– Combine relative expression levels without explicit normalization
– Compute lowess fit for log2(RR’/(GG’))/2 vs. (A + A’)/2
– Normalized ratio is log2(R/G) - c(A), where c(A) is the lowess prediction and A = log2(RG)/2 is the average log intensity
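The correction log2(R/G) - c(A) can be sketched without a lowess library. In the code below, c(A) is estimated by the mean log ratio of the k spots nearest in A. That is a crude stand-in for a lowess fit, chosen only to keep the example self-contained. The function names and the parameter `k` are illustrative.

```python
import math


def ma_values(red, green):
    """Return M = log2(R/G) and A = log2(R*G)/2 for each spot."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [0.5 * math.log2(r * g) for r, g in zip(red, green)]
    return m, a


def local_trend(m, a, k=5):
    """Estimate c(A) as the mean M of the k spots nearest in A (not real lowess)."""
    trend = []
    for ai in a:
        nearest = sorted(range(len(a)), key=lambda j: abs(a[j] - ai))[:k]
        trend.append(sum(m[j] for j in nearest) / len(nearest))
    return trend


def intensity_normalize(red, green, k=5):
    """Return log2(R/G) - c(A) for each spot."""
    m, a = ma_values(red, green)
    c = local_trend(m, a, k)
    return [mi - ci for mi, ci in zip(m, c)]


# Hypothetical slide with a constant 2-fold dye bias at every intensity.
green = [100.0 * 2 ** i for i in range(8)]
red = [2.0 * g for g in green]
print(intensity_normalize(red, green))  # all zeros after correction
```

In practice a real lowess smoother would replace `local_trend`; the surrounding arithmetic stays the same.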
Data analysis
• Data filtering
• Fold change analysis
• Classification
• Clustering
• Future direction
Microarray Data Classification
[Figure: microarray chips → images scanned by laser → datasets → data mining and analysis → prediction for a new sample]

Gene             Value
D26528_at        193
D26561_cds1_at   -70
D26561_cds2_at   144
D26561_cds3_at   33
D26579_at        318
D26598_at        1764
D26599_at        1537
D26600_at        1204
D28114_at        707

Class  Sno  D26528  D63874  D63880  …
ALL    2    193     4157    556
ALL    3    129     11557   476
ALL    4    44      12125   498
ALL    5    218     8484    1211
AML    51   109     3537    131
AML    52   106     4578    94
AML    53   211     2431    209
…
The Threshold of Spots: Selection/Removal of Genes
• Filtering - remove genes with insufficient variation
– Remove poor-quality spots: saturated, non-uniform, background too high…
– Remove genes with too little signal variation: e.g. MaxVal - MinVal < 500 and MaxVal/MinVal < 5
– Statistical filtering (e.g. p-value < 0.01)
– Biological reasons
– Feature reduction for algorithmic reasons
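The variation filter quoted above (MaxVal - MinVal < 500 and MaxVal/MinVal < 5) can be written as a small predicate. The sketch assumes values are first floored at a small positive number so the ratio stays meaningful for negative or near-zero readings. The `floor` parameter is hypothetical and not from the slides. The expression values are the D26528 and D63874 columns shown earlier.

```python
def passes_variation_filter(values, min_diff=500.0, min_fold=5.0, floor=20.0):
    """Return False for genes that vary too little across samples."""
    floored = [max(v, floor) for v in values]   # assumed floor, see lead-in
    hi, lo = max(floored), min(floored)
    low_variation = (hi - lo) < min_diff and (hi / lo) < min_fold
    return not low_variation


d26528 = [193, 129, 44, 218, 109, 106, 211]
d63874 = [4157, 11557, 12125, 8484, 3537, 4578, 2431]
print(passes_variation_filter(d26528))  # False: range 174, about 4.95-fold
print(passes_variation_filter(d63874))  # True: range 9694
```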
Microarray Data Analysis Types
• Differential gene expression
–Fold change analysis
• Classification (Supervised)
–identify disease
–predict outcome / select best treatment
• Clustering (Unsupervised)
–find new biological classes / refine existing
ones
–exploration
•…
Differential Gene Expression
• n-fold change
– n typically >= 2
– May hold no biological relevance
– Often too restrictive
• 2σ expression
– Calculate the standard deviation σ
– Genes with expression more than 2σ away from the mean are differentially expressed
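Both criteria can be sketched in a few lines. The sketch assumes the 2σ rule is applied to log2 ratios and measured from their mean, neither of which the slide states. The ratios are made up, with one clear outlier.

```python
import math
import statistics


def differential_by_fold(ratios, n=2.0):
    """Flag ratios that are at least n-fold up or down."""
    return [r >= n or r <= 1.0 / n for r in ratios]


def differential_by_sigma(ratios, k=2.0):
    """Flag log2 ratios more than k standard deviations from the mean."""
    logs = [math.log2(r) for r in ratios]
    mu = statistics.mean(logs)
    sigma = statistics.stdev(logs)
    return [abs(x - mu) > k * sigma for x in logs]


# Hypothetical expression ratios for ten genes.
ratios = [1.0, 1.1, 0.9, 1.05, 0.95, 1.0, 1.02, 0.98, 1.0, 8.0]
print(differential_by_fold(ratios))
print(differential_by_sigma(ratios))
```

On this toy list both rules flag only the last gene; on real data they can disagree, because the σ rule adapts to the spread of the whole array.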
Fold Changes-Scatter Plot
[Figure: scatter plot of 72 (raw) against 72 (control) on logarithmic axes]
Fold Changes Table

Group    Description                                                                     GenBank accession No.  6 h    24 h   48 h   72 h
Group 1  caspase 10, apoptosis-related cysteine protease                                 U60519                 -      -      -      0.471
         CASP8 and FADD-like apoptosis regulator                                         U97075                 -      -      -      0.355
         nucleoside diphosphate kinase type 6 (inhibitor of p53-induced apoptosis-alpha) AF051941               -      -      -      0.376
Group 2  caspase 3, apoptosis-related cysteine protease                                  U13738                 -      2.301  -      -
         CASP8 and FADD-like apoptosis regulator                                         AF005775               -      2.272  -      -
Group 3  caspase 9, apoptosis-related cysteine protease                                  U60521                 -      -      2.519  -
Group 4  caspase 4, apoptosis-related cysteine protease                                  Z48810                 2.615  -      2.796  2.819
Group 5  inhibitor of apoptosis protein                                                  AAF19819               -      -      -      5.249
         caspase 7, apoptosis-related cysteine protease                                  U67319                 -      -      -      2.19
         caspase 4, apoptosis-related cysteine protease                                  U28976                 -      -      -      2.603
Group 6  CASP8 and FADD-like apoptosis regulator                                         AF015450               -      -      -      6.912
Classification / categorization
Similar Approach:
• select top genes most correlated to each
class
• select best subset using cross-validation
• build a single model separating all classes
• Advanced:
–build separate model for each class vs. rest
–choose model making the strongest prediction
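The first step, selecting the genes most correlated with the class, can be sketched with a plain Pearson correlation against a 0/1 class label. This covers only the ranking step, not cross-validation or model building. The scoring choice is an assumption, and the expression values are the ALL/AML rows shown earlier.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def rank_genes(expression, labels):
    """Order genes by absolute correlation with the class label, best first."""
    scores = {gene: abs(pearson(values, labels))
              for gene, values in expression.items()}
    return sorted(scores, key=scores.get, reverse=True)


labels = [1, 1, 1, 1, 0, 0, 0]  # 1 = ALL, 0 = AML
expression = {
    "D26528": [193, 129, 44, 218, 109, 106, 211],
    "D63874": [4157, 11557, 12125, 8484, 3537, 4578, 2431],
    "D63880": [556, 476, 498, 1211, 131, 94, 209],
}
ranked = rank_genes(expression, labels)
print(ranked)
```

With only seven samples these scores are very noisy, which is why the slide pairs gene selection with cross-validation.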
Clustering
Goals
• Find natural classes in the data
• Identify new classes / gene correlations
• Refine existing taxonomies
• Support biological analysis / discovery
• Different Methods
–Hierarchical clustering, SOM's, etc
SOM clustering
• SOM - self organizing maps
• Preprocessing
–filter away genes with insufficient biological
variation
–normalize gene expression (across samples) to
mean 0, st. dev 1, for each gene separately.
• Run SOM for many iterations
• Plot the results
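The per-gene normalization in the preprocessing step is a z-score across samples. A minimal sketch (the function name and values are illustrative, and the SOM training itself is not shown):

```python
import statistics


def standardize(values):
    """Rescale one gene's expression across samples to mean 0, st. dev. 1."""
    mu = statistics.mean(values)
    sd = statistics.stdev(values)
    if sd == 0:
        raise ValueError("gene has no variation; filter it out first")
    return [(v - mu) / sd for v in values]


z = standardize([2.0, 4.0, 6.0])
print(z)  # [-1.0, 0.0, 1.0]
```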
SOM & K-Means by GeneSpring
Hierarchical Clustering
• The most popular hierarchical clustering method used in microarray data analysis is the so-called agglomerative method
– works with the data in a bottom-up manner.
• Initially, each data point forms its own cluster; the algorithm then repeatedly merges the two clusters that are most similar or closest together.
– the algorithm involves computing the distance or similarity matrix
• O(N^2) complexity, and thus is not very efficient.
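The bottom-up merging can be sketched in plain Python. The sketch assumes Euclidean distance and average linkage, two choices the slide leaves open, and it recomputes distances in every round, so it is slower than an implementation that caches the distance matrix. The profiles are made up.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def average_linkage(c1, c2, points):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(euclidean(points[i], points[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerate(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]  # each point starts alone
    while len(clusters) > n_clusters:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = average_linkage(clusters[x], clusters[y], points)
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


# Hypothetical two-sample profiles forming two obvious groups.
profiles = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 5.0], [0.05, 0.05]]
clusters = agglomerate(profiles, 2)
print(sorted(sorted(c) for c in clusters))  # [[0, 1, 4], [2, 3]]
```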
Hierarchical clustering
[Figure: Genomic Reprogramming in Response to Oxidant. Clustered heat map of 6218 genes at 0, 10, 20, 40, 60 and 120 minutes; color scale from fold repression (>9, >6, >3) through 1:1 to fold induction (>3, >6, >9). One-third of genome expression is transiently reprogrammed.]
Integrate biological knowledge when
analyzing microarray data (from Cheng
Li, Harvard SPH)
Right picture: Gene Ontology: tool for the unification of biology, Nature Genetics, 25, p25
Microarray Potential Applications
• Biological discovery
– new and better molecular diagnostics
– new molecular targets for therapy
– finding and refining biological pathways
– mutation and polymorphism detection
• Recent examples
– molecular diagnosis of leukemia, breast cancer, ...
– appropriate treatment for genetic signature
– potential new drug targets
Microarray Limitations
• Cross-hybridization of sequences with high identity
• Chip-to-chip variation
• True measure of abundance?
• Do mRNA levels reflect protein levels?
• Generally, do not “prove” new biology - simply suggest genes involved in a process, a hypothesis that will require traditional experimental verification.
• What fold change has biological relevance?
• Need cloned EST or some sequence knowledge -- rare messages may be undetected
• Expensive!! Not every lab can afford to repeat experiments.
• The real limitation is bioinformatics