Download Lecture 2

Document related concepts

Replisome wikipedia , lookup

Pathogenomics wikipedia , lookup

Point mutation wikipedia , lookup

Zinc finger nuclease wikipedia , lookup

Deoxyribozyme wikipedia , lookup

Cre-Lox recombination wikipedia , lookup

Gene expression profiling wikipedia , lookup

Vectors in gene therapy wikipedia , lookup

Primary transcript wikipedia , lookup

Human genome wikipedia , lookup

Epigenomics wikipedia , lookup

Genealogical DNA test wikipedia , lookup

Gene wikipedia , lookup

Nutriepigenomics wikipedia , lookup

Metagenomics wikipedia , lookup

Comparative genomic hybridization wikipedia , lookup

Non-coding DNA wikipedia , lookup

History of genetic engineering wikipedia , lookup

Cell-free fetal DNA wikipedia , lookup

Designer baby wikipedia , lookup

No-SCAR (Scarless Cas9 Assisted Recombineering) Genome Editing wikipedia , lookup

Genomics wikipedia , lookup

Genome evolution wikipedia , lookup

Genomic library wikipedia , lookup

Genome-wide association study wikipedia , lookup

Microevolution wikipedia , lookup

Genome editing wikipedia , lookup

Site-specific recombinase technology wikipedia , lookup

Bisulfite sequencing wikipedia , lookup

RNA-Seq wikipedia , lookup

Artificial gene synthesis wikipedia , lookup

Helitron (biology) wikipedia , lookup

Therapeutic gene modulation wikipedia , lookup

Tag SNP wikipedia , lookup

SNP genotyping wikipedia , lookup

Molecular Inversion Probe wikipedia , lookup

Transcript
High-density short oligonucleotide
microarrays: a low-level viewpoint
Terry Speed, UC Berkeley
Hotelling Lecture #2
March 28, 2007
1
Plan
To introduce the chips, the probes, their uses, the
associated assays, and the impact of all these on
what I will call the low-level statistical analyses.
I’ll discuss 2 or 3 examples in a little more detail,
and touch on several others.
For simplicity, I concentrate on Affymetrix arrays,
called GeneChips©, though very similar results
apply to the Nimblegen arrays.
2
Generic Affymetrix chip
*
*
*
*
*
*
5µ
5µ
> 1 million identical
25bp probes / feature
1.28cm
1.28cm
6,500,000 features / chip
Feature size has gone from 100µ to 18µ to 11µ and is now 5µ;
1µ, 0.8µ coming soon, with a huge increase in # of features/chip.
Affymetrix has traditionally used
both PM and MM probes
Target TAGCCATCGGTAAGTACTCAATGAT
Perfect Match ATCGGTAGCCATTCATGAGTTACTA
Target TAGCCATCGGTAAGTACTCAATGAT
MisMatch ATCGGTAGCCATACATGAGTTACTA
The MM probes are supposed to permit correction for
background. In some recent products they have been dropped,
4
their value not justifying use of half the area on the chip.
Cartoon version of the process
(courtesy of Rafael Irizarry)
followed by some real images
5
After fragmenting, before target labelling
Target fragments
Probe sequences fixed to chip
6
After labelling, before hybridization
Labelled s.s. target DNA
7
After hybridization and wash
8
Ideal quantification
4
2
0
3
9
Actual quantification:
Imaging by a scanner
10
More realistically (R=hi, B=lo)
PM
MM
Probe2 Probe1
5’-
-3’11
Image analysis
Features (PM or MM) are pixellated. One method of calling SNPs
(DM) uses pixel-level data. Others, including the ones discussed
12
today use a summary value, typically the 75th %ile.
Gene expression arrays
Using the standard 3'-biased probe sets.
There is a more extreme 3'-biased chip (Human X3P) with all probes in
the 300 bp at the 3' end of genes: for use with FFPE samples; it uses
a different assay, and no doubt presents a more challenging analysis.
13
Probe selection
Probes are selected from the 3'-end of a target mRNA
sequence. 5-50K target fragments are interrogated by
probe sets of 11-20 probes. Affymetrix uses PM and MM
probes. Probe selection is an important issue.
14
The standard assay: beginning
T7-Oligo(dT) Promoter Primer
15
www.affymetrix.com
The assay: continued
In vitro transcription
T7 RNA polymerase
16
The assay: completed
•
•
•
A grid is superimposed on the chip image
Feature (= probe) level summaries are calculated
We now have intensities for 11-20 probes per
probe set (≈ gene). These need to be summarized
into a single number representing mRNA
abundance ≈ gene expression level.
17
Chip dat file – checkered board – oligo B2
18
Chip dat file – checkered board – close up
19
Chip dat file – checkered board – close up w/ grid
20
probe set with 6/16 probes not
responding across 13 chips (here concentrations)
Probe intensity
Probe
level
data:
Probe
Intensity
vs conc
ex 2 a
Concentration (pM)
21
Principal low-level analysis steps
•
•
•
•
Background adjustment
Normalization at probe level
These steps to remove lab/operator/reagent effects
Combining probe level summaries to probe set level
summary: best done robustly, on many chips at once
The last to remove probe affinity effects and discordant
observations (gross errors/non-responding probes, etc)
Possibly a further round of normalization (probe set level)
as lab/cohort/batch effects are frequently still visible
Finally, quality assessment is an important low-level task.
22
Low-level analysis problems haven’t
been solved once and for all; why?
•
•
•
Novel uses for the chips arise, and with them novel assays;
each time, all the issues just mentioned need revisiting
The feature size keeps getting smaller
Fewer and fewer features are used for a given measurement,
allowing more measurements to be made using a single chip
These considerations all place more and more demands on
the low-level analysis: to maintain the quality of existing
measurements, and to obtain good new ones.
23
A novel expression study:
translation in the Drosophila embryo
24
A gene being actively translated:
α-catenin
0-2 h
4-6 h
8-10 h
%
QRT-PCR
0-2 hours
25%
4-6hours
20%
8-10 hours
15%
log
10%
5%
0%
1
2
3
4
5
6
7
8
9
10 11 12
Fraction
EDTA
25
A translationally silent gene:
RpL36
0-2 h
8-10 h
%
QRT-PCR
80%
0-2 hours
70%
8-10 hours
60%
4-6 h
50%
40%
30%
log
?
20%
10%
0%
1
2
3
4
5
6
7
8
9
10 11 12
Fraction
EDTA
26
The novel low-level challenge with
this particular experiment
Normalization: to account for very different amounts of
mRNA in the different fractions: we failed here.
27
Gene expression arrays, 2
Human Exon ST arrays
28
Probes
•
•
•
•
Probes are selected from the entire coding
content, including predicted exons, typically 4
probes per exon (PM only, no MMs, but a pile of
control probes)
No non-coding probes (cf. tiling array).
"Genes" can be represented by a few or
hundreds of probes.
Total number of probes on the chip is over 5
million (1.4 million probe sets).
29
Modified assay: now takes 3+ days
Roughly as before, and then
Uracil DNA glycolsylase
Apurinic/apyrimidinic
endonuclease
Terminal deoxynucleotidyl
Transferase linked to biotin
30
Ex: CD44 on a few cell lines (t of previous)
12
for_Miki.x
ls
10
CD44_check
log2expression
8
6
4
B12-B
C1
C2
C3
C4
C5
C6
L3
2
variant exons v2-10
0
0
5
10
15
20
B12: MDA231
L3: MCF7
Probe sets (of size 4) 1-27 →
v8
B12 - MDA231
L3 - MCF7
v8-10 - ps17-19
25
v9
v10
Each colored line a different cell line
Several novel low-level challenges
Summarizing possibly 100s of probes for a “gene”.
Detecting alternative splicing in (sample, exon) pairs.
[Our robust summarization method RMA does much
of this more or less automatically, giving weights.
Simple summarization was needed to get some
preliminary results, but this all needs improving.
Still early days. In the next slides, lower weights are
darker.]
32
Positive control: probe-wise scores
Samples (rows, in triplicate) x probes (columns, in quartets): entry RMA weight.
33
Positive control: probe set summaries
34
Negative control: probe-wise scores
35
Negative control: probe set summaries
36
Identification of protein binding sites
Whole genome tiling arrays
Promoter arrays
37
Chromatin immunoprecipitation
Summary of ChIP assay:
– Cross-link all proteins to genomic DNA.
– Fragment DNA (with cross-linked proteins still
attached).
– ChIP: enrichment of TF-associated fragments.
– Reverse cross-linking and purify DNA.
Identification of enriched DNA regions:
– Clone, sequence and map back to the genome
– For anticipated binding sites, PCR with specific
primers
– Amplify non-specifically, then hybridize to tiling
array (ChIP-chip)
38
Taking log ratios works well
Here it already looks good (Drosophila arrays). When converted
to p-values (not strictly a low-level task) this becomes quite striking.
39
Novel low level challenge
Which probe-level statistic should we use?
Deciding when we have a binding site. In essence this
is yet another variant on combining information across
probes, but this time we have a shape to detect.
Statistical approaches:
– Two-state hidden Markov models.
Li, Meyer and Liu, Bioinformatics, ‘05; TileMap
– Smoothed or windowed probe-level statistics
Cawley et al., Cell, ‘04; Keles et al., ’04; MAT;
Buck, Nobel and Lieb, Genome Biology ’05
– Ad hoc post-processing of probe-level calls
– Peak fitting
40
Expected log-ratio at a binding sites
Simple model:
Peak amplitude and
width depend on the
efficiency ratio φ′/φ,
where φ and φ′ are
the probabilities of
fragments without
and with binding
sites passing the IP
(φ′ is binding-site
specific). Here θ is
the probability of a
break at any base.
The model includes
a multiplier for PCR.
41
Other uses of ChIP-chip to date
To study replication timing (arrest cells in G1/S, HU, BrdU)
To map origins of replication (arrest cells in G2/M, use HU and
BrdU)
To map histone modifications (AcK9, 2MeK9, 3MeK9 of H3, etc)
To map methylation
To map DNAse hypersensitivity
Each will need its own low-level analysis to get the best out of
the data, as the “shape” of a positive signal will differ in the
different cases.
42
Gene expression arrays, 3
Whole-genome tiling arrays: for the unbiased mapping
of transcription of both coding and non-coding genes.
43
Probes
Every 35 bp, i.e. 10 bp spacing between 25 mers.
The D. melanogaster tiling array contains > 3 million PM
probes covering ≈106 Mb of the euchromatic portion of the fly
genome, excluding repetitive DNA. The human “array” is 7 (14)
chips
Assay
Random priming at the initial RT step, similar to the one for the
Exon array, but leading to double-stranded target DNA.
Low-level analysis challenges
Similar to those with the Exon array, plus the added challenge
of reliably identifying mRNAs from un-annotated parts of the
genome, e.g. ncRNAs and miRNAs.
44
Around Rps9: as expected
45
Around TORC: something new
Transcription away from coding regions
46
Single Nucleotide Polymorphism (SNP) chips
For SNP genotyping
47
Affymetrix SNP chip terminology
Genomic DNA
SNP
A
TAGCCATCGGTA
TAGCCATCGGTANGTACTCAATGAT
GTACTCAATGAT
G
Perfect Match probe for Allele A
ATCGGTAGCCATTCATGAGTTACTA
Perfect Match probe for Allele B
ATCGGTAGCCATCCATGAGTTACTA
Genotyping: answering the question about the two
copies of the chromosome on which the SNP is located:
Is a sample AA (AA) , AB (AG) or BB (GG) at this SNP?
48
Original SNP probe tiling strategy
SNP 0 position
A/G
TAGCCATCGGTA N GTACTCAATGAT
PM 0 Allele A
MM 0 Allele A
ATCGGTAGCCAT T CATGAGTTACTA
ATCGGTAGCCAT A CATGAGTTACTA
PM 0 Allele B
MM 0 Allele B
ATCGGTAGCCAT C CATGAGTTACTA
ATCGGTAGCCAT G CATGAGTTACTA
Central probe quartet
49
Original SNP probe tiling strategy, 2
SNP
A / G +4 Position
TAGCCATCGGTA N GTA C TCAATGATCAGCT
PM +4 Allele A
MM +4 Allele A
GTAGCCAT T CAT G AGTTACTAGTCG
GTAGCCAT T CAT C AGTTACTAGTCG
PM +4 Allele B
MM +4 Allele B
GTAGCCAT C CAT G AGTTACTAGTCG
GTAGCCAT C CAT C AGTTACTAGTCG
+4 offset probe quartet
50
Original SNP probe tiling strategy, 3
Offset quartets
Central quartet
Offset quartets
1
2
3
4
5
6
7
PMA
PMA
PMA
PMA
PMA
PMA
PMA
MMA
MMA
MMA
MMA
MMA
MMA
MMA
PMB
PMB
PMB
PMB
PMB
PMB
PMB
MMB
MMB
MMB
MMB
MMB
MMB
MMB
Repeated on the opposite strand: 56 probes for 10K.
100K had 40; current 0.5M has 8; 1M to come has 6 (!)
Single primer assay: cartoon version
250 ng Genomic DNA
Xba
Xba
Xba
RE digestion
Single Primer
Amplification
Adapter
ligation
Fragmentation
and Labeling
Hyb & scan on
standard hardware
52
Single primer assay: complexity
reduction
53
Probe Intensities on the Chip
Fake (idealized) image for 3 samples on one SNP
Sample1
Genotype=AA
Sample2
Genotype=AB
Sample3
Genotype=BB
Fake, as the probes are not all adjacent on the chip
Idealized, as all the probes are high or low as they should be.
54
In summary: probes used
•
•
Must represent both alleles A and B
May or may not be paired
Probes can be chosen as best performers from
• Either PM or MM (no contest)
• Up to three offsets on either side of the SNP
• On either the forward or reverse strand
• May or may not be replicated
55
Data on one sample for SNP A-1507972
PQ = Probe Quartet.
PQ1
PQ2
PQ3
PQ4
PQ5
PQ6
PQ7
PQ8
PQ9
PQ10
PMA
1812
3069
1886
2159
1327
5487
2429
1878
2928
3894
MMA
420
841
466
2036
643
1107
612
718
2240
1431
PMB
8513 10610
8984
7609
MMB
1194
1343
668
2638
Sense (+) strand
5059 21145 15637 15810 12293 12964
1232
9472
5253
3073
3215
8533
Antisense (-) strand
What is the genotype of this person?
We now look at such data on 90 individuals, ignoring MMs.
56
Boxplots of log2PM for 90 samples by probe:
PMA1, PMB1…sense probes first, then anti-sense.
Observations: B probe intensities larger overall, and less
variable. More BB and AB. Anti-sense probe intensities
are slightly larger than sense probe intensities.
Importantly: we see probe-specific mean levels.
SNP A-1507972
57
Boxplots of log2PM by sample: SNP A-1507972
Observations: the median level varies, the spread varies a lot.
Can you guess genotypes by shape here?
AA and BB have larger spreads, AB smaller.
58
Comment on genotyping
It is relatively easy to get 95% correct (HapMap truth, say).
It is moderately difficult, but possible to get 99.5% right, and
not have any homozygote-heterozygote bias
It is quite hard to get 99.95% right.
Remember that we are talking of genotyping .5M-1M
SNPs, on 100s or 1000s of people.
59
A methodological opportunity
Having up to a million “isomorphic” classification problems
per individual cries out for a parallel treatment.
The RLMM/BRLMM strategy is to keep the individual
classifiers simple - Mahalanobis - and put empirical priors
over the parameters (means, variances) in the models.
This hierarchical approach is a form of shrinkage (aka
borrowing strength), and definitely seems to help.
Needless to say, computer scientists/machine learners
have been here before us, though not apparently with
such a practical problem. They call it multi-task learning.
60
There are very clear lab effects
1, 2 and 3 are different labs; ellipses set by 1
61
Why is this?
•
Our guess is that the PCR step introduces a lot of
SNP to SNP variation
•
We have proxies for measuring PCR effect:
fragment sequence and fragment length and
overall intensity.
•
We can examine the fragment sequence via the
probe sequence
62
There are sequence effects
Lab 1
Lab 2
Lab 3
Location of base along the 25 bp probe
63
Fragment length effects
which are lab specific
Lab means superimposed.
64
As are sequence effects
Log ratios across chips, stratified
by SNP base pair and lab.
65
Effects of our recent normalization
66
You saw the Before, this is After (and a different representation)
Where are we with SNP chips?
The 10K SNP chip used MPAM. Worked well, but not on
The 100K SNP chip, for which DM was developed. Worked
well, but not on
The 500K SNP chip, for which BRLMM now works pretty well.
How well it will work for the 1M SNP chip, we’ll find out soon.
This scale-up by a factor of 50-100 puts many demands on
the low-level analysis methods.
67
Other low-level problems on my plate
Genotyping with pooled DNA
Loss of heterozygosity
Chromosomal copy number
Allele-specific copy number
Allele-specific expression
Re-sequencing
68
Closing remark
Low-level analysis problems will keep arising as new uses for
the chips are devised, and as scaling down (feature size)
scaling up (feature numbers) and multiplexing continue.
With good assays and clear signals (low-hanging fruit), any
reasonable method should give satisfactory results, but to get
the best out of the data - on a genome-wide scale - more
effort is needed.
69
Acknowledgements
Xiaoli Qin, Gerry Rubin,
Soyeon Ahn, Daphne Georlette Anna Lapuk, John Conboy &
Joe Gray, LBNL
UC Berkeley
Ken Simpson, WEHI
Nusrat Rabbee,
Genentech
Mark Biggins, LBNL
Simon Cawley and the team
Richard Bourgon, EMBL
Affymetrix
Henrik Bengtsson
Rafael Irizarry & Benilton Carvalho
UC Berkeley
Johns Hopkins
Andrew Sinclair & Howard
Slater MCRI
Charles Glatt
Cornell Medical College
70
Thank you for listening!
71
SNP chips can be used to determine
LOH and locus copy number
Some sample figures based on the 500K SNP chip
showing deletions and amplifications
72
Size = 424.3 kb, Number of SNPs = 118
Results using of dChip and GLAD.
73
Size = 264 kb, Number of SNPs = 72
74
Size = 168 kb, Number of SNPs = 55
75
Size = 52.9 kb, Number of SNPs = 17
We’re working hard to increase the resolution of this approach.
76
Allele-specific gene expression
Hybrid gene-expression/genotyping arrays
77
Allele-specific expression
78
P.V. Krishna Pant et al. Genome Res. 2006; 16: 331-339
79