Download Lecture 2

High-density short oligonucleotide microarrays: a low-level viewpoint Terry Speed, UC Berkeley Hotelling Lecture #2 March 28, 2007 1 Plan To introduce the chips, the probes, their uses, the associated assays, and the impact of all these on what I will call the low-level statistical analyses. I’ll discuss 2 or 3 examples in a little more detail, and touch on several others. For simplicity, I concentrate on Affymetrix arrays, called GeneChips©, though very similar results apply to the Nimblegen arrays. 2 Generic Affymetrix chip * * * * * * 5µ 5µ > 1 million identical 25bp probes / feature 1.28cm 1.28cm 6,500,000 features / chip Feature size has gone from 100µ to 18µ to 11µ and is now 5µ; 1µ, 0.8µ coming soon, with a huge increase in # of features/chip. Affymetrix has traditionally used both PM and MM probes Target TAGCCATCGGTAAGTACTCAATGAT Perfect Match ATCGGTAGCCATTCATGAGTTACTA Target TAGCCATCGGTAAGTACTCAATGAT MisMatch ATCGGTAGCCATACATGAGTTACTA The MM probes are supposed to permit correction for background. In some recent products they have been dropped, 4 their value not justifying use of half the area on the chip. Cartoon version of the process (courtesy of Rafael Irizarry) followed by some real images 5 After fragmenting, before target labelling Target fragments Probe sequences fixed to chip 6 After labelling, before hybridization Labelled s.s. target DNA 7 After hybridization and wash 8 Ideal quantification 4 2 0 3 9 Actual quantification: Imaging by a scanner 10 More realistically (R=hi, B=lo) PM MM Probe2 Probe1 5’- -3’11 Image analysis Features (PM or MM) are pixellated. One method of calling SNPs (DM) uses pixel-level data. Others, including the ones discussed 12 today use a summary value, typically the 75th %ile. Gene expression arrays Using the standard 3'-biased probe sets. There is a more extreme 3'-biased chip (Human X3P) with all probes in the 300 bp at the 3' end of genes: for use with FFPE samples; it uses a different assay, and no doubt presents a more challenging analysis. 13 Probe selection Probes are selected from the 3'-end of a target mRNA sequence. 5-50K target fragments are interrogated by probe sets of 11-20 probes. Affymetrix uses PM and MM probes. Probe selection is an important issue. 14 The standard assay: beginning T7-Oligo(dT) Promoter Primer 15 www.affymetrix.com The assay: continued In vitro transcription T7 RNA polymerase 16 The assay: completed • • • A grid is superimposed on the chip image Feature (= probe) level summaries are calculated We now have intensities for 11-20 probes per probe set (≈ gene). These need to be summarized into a single number representing mRNA abundance ≈ gene expression level. 17 Chip dat file – checkered board – oligo B2 18 Chip dat file – checkered board – close up 19 Chip dat file – checkered board – close up w/ grid 20 probe set with 6/16 probes not responding across 13 chips (here concentrations) Probe intensity Probe level data: Probe Intensity vs conc ex 2 a Concentration (pM) 21 Principal low-level analysis steps • • • • Background adjustment Normalization at probe level These steps to remove lab/operator/reagent effects Combining probe level summaries to probe set level summary: best done robustly, on many chips at once The last to remove probe affinity effects and discordant observations (gross errors/non-responding probes, etc) Possibly a further round of normalization (probe set level) as lab/cohort/batch effects are frequently still visible Finally, quality assessment is an important low-level task. 22 Low-level analysis problems haven’t been solved once and for all; why? • • • Novel uses for the chips arise, and with them novel assays; each time, all the issues just mentioned need revisiting The feature size keeps getting smaller Fewer and fewer features are used for a given measurement, allowing more measurements to be made using a single chip These considerations all place more and more demands on the low-level analysis: to maintain the quality of existing measurements, and to obtain good new ones. 23 A novel expression study: translation in the Drosophila embryo 24 A gene being actively translated: α-catenin 0-2 h 4-6 h 8-10 h % QRT-PCR 0-2 hours 25% 4-6hours 20% 8-10 hours 15% log 10% 5% 0% 1 2 3 4 5 6 7 8 9 10 11 12 Fraction EDTA 25 A translationally silent gene: RpL36 0-2 h 8-10 h % QRT-PCR 80% 0-2 hours 70% 8-10 hours 60% 4-6 h 50% 40% 30% log ? 20% 10% 0% 1 2 3 4 5 6 7 8 9 10 11 12 Fraction EDTA 26 The novel low-level challenge with this particular experiment Normalization: to account for very different amounts of mRNA in the different fractions: we failed here. 27 Gene expression arrays, 2 Human Exon ST arrays 28 Probes • • • • Probes are selected from the entire coding content, including predicted exons, typically 4 probes per exon (PM only, no MMs, but a pile of control probes) No non-coding probes (cf. tiling array). "Genes" can be represented by a few or hundreds of probes. Total number of probes on the chip is over 5 million (1.4 million probe sets). 29 Modified assay: now takes 3+ days Roughly as before, and then Uracil DNA glycolsylase Apurinic/apyrimidinic endonuclease Terminal deoxynucleotidyl Transferase linked to biotin 30 Ex: CD44 on a few cell lines (t of previous) 12 for_Miki.x ls 10 CD44_check log2expression 8 6 4 B12-B C1 C2 C3 C4 C5 C6 L3 2 variant exons v2-10 0 0 5 10 15 20 B12: MDA231 L3: MCF7 Probe sets (of size 4) 1-27 → v8 B12 - MDA231 L3 - MCF7 v8-10 - ps17-19 25 v9 v10 Each colored line a different cell line Several novel low-level challenges Summarizing possibly 100s of probes for a “gene”. Detecting alternative splicing in (sample, exon) pairs. [Our robust summarization method RMA does much of this more or less automatically, giving weights. Simple summarization was needed to get some preliminary results, but this all needs improving. Still early days. In the next slides, lower weights are darker.] 32 Positive control: probe-wise scores Samples (rows, in triplicate) x probes (columns, in quartets): entry RMA weight. 33 Positive control: probe set summaries 34 Negative control: probe-wise scores 35 Negative control: probe set summaries 36 Identification of protein binding sites Whole genome tiling arrays Promoter arrays 37 Chromatin immunoprecipitation Summary of ChIP assay: – Cross-link all proteins to genomic DNA. – Fragment DNA (with cross-linked proteins still attached). – ChIP: enrichment of TF-associated fragments. – Reverse cross-linking and purify DNA. Identification of enriched DNA regions: – Clone, sequence and map back to the genome – For anticipated binding sites, PCR with specific primers – Amplify non-specifically, then hybridize to tiling array (ChIP-chip) 38 Taking log ratios works well Here it already looks good (Drosophila arrays). When converted to p-values (not strictly a low-level task) this becomes quite striking. 39 Novel low level challenge Which probe-level statistic should we use? Deciding when we have a binding site. In essence this is yet another variant on combining information across probes, but this time we have a shape to detect. Statistical approaches: – Two-state hidden Markov models. Li, Meyer and Liu, Bioinformatics, ‘05; TileMap – Smoothed or windowed probe-level statistics Cawley et al., Cell, ‘04; Keles et al., ’04; MAT; Buck, Nobel and Lieb, Genome Biology ’05 – Ad hoc post-processing of probe-level calls – Peak fitting 40 Expected log-ratio at a binding sites Simple model: Peak amplitude and width depend on the efficiency ratio φ′/φ, where φ and φ′ are the probabilities of fragments without and with binding sites passing the IP (φ′ is binding-site specific). Here θ is the probability of a break at any base. The model includes a multiplier for PCR. 41 Other uses of ChIP-chip to date To study replication timing (arrest cells in G1/S, HU, BrdU) To map origins of replication (arrest cells in G2/M, use HU and BrdU) To map histone modifications (AcK9, 2MeK9, 3MeK9 of H3, etc) To map methylation To map DNAse hypersensitivity Each will need its own low-level analysis to get the best out of the data, as the “shape” of a positive signal will differ in the different cases. 42 Gene expression arrays, 3 Whole-genome tiling arrays: for the unbiased mapping of transcription of both coding and non-coding genes. 43 Probes Every 35 bp, i.e. 10 bp spacing between 25 mers. The D. melanogaster tiling array contains > 3 million PM probes covering ≈106 Mb of the euchromatic portion of the fly genome, excluding repetitive DNA. The human “array” is 7 (14) chips Assay Random priming at the initial RT step, similar to the one for the Exon array, but leading to double-stranded target DNA. Low-level analysis challenges Similar to those with the Exon array, plus the added challenge of reliably identifying mRNAs from un-annotated parts of the genome, e.g. ncRNAs and miRNAs. 44 Around Rps9: as expected 45 Around TORC: something new Transcription away from coding regions 46 Single Nucleotide Polymorphism (SNP) chips For SNP genotyping 47 Affymetrix SNP chip terminology Genomic DNA SNP A TAGCCATCGGTA TAGCCATCGGTANGTACTCAATGAT GTACTCAATGAT G Perfect Match probe for Allele A ATCGGTAGCCATTCATGAGTTACTA Perfect Match probe for Allele B ATCGGTAGCCATCCATGAGTTACTA Genotyping: answering the question about the two copies of the chromosome on which the SNP is located: Is a sample AA (AA) , AB (AG) or BB (GG) at this SNP? 48 Original SNP probe tiling strategy SNP 0 position A/G TAGCCATCGGTA N GTACTCAATGAT PM 0 Allele A MM 0 Allele A ATCGGTAGCCAT T CATGAGTTACTA ATCGGTAGCCAT A CATGAGTTACTA PM 0 Allele B MM 0 Allele B ATCGGTAGCCAT C CATGAGTTACTA ATCGGTAGCCAT G CATGAGTTACTA Central probe quartet 49 Original SNP probe tiling strategy, 2 SNP A / G +4 Position TAGCCATCGGTA N GTA C TCAATGATCAGCT PM +4 Allele A MM +4 Allele A GTAGCCAT T CAT G AGTTACTAGTCG GTAGCCAT T CAT C AGTTACTAGTCG PM +4 Allele B MM +4 Allele B GTAGCCAT C CAT G AGTTACTAGTCG GTAGCCAT C CAT C AGTTACTAGTCG +4 offset probe quartet 50 Original SNP probe tiling strategy, 3 Offset quartets Central quartet Offset quartets 1 2 3 4 5 6 7 PMA PMA PMA PMA PMA PMA PMA MMA MMA MMA MMA MMA MMA MMA PMB PMB PMB PMB PMB PMB PMB MMB MMB MMB MMB MMB MMB MMB Repeated on the opposite strand: 56 probes for 10K. 100K had 40; current 0.5M has 8; 1M to come has 6 (!) Single primer assay: cartoon version 250 ng Genomic DNA Xba Xba Xba RE digestion Single Primer Amplification Adapter ligation Fragmentation and Labeling Hyb & scan on standard hardware 52 Single primer assay: complexity reduction 53 Probe Intensities on the Chip Fake (idealized) image for 3 samples on one SNP Sample1 Genotype=AA Sample2 Genotype=AB Sample3 Genotype=BB Fake, as the probes are not all adjacent on the chip Idealized, as all the probes are high or low as they should be. 54 In summary: probes used • • Must represent both alleles A and B May or may not be paired Probes can be chosen as best performers from • Either PM or MM (no contest) • Up to three offsets on either side of the SNP • On either the forward or reverse strand • May or may not be replicated 55 Data on one sample for SNP A-1507972 PQ = Probe Quartet. PQ1 PQ2 PQ3 PQ4 PQ5 PQ6 PQ7 PQ8 PQ9 PQ10 PMA 1812 3069 1886 2159 1327 5487 2429 1878 2928 3894 MMA 420 841 466 2036 643 1107 612 718 2240 1431 PMB 8513 10610 8984 7609 MMB 1194 1343 668 2638 Sense (+) strand 5059 21145 15637 15810 12293 12964 1232 9472 5253 3073 3215 8533 Antisense (-) strand What is the genotype of this person? We now look at such data on 90 individuals, ignoring MMs. 56 Boxplots of log2PM for 90 samples by probe: PMA1, PMB1…sense probes first, then anti-sense. Observations: B probe intensities larger overall, and less variable. More BB and AB. Anti-sense probe intensities are slightly larger than sense probe intensities. Importantly: we see probe-specific mean levels. SNP A-1507972 57 Boxplots of log2PM by sample: SNP A-1507972 Observations: the median level varies, the spread varies a lot. Can you guess genotypes by shape here? AA and BB have larger spreads, AB smaller. 58 Comment on genotyping It is relatively easy to get 95% correct (HapMap truth, say). It is moderately difficult, but possible to get 99.5% right, and not have any homozygote-heterozygote bias It is quite hard to get 99.95% right. Remember that we are talking of genotyping .5M-1M SNPs, on 100s or 1000s of people. 59 A methodological opportunity Having up to a million “isomorphic” classification problems per individual cries out for a parallel treatment. The RLMM/BRLMM strategy is to keep the individual classifiers simple - Mahalanobis - and put empirical priors over the parameters (means, variances) in the models. This hierarchical approach is a form of shrinkage (aka borrowing strength), and definitely seems to help. Needless to say, computer scientists/machine learners have been here before us, though not apparently with such a practical problem. They call it multi-task learning. 60 There are very clear lab effects 1, 2 and 3 are different labs; ellipses set by 1 61 Why is this? • Our guess is that the PCR step introduces a lot of SNP to SNP variation • We have proxies for measuring PCR effect: fragment sequence and fragment length and overall intensity. • We can examine the fragment sequence via the probe sequence 62 There are sequence effects Lab 1 Lab 2 Lab 3 Location of base along the 25 bp probe 63 Fragment length effects which are lab specific Lab means superimposed. 64 As are sequence effects Log ratios across chips, stratified by SNP base pair and lab. 65 Effects of our recent normalization 66 You saw the Before, this is After (and a different representation) Where are we with SNP chips? The 10K SNP chip used MPAM. Worked well, but not on The 100K SNP chip, for which DM was developed. Worked well, but not on The 500K SNP chip, for which BRLMM now works pretty well. How well it will work for the 1M SNP chip, we’ll find out soon. This scale-up by a factor of 50-100 puts many demands on the low-level analysis methods. 67 Other low-level problems on my plate Genotyping with pooled DNA Loss of heterozygosity Chromosomal copy number Allele-specific copy number Allele-specific expression Re-sequencing 68 Closing remark Low-level analysis problems will keep arising as new uses for the chips are devised, and as scaling down (feature size) scaling up (feature numbers) and multiplexing continue. With good assays and clear signals (low-hanging fruit), any reasonable method should give satisfactory results, but to get the best out of the data - on a genome-wide scale - more effort is needed. 69 Acknowledgements Xiaoli Qin, Gerry Rubin, Soyeon Ahn, Daphne Georlette Anna Lapuk, John Conboy & Joe Gray, LBNL UC Berkeley Ken Simpson, WEHI Nusrat Rabbee, Genentech Mark Biggins, LBNL Simon Cawley and the team Richard Bourgon, EMBL Affymetrix Henrik Bengtsson Rafael Irizarry & Benilton Carvalho UC Berkeley Johns Hopkins Andrew Sinclair & Howard Slater MCRI Charles Glatt Cornell Medical College 70 Thank you for listening! 71 SNP chips can be used to determine LOH and locus copy number Some sample figures based on the 500K SNP chip showing deletions and amplifications 72 Size = 424.3 kb, Number of SNPs = 118 Results using of dChip and GLAD. 73 Size = 264 kb, Number of SNPs = 72 74 Size = 168 kb, Number of SNPs = 55 75 Size = 52.9 kb, Number of SNPs = 17 We’re working hard to increase the resolution of this approach. 76 Allele-specific gene expression Hybrid gene-expression/genotyping arrays 77 Allele-specific expression 78 P.V. Krishna Pant et al. Genome Res. 2006; 16: 331-339 79

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Lecture 2