Microarrays and Gene Expression
DTC Bioinformatics Course
9th February 2010
Helen Lockstone

Overview
• Background
• Array design
• Applications of array technology
• Steps in data analysis
• Finding differentially expressed genes
• Biological interpretation

Schedule
Time          Topic
9.30-10.30    Introduction to microarray technology and applications
10.30-10.45   Break
10.45-11.30   Microarray data analysis
11.30-13.00   Practical 1
13.00-14.00   Lunch
14.00-14.45   Biological interpretation
14.45-15.00   Break
15.00-17.00   Practical 2

Microarrays in the Literature
[Figure: number of papers per year; y-axis from 0 to 7000]

The Central Dogma
Transcriptome measured by microarrays

Premise of Microarrays
• Compare gene expression between groups
• Differentially expressed genes may provide some biological insight
• But not magical solutions!

Typical Microarray Designs
• Disease vs control
• Good prognosis vs poor prognosis
• Different tumour types
• Effect of treatment
• Effect of stimulus
• Time course
• Different tissues/stages of development

Criticism of Microarrays
• Non-hypothesis driven “fishing expeditions”
• Because microarray experiments are expensive and time-consuming to interpret, often published as a stand-alone experiment
• Produce large amounts of data, interpretations can be very different (but equally valid)
• Further experimental work, following up hypotheses suggested from array data, can produce elegant studies
• Perception that data is unreliable – validation

Microarray Repositories
• GEO – http://www.ncbi.nlm.nih.gov/geo/
• ArrayExpress – http://www.ebi.ac.uk/microarray-as/ae/
• Excellent resource of microarray data
• MIAME guidelines

What is a Microarray?
• Glass slide consisting of hundreds of thousands of probes arranged in grid layout
• Each probe detects a particular RNA species (transcript)
• Hybridisation occurs by complementary base-pairing
• Make quantitative measurements – signal from each probe is proportional to the amount of hybridised RNA
• Interrogate entire genome in single experiment

Microarray Technology
Probes: cDNA; oligonucleotides; PCR products
Design: targeted to genes; tiling (chromosomes, promoters)
Fabrication method: spotted (robotic printing); photolithography (synthesised in-situ)
Type: one-colour (log intensities); two-colour (log ratios)
Labelling molecules: Cy3 (green), Cy5 (red), biotin

Experimental Protocol

Microarray Manufacturers
Affymetrix: established 1992; main microarray technology GeneChip; human whole-genome array released 1994; headquarters Santa Clara, CA
Illumina: established 1998; main microarray technology BeadChip; human whole-genome array released 2005; headquarters San Diego, CA
Roche NimbleGen: established 1999; main microarray technology high-density tiling arrays; headquarters Madison, WI
Agilent: established 1999; main microarray technology aCGH, ChIP-chip, custom; human whole-genome array released 2004; headquarters Santa Clara, CA

Array design

Affymetrix Microarrays
• Manufacturing microarrays for >15 years
• 25 bp probes – 11 individual probes comprise a probe-set, signal combined to estimate gene expression
• Whole human genome array has >50,000 probe-sets
• Size of array surface 1.28 cm²
• 3’ expression arrays – probes designed to 3’ end of transcript

Recent Developments
• Limitations of 3’ array design
  – Assumes representative of entire gene
  – Assumes well-defined 3’ end of gene
  – Can’t assess splicing events
  – Can be difficult to distinguish homologous genes
• Whole transcript arrays
  – 4-probe probe-sets designed to each exon
  – Gene 1.0 and Exon 1.0 arrays

Exon Array Design
Picture from Affymetrix

Illumina BeadChip Arrays
• Beads randomly occupy wells on surface of array
• 30-40 replicates of each bead type (probe)
• Longer probe length – typically one probe per gene

Applications of Microarray Technology

Microarray Applications
• Gene Expression
• Alternative Splicing
• microRNA expression
• SNP Genotyping
• ChIP-chip
• DNA Methylation
• Comparative Genomic Hybridisation

Gene Expression
• Still most common use for microarrays
• Aim to determine differential expression between groups of samples e.g. disease and control
• Generate hypotheses about the mechanisms underlying the disease of interest

Alternative Splicing
• Up to 75% of human genes may produce alternative transcripts
• Increases protein diversity from given set of genes
• Alternative transcripts from same gene can produce proteins with different, even opposite, functions (e.g. Bcl-x)
• Role in disease – mutations can disrupt splice sites or splicing machinery

Alternative Splicing
• Affymetrix exon array allows investigation of alternative splicing
• Custom arrays with junction probes
• Additional layer of analysis

Alternative Poly-A Sites
• Alters length of 3’ UTR – may change which target regions for miRNAs are present

Alternative Splicing

MicroRNAs
• Small non-coding RNAs (~22bp)
• Sequence-specific binding to 3’ UTRs
• Post-transcriptional gene silencing
Picture from He et al. Nature Reviews Cancer 7, 819-822 (2007)

SNP Arrays
• Illumina and Affymetrix
• ~6 million SNPs genome-wide
• Genotype individuals in high-throughput and cost-effective manner
• Genome-wide association studies
• eQTL studies

Tiling Arrays
• Applications so far use arrays with probes designed to genes/miRNAs/SNPs of interest
• Tiling arrays consist of high-density probes covering a particular region(s) of the genome
• Identify novel transcripts, exons

DNA Methylation
• Methylation of cytosine bases (CpG islands) in gene promoter regions can silence transcription
• Epigenetic mechanism
• Two-colour hybridisation

ChIP-chip
• Method to identify transcription factor binding sites in an unbiased fashion
• Cross-link protein (TF) of interest with DNA
• Use immuno-precipitation to pull down DNA fragments bound to the protein (enriched sample)
• Hybridise with genomic DNA to obtain log-ratio
• Again looking for large positive ratios

Comparative Genomic Hybridisation
Trisomy 13 in female compared to reference male
• Detect regions of amplification/deletion (copy number changes)
• Feature of cancer – hybridise sample with reference DNA (copy number=2)
• Potential dosage effects on genes in affected regions

Analysing Gene Expression Data

R and BioConductor
• Powerful, open-source software for statistical analysis and graphical visualisation
• Greater functionality provided by software packages contributed by researchers
• BioConductor packages are specifically for genomic data
  – affy
  – limma
  – vsn

Analysis Steps
• Check quality of the data
• Decide if any samples are outliers
• Preprocessing and normalisation
• Statistical analysis to find differentially expressed genes
• Tools for biological interpretation

Data Quality
• Looking for good signal and similar metrics across all arrays in experiment (after normalisation between arrays)
• Poor signal could indicate a hybridisation problem or degraded sample
• Control probes for hybridisation, labelling and sample can help identify problems

Illumina Array Metrics
• Average signal
• Number of detected genes
• Housekeeping genes signal
• Biotin controls
• Hybridisation controls
• Negative control probe signal

Processing Data
• Background correction
• Transform data to log scale (more suitable for statistical analysis)
• Normalisation between arrays (adjust for systematic differences such as overall brightness)
• Probe-set summarisation (Affymetrix) or across replicate probes (Illumina)

Exploring Data – Boxplots
[Figure: boxplots of signal intensity]

Exploring Data – PCA

Outlier Samples
• Potential outlier samples will look different to others in the experiment
• No definitive rules to decide when to exclude a sample from analysis
  – Depends on size of experiment
  – Can be useful to run analysis with and without outlier to assess effect on results
  – Always re-normalise data excluding any outlier samples before proceeding

Outlier Sample
PCA indicating outlier sample

Filtering
• Lose data but signal from low intensity probes is noisy and can give false positives
• Detection p-values calculated for each probe based on overlap of signal with negative control probe signal distribution
• Criteria
  – Detected in all samples/at least one sample
  – Detected in at least one group

Detecting Differentially Expressed Genes
• Linear Models for Microarray Analysis (limma)
• Handles analysis of simple and complex experimental designs
• For two-group comparisons, analogous to t-test, otherwise ANOVA
• Uses information from all genes to estimate variance
  – Reduces chance of false positives from very low variance genes
  – More robust for small sample sizes

limma
• Fits linear model for each gene
• Test whether slope = 0 for each gene and assign p-values
• Multiple testing correction – FDR
[Figure: log normalised intensity for Group 1 and Group 2]

Effect of other variables
• WT and Mut groups
• Three different litters
• Top gene ~5x higher expression in WT compared to Mut
• Similarly expressed across litters in both genotypes

Strong litter effect
• Overlap between groups
• Within litters, consistent pattern of higher expression in WT vs Mut
• Within genotypes, B>C>A – expression depends on litter
• Accounting for this variance increases power

Limma Output
• Small sample size and subtle effects can mean no probes would be considered statistically significant
• Ranked in order of evidence for differential expression – can still be explored
• Biological interpretation can be most difficult step – tools available
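
The multiple testing correction and the ranked output described for limma can be illustrated with a small sketch. The slides say only "FDR", so the use of the Benjamini-Hochberg procedure here is an assumption, and the probe names and p-values are made up for illustration. The course itself works in R, where `p.adjust(p, method = "BH")` performs the same adjustment; Python is used here only to keep the example self-contained.

```python
def bh_adjust(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down to the smallest so that the
    # adjusted values never increase as the raw p-values decrease.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-probe p-values (illustrative only, not course data).
pvals = {
    "probe_A": 0.001,
    "probe_B": 0.008,
    "probe_C": 0.039,
    "probe_D": 0.041,
    "probe_E": 0.6,
}

names = list(pvals)
adj = bh_adjust([pvals[n] for n in names])

# Rank probes by adjusted p-value, strongest evidence first.
ranked = sorted(zip(names, adj), key=lambda pair: pair[1])
for name, q in ranked:
    print(f"{name}\traw p = {pvals[name]:.3f}\tadjusted p = {q:.5f}")
```

With few samples, none of the adjusted values may fall below a chosen threshold, but the ranking itself remains usable for exploration, as the final slide notes.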