Expression Data and Microarrays
CMMB
November 29, 2001
Todd Scheetz

Overview
Gene expression
– mRNA
– protein
Northern Blots
RT-PCR
SAGE
MicroArray

Gene Expression Review
Transcription
– generation of mRNA from genomic DNA
– a complete copy is made, including both introns and exons.
[Figure: genomic DNA and pre-mRNA (AAAA...)]

Gene Expression Review
Processing / Splicing
– removal of the introns from the pre-mRNA yields mature mRNA
– mature mRNA is also exported from the nucleus to the cytoplasm
– alternative splicing
[Figure: pre-mRNA (AAAA...) and mature mRNAs (splice variants)]

Gene Expression Review
Translation
– takes an mRNA molecule and uses it to construct an amino acid sequence.
– the ribosome is the underlying machinery used in the process of translation.

Measuring Gene Expression
Two major differentiating factors…
– Quantitative vs. Qualitative
– mRNA vs. protein
Most techniques can be used to determine quantitative expression levels.
Ex. EST sequencing

Measuring Gene Expression
More sophisticated experiments…
– Comparing expression levels of multiple genes
– Comparing co-regulation or differential regulation.
Ex. EST sequencing

Northern Blot
Measures relative expression levels of mRNA
1. Isolate and purify mRNA
2. Electrophorese on a gel
3. Probe the gel by hybridizing with a labeled clone for the gene under study.

RT-PCR
Measures relative expression of mRNA
1. Isolate and purify mRNA
2. Reverse transcription
3. PCR amplification
4. Run on gel and probe/hybridize

RT-PCR
Why use RT?
– Can observe very low levels of expression
– Requires very small amounts of mRNA
The bad…
– Potential expression-level skew due to nonlinearity of PCR
– Have to design multiple custom primers for each gene.

SAGE
Tags are isolated and concatemerized.
Relative expression levels can be compared between cells in different states.

SAGE -- gene to tag mapping
http://www.ncbi.nlm.nih.gov/SAGE/SAGEcid.cgi?cid=28726

MicroArray
What are they?
– allow 1000’s of expression analyses to be performed concurrently.
What technologies are used?
How to analyze the image?
How to analyze the expression data?
What bioinformatics challenges are there?

Potential Microarray Applications
• Drug discovery / toxicology studies
• Mutation/polymorphism detection
• Differing expression of genes over:
  – Time
  – Tissues
  – Disease States
• Sub-typing complex genetic diseases

DNA Array Technology
Array Type              Spot Density (per cm²)   Probe    Target   Labeling
Nylon Macroarrays       < 100                    cDNA     RNA      Radioactive
Nylon Microarrays       < 5000                   cDNA     mRNA     Radioactive/Fluorescent
Glass Microarrays       < 10,000                 cDNA     mRNA     Fluorescent
Oligonucleotide Chips   < 250,000                oligos   mRNA     Fluorescent

Physical Spotting

MicroArray
Glass Microarray
326 Rat Heart Genes, 2x spotting

Photolithographic

MicroArray
Overview of data capture
– two different mRNA populations, labeled with different fluors
– the fluors are excited by a laser
– each fluor excites at a different wavelength, which is captured using a photodetector attached to a filter tuned to the particular fluor

MicroArray
Overview of image analysis
– spot identification
  • grid alignment
  • skew
– image normalization
  • variable background
  • uneven hybridization

Microarray Data Pipeline

Image Analysis/Data Quantization
• Feature (target probe) segmentation
• Data extraction and quantization of:
  – Background
  – Feature
• Correlation of feature identity and location within image
• Display of pseudo-color image

Image Segmentation

Microarray Experiment Design
• Type I: (n = 2)
  – How is this gene expressed in target 1 as compared to target 2?
  – Which genes show up/down regulation between the two targets?
• Type II: (n > 2)
  – How does the expression of gene A vary over time, tissues, or treatments?
  – Do any of the expression profiles exhibit similar patterns of expression?
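
The Type I question above (which genes show up/down regulation between the two targets) is commonly approached by taking a per-gene log2 ratio of the two intensities. The following is a minimal Python sketch of that idea, not the method used in the lecture. The intensity values, the gene names, and the two-fold cutoff are all hypothetical and chosen only for illustration.

```python
import math

def log2_ratios(target1, target2):
    """Return {gene: log2(target2 / target1)} for genes measured in both targets.

    Genes with a non-positive intensity in either target are skipped,
    because the log ratio is undefined for them.
    """
    return {
        gene: math.log2(target2[gene] / target1[gene])
        for gene in target1
        if gene in target2 and target1[gene] > 0 and target2[gene] > 0
    }

def call_regulation(ratios, cutoff=1.0):
    """Label each gene 'up', 'down', or 'unchanged'.

    cutoff is in log2 units; 1.0 (a two-fold change) is an assumed,
    arbitrary threshold, not a value taken from the slides.
    """
    calls = {}
    for gene, ratio in ratios.items():
        if ratio >= cutoff:
            calls[gene] = "up"
        elif ratio <= -cutoff:
            calls[gene] = "down"
        else:
            calls[gene] = "unchanged"
    return calls

# Hypothetical background-subtracted intensities for three genes.
target1 = {"geneA": 1000.0, "geneB": 4000.0, "geneC": 2000.0}
target2 = {"geneA": 4000.0, "geneB": 1000.0, "geneC": 2100.0}

calls = call_regulation(log2_ratios(target1, target2))
print(calls)
```

Working in log2 units makes a two-fold increase and a two-fold decrease symmetric around zero (+1 and -1), which is one reason ratios are usually log-transformed before a threshold is applied.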
Motivation & Design Constraints
• Probe set design involves the prioritizing and parsing of an initial data set containing potentially hundreds of thousands of probe candidates to define a reasonably sized set for use in a microarray experiment
• A single hybridization can produce several thousand data tuples, each containing multiple (n>10) measurements
• No “All-in-one” software package is currently available; therefore, communication of data between the packages must be facilitated by the pipeline

Probe Set Design
• Goal of probe set design is to identify a reasonably sized subset of probes from a much larger starting set from a variety of sources
• By defining a set of criteria, an investigator should be able to create new probe sets or refine existing sets
• Pruning a data set should be done in several stages:
  – Use readily available information to limit scope of data
  – Obtain more information about remaining probes
  – Narrow focus based on additional information
  – Iterate until desired data set is obtained

Sample Probe Set Design Criteria
• 1° -- Direct
  – Species
  – Tissue
  – Chromosome
  – Sequence Available
    • Quality
    • Tail/Poly(A) signal
• 2° -- Indirect
  – Map position known?
  – Cluster size
  – Blast results
    • Confidence value
    • Homology (or lack of)
    • Annotation contains words like “transfer”
    • 3’ & 5’ EST reads hit same gene
  – Syntenic Map Information
  – Known phenotypes in other species

cDNA Microarray Slide Creation
• cDNA clones defining a probe set must be re-arrayed from their sources (e.g. local storage or commercial) into a format suitable for amplification and printing (e.g.
96-well microtiter plates)
• Based on the size of the probe set and the limitations of the printer, a parameter set (# of pens, spot spacing, grid dimensions,…) must be defined for printing the probe set onto the slide(s)
• A mapping operation must be performed in order to track each probe from source to destination in order to correlate known information with a particular “spot” in a microarray image

MicroArray
Overview of data analysis
– vs. time
– vs. other genes
  • co-regulation
  • differential regulation
– pathway identification

Data Analysis
• Data analysis consists of several post-quantization steps:
  – Statistics/Metrics Calculations
  – Scaling/Normalization of the Data
  – Differential Expression
  – Coordinated Gene Expression (aka clustering)
• Most software packages perform only a limited number of analysis tasks
• Databases can facilitate the movement of data between packages

Scaling and/or Normalization
• Positive Controls
  – ‘Spiked’ DNA
  – Housekeeping Genes
  – Total Array
• Negative Controls
  – Foreign DNA
  – ‘Empty’ spots

Scaling and/or Normalization
• Linear regression
• Log-linear regression
• Ratio statistics
• Log(ratio) mean/median centering
• Nonlinear regression

MicroArray
Bioinformatics challenges
1. data management
2. utilizing data from multiple experiments (type II)
3. utilizing data from multiple groups
   * with different technologies
   * with only processed data available

[Figure: expression matrix of genes A to E against conditions 1 to 4, with + / - calls and one unknown (?) entry]
[Figure: expression over time (0, 60, 120, 180) and timepoints 0 to 4 for genes A, B, C; database lookup; local alignment of sequences upstream of the ATG within a search window]

MicroArray
data management
– clone - spot
– clone - gene
– raw expression level
– normalized expression level
– annotation/links
– expression profile

MArray Expt Mgmt Redux
Experiment 5-Tuple: (Probe Set_ID, Target_ID, Hyb Condition_ID, Hyb Iteration_ID, GenePix_Analysis_ID)

Database Support (EBI Schema)
http://www.ebi.ac.uk/arrayexpress/
http://www.bioinf.man.ac.uk/microarray/maxd

Differential Expression
• Type I analysis
• Look for genes with vastly different expression under different conditions
  – How do you measure “vastly different”?
  – What role should derived statistics play?

Type I: Differential Expression
[Figure: scatter plot, Gene 1 vs Gene 2, both axes 0 to 60000]

Coordinated Gene Expression
• Type II analysis
• “Eisen”ized data (dendrograms)
• Self-Organizing Maps
• Principal Component Analysis
• k-means Clustering

Hierarchical Clustering

Self Organizing Maps

Current Software
Capability columns in the original table (marked with X per package): Probe Set Design, Quantization, Statistics, Normalization, Diff Exp, CGE

Software Name        Provider
Array Explorer       Spotfire Inc
Array Gauge          FujiFilm
ArrayDesigner        Premier Biosoft Intl Inc
ArraySCOUT           Lion Bioscience
ArrayStat            Imaging Research Inc
ArrayViewer          TIGR
ArrayVision          Imaging Research Inc
arrayWoRx            Applied Precision Inc
Cluster/Xcluster     Stanford University
Crazy Quant          U of Washington
GeneCluster          MIT
GenePix Pro          Axon Instruments
GeneSight            Biodiscovery
GeneSpring           Silicon Genetics
GeneTAC              Genomic Solutions
Imagene              Biodiscovery
MicroArray Suite     Scanalytics
Micromax             NEN
OmniGrid             GeneMachines
Pathways Analysis    Research Genetics
Quant Array          Packard Instrument Co
ScanAlyze            Stanford University
SeqArray             GCG
Spotfinder           TIGR
XDotsReader          Cose
Resolver             Rosetta
Software/Pipeline Integration
• A centralized database facilitates the archival, manipulation, and mining of all microarray data
• Most analysis programs can output data in a textual format which is easily input into the database
• Output from one program can be used as input to a second program either directly or through a filtering operation facilitated by the database and a set of programs to mine and manipulate the data
• Data from multiple hybridizations may need to be combined in order to perform coordinated gene expression analysis

Standards...
Want ability to exchange microarray experiment data using a common format.
MGED -- Microarray Gene Expression Group, www.mged.org
MAGEML
Rosetta Inpharmatics GEML -- www.geml.org
MIAME -- Minimum Information About a Microarray Experiment

Data and Limitations
Current Controversy: Should the raw data be archived? If so, who should do it?
Each slide (25 mm x 75 mm) is scanned at 200 pixels per mm.
Typical spot size = 100 µm
Center-to-center = 195 µm
Potential spots = 42,000
“Raw” image size = ~250 MB

Other Types of Microarrays
• Genomic BAC arrays
  – allow assessment of “small” deletions
• Tissue arrays
  – allow assessment of protein expression

Type II: Data Partitioning
• Identify genes with similar expression
• Grouping unknown genes with known genes may provide insight into function of unknown genes
• Only useful for genes with varying expression levels

Protein Expression
Protein expression may not correlate with mRNA expression.
How to measure levels of protein expression?
Immunochemistry
2-antibody approach

Protein Expression
Indirect Immunofluorescence
– cells are fixed
– permeabilize the cells
– incubate with primary antibody
– incubate with secondary antibody

Protein Expression
Immunofluorescence
green -- tubulin
red -- gamma tubulin
blue -- DNA

Protein Expression
Immunofluorescence
red -- alpha tubulin
green -- vimentin (cytoskeletal protein)
blue -- DNA

Protein Expression
High-throughput methods
– array multiple tissue samples onto slide, and hybridize
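
Returning to the Scaling and/or Normalization slides: one of the listed options, log(ratio) median centering, can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure of any package named in the deck. It assumes that most spots on the array are not differentially expressed (the "Total Array" idea), so the median log ratio should be zero. The channel names `red` and `green` and all intensity values are hypothetical.

```python
import math
import statistics

def median_center_log_ratios(red, green):
    """Return log2(red/green) per spot, shifted so the array-wide median is 0.

    Assumes both lists are the same length and contain positive,
    background-subtracted intensities (an assumption of this sketch).
    """
    logs = [math.log2(r / g) for r, g in zip(red, green)]
    shift = statistics.median(logs)  # global dye/labeling bias estimate
    return [value - shift for value in logs]

# Hypothetical two-channel intensities for five spots. The red channel is
# uniformly about twice as bright, which mimics a labeling bias.
red = [200.0, 400.0, 800.0, 1600.0, 400.0]
green = [100.0, 200.0, 400.0, 200.0, 400.0]

centered = median_center_log_ratios(red, green)
print(centered)
```

The median is used here instead of the mean because a handful of strongly regulated genes would pull the mean, whereas the median stays near the bulk of unchanged spots. Mean centering is the same code with `statistics.mean` in place of `statistics.median`.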