Summer Inst. of Epidemiology and Biostatistics, 2008: Gene Expression Data Analysis
8:30am-12:30pm in Room W2017
Carlo Colantuoni – [email protected]
http://www.biostat.jhsph.edu/GenomeCAFE/GeneExpressionAnalysis/GEA2008.htm

Class Outline
• Basic Biology & Gene Expression Analysis Technology
• Data Preprocessing, Normalization, & QC
• Measures of Differential Expression
• Multiple Comparison Problem
• Clustering and Classification
• The R Statistical Language and Bioconductor
• GRADES – independent project with Affymetrix data.

Class Outline - Detailed
• Basic Biology & Gene Expression Analysis Technology
  – The Biology of Our Genome & Transcriptome
  – Genome and Transcriptome Structure & Databases
  – Gene Expression & Microarray Technology
• Data Preprocessing, Normalization, & QC
  – Intensity Comparison & Ratio vs. Intensity Plots (log transformation)
  – Background correction (PM-MM, RMA, GCRMA)
  – Global Mean Normalization
  – Loess Normalization
  – Quantile Normalization (RMA & GCRMA)
  – Quality Control: Batches, plates, pins, hybs, washes, and other artifacts
  – Quality Control: PCA and MDS for dimension reduction
• Measures of Differential Expression
  – Basic Statistical Concepts
  – T-tests and Associated Problems
  – Significance analysis in microarrays (SAM) [ & Empirical Bayes]
  – Complex ANOVA’s (limma package in R)
• Multiple Comparison Problem
  – Bonferroni
  – False Discovery Rate Analysis (FDR)
• Differential Expression of Functional Gene Groups
  – Functional Annotation of the Genome
  – Hypergeometric test?, Χ2, KS, pDens, Wilcoxon Rank Sum
  – Gene Set Enrichment Analysis (GSEA)
  – Parametric Analysis of Gene Set Enrichment (PAGE)
  – geneSetTest
  – Notes on Experimental Design
• Clustering and Classification
  – Hierarchical clustering
  – K-means
  – Classification
    • LDA (PAM), kNN, Random Forests
    • Cross-Validation
• Additional Topics
  – The R Statistical Language
  – Bioconductor
  – Affymetrix data processing example!

DAY #1:
• Genome Biology
• The Transcriptome
• Microarray Technology

The Human Genome
• 2 copies of the entire genome in each cell:
  – 3.3 billion “bases” (Gb)
  – ~30K genes
  – millions of variants
• We each get 1 copy from MOM & 1 from DAD. Each parent passes on a “mixed copy” (from their parents).
• Each copy of the genome is contained in 23 chromosomes: 22 + X or Y (2 copies = 46 / cell).
• All in DNA!
[Diagram: DAD and MOM each contribute one copy to YOU]

DNA
• A deoxyribonucleic acid or DNA molecule is a double-stranded polymer composed of four basic molecular units called nucleotides.
• Each nucleotide contains a phosphate group, a deoxyribose sugar, and one of four nitrogen bases: adenine (A), guanine (G), cytosine (C), and thymine (T).
• The two chains are held together by hydrogen bonds.
• Base-pairing occurs according to the following rule: G pairs with C, and A pairs with T.
• Directionality & Complementarity: Reverse Complements hybridize.

DNA
How do these molecular interactions influence directionality and complementarity?
G-C pairs are “stickier” than A-T pairs (three hydrogen bonds versus two).
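The base-pairing rule and the "reverse complements hybridize" point above can be illustrated with a minimal Python sketch. The function names and the example sequence are illustrative assumptions, not part of the course material; the hydrogen-bond count (3 per G-C pair, 2 per A-T pair) is used only as a rough proxy for "stickiness".

```python
# Watson-Crick pairing rule from the slide: G pairs with C, A pairs with T.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq: str) -> str:
    """Return the strand that would hybridize to `seq`, written 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def gc_fraction(seq: str) -> float:
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def hydrogen_bonds(seq: str) -> int:
    """Hydrogen bonds in the duplex: 3 per G-C pair, 2 per A-T pair."""
    return sum(3 if base in "GC" else 2 for base in seq.upper())


# Hypothetical example sequence (not from the slides).
example = "GATTACA"
partner = reverse_complement(example)
print(example, partner, gc_fraction(example), hydrogen_bonds(example))
```

Taking the reverse complement twice returns the original sequence, which is a quick sanity check on the pairing table.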
Another View of DNA
Where does an individual gene lie in this schematic?

Central Dogma of Modern Cellular & Molecular Biology:
• Transcription. From DNA to mRNA: Transcription occurs at Genes.
• Transcript Processing
• Translation. From RNA to Protein: In the exons of protein-coding genes (and their mRNA intermediates), each codon (3 bases) encodes 1 amino acid in the protein.

Perspective: Biological Setup
Every cell in the human body contains the entire human genome: 3.3 Gb in which ~30K genes exist.
The investigation of gene expression is meaningful because different cells, in different environments, doing different jobs express different genes.
Cellular “Plans”: DNA - RNA - PROTEIN

Cellular Biology, Gene Expression, and Microarray Analysis
A protein-coding gene is a segment of chromosomal DNA that directs the synthesis of a protein via an mRNA intermediate.
DNA - RNA - Protein
How do we design and implement probes that will effectively assay expression of ALL (most? many?) genes simultaneously?

Laboratory Methods: The Genome and The Transcriptome
• Easy to sequence some genomic DNA.
• Easy to sequence some expressed mRNA’s.
• NOT EASY to catalogue all genomic DNA, all expressed mRNA’s, and to map out the exact relations between all these sequences.

Molecular Cell Biology: Components of the Central Dogma
[Diagram: Genomic DNA (3.3 Gb); Transcription; mRNA (5’ UTR, START, protein coding, STOP, 3’ UTR, AAAAA); Translation; Protein]

Gene: Protein coding unit of genomic DNA with an mRNA intermediate.

Sequence is a Necessity.
[Diagram: DNA Probe against the mRNA (5’ UTR, START, protein coding, STOP, 3’ UTR, AAAAA); Transcription; Genomic DNA, 3.3 Gb, ~30K genes]

From Genomic DNA to mRNA Transcripts
[Diagram labels: EXONS, INTRONS]
Protein-coding genes are not easy to find - gene density is low, and exons are interrupted by introns.
[Diagram labels: ~30K, >30K]
• Alternative splicing
• Alternative start & stop sites in same RNA molecule
• RNA editing & SNPs
• Transcript coverage
• Homology to other transcripts
• Hybridization dynamics
• 3’ bias

Designing DNA Probes From Genomic DNA Sequence
• Sequence & assemble the entire human genome.
• Search for genes predicted to produce mRNA transcripts.
• Protein-coding genes are not easy to find - gene density is low, and exons are interrupted by introns.
• Completeness?
• Design DNA probes.
[ Genomic DNA databases & assembly ]

Designing DNA Probes From mRNA Sequences
• Sequence ALL expressed mRNA molecules.
• Completeness?
• Design DNA probes.
• Unsurpassed as source of expressed sequence
• Sequence Quality! Redundancy! Completeness? Chaos?!?

From Genomic DNA to mRNA Transcripts
[Diagram labels: ~30K, >30K, >>30K; Transcript-Based, Gene-Centered Information]

DAY #1:
• Genome Biology
• The Transcriptome
• Microarray Technology

RNA Expression Measurement: Northern Blot
[Workflow diagram: SAMPLE 1 and SAMPLE 2; RNA Extraction; RNA 1 and RNA 2 (“target”); electrophoretic separation; electrophoretic transfer to membrane; hybridization of labeled probe. Design + construction of the labeled “probe” draws on a sequence database (Seq DB).]

RNA Expression Measurement: Northern Blot & Microarrays
• Northern (one Probe, Target): Northern blots seek to interrogate the expression of ONE gene in a SINGLE hybridization reaction.
• Microarray (many Probes, Target): Microarrays seek to interrogate the expression of MANY genes simultaneously in a MULTIPLEX hybridization reaction.
SEQUENCE knowledge is REQUIRED for BOTH!
Target: unknown (sample)
Probe: known (synthetic)

Hybridization on a Northern Blot
[Diagram: 1 Labeled Probe; MANY Unlabeled Targets on the MEMBRANE; 1 Hybrid]
Target: unknown
Probe: known
Edwin Southern et al, Nature Genetics Suppl 1999

Hybridization on a Microarray
[Diagram: MANY Labeled Targets; MANY Unlabeled Probes on the Solid Support; MANY Hybrids]
Target: unknown
Probe: known
Edwin Southern et al, Nature Genetics Suppl 1999

Essentials of Microarray Experimental Design:
• Probe sequence selection & design
• Probe deposition on solid support
• Target Labeling
• Target Hybridization
• Signal detection

cDNA Microarray Fabrication
• Bacterial clones in 96 well plates
• Printing onto standard glass microscope slides or nylon

cDNA Microarray Experimentation
[Diagram: Sample and Standard RNA; Cy5- and Cy3-labeled cDNA; Hybridized Microarray; Scan]

cDNA Microarray Scanning
[Figure: Cy5 Channel Data; Cy3 Channel Data; Merged Image; Quantification]

cDNA Microarray Quantification
[Figures: Log Intensity vs. Log Intensity plot; Log Ratio [ / ] vs. Log Intensity [ + ] plot]

Agilent (HP) Microarrays
44,000 oligonucleotides (60 NT’s) synthesized in situ using inkjet printing and solid phase phosphoramidite chemistry. 2-channel fluorescence on glass slides.

NIA Microarray
10K Full Length cDNA’s Spotted on Nylon. P33. One-Channel.

Affymetrix GeneChip
1,300,000 oligonucleotides (25 NT’s) in 54,000 “probe sets” (11 PM’s and 11 MM’s). Oligo’s synthesized in situ on a silicon wafer using photolithography. One-channel data generated using biotin labeling.
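The perfect-match (PM) / mismatch (MM) pairing within an Affymetrix probe set can be sketched in Python using the reference sequence and 25-mer probes shown on the Probe Set Design slide. This is a hedged illustration only: the function names are made up, and the rule "MM = PM with the middle base replaced by its complement" is an assumption that happens to reproduce the PM/MM pair printed on the slide, not a description of Affymetrix's design software.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def perfect_match(target: str) -> str:
    """Base-by-base complement of the target region.

    The result is written aligned under the 5'->3' reference, i.e. it reads
    3'->5' from left to right, as drawn on the slide.
    """
    return "".join(COMPLEMENT[base] for base in target)


def mismatch(pm_probe: str) -> str:
    """Copy of the PM probe with its middle base swapped for the complement.

    Assumption: the middle position is index len // 2 (the 13th base of 25).
    """
    mid = len(pm_probe) // 2
    return pm_probe[:mid] + COMPLEMENT[pm_probe[mid]] + pm_probe[mid + 1:]


# Reference sequence fragment from the slide (5' -> 3').
reference = "TGTGATGGTGCATGATGGGTCAGAAGGCCTCCGATGCGCCGATTGAGAAT"

# The 25-base target region starts at offset 10 of this fragment.
target = reference[10:35]

pm = perfect_match(target)
mm = mismatch(pm)
print("PM:", pm)
print("MM:", mm)
```

The MM probe is intended to capture non-specific binding only, so that PM and MM signals can be compared during background correction.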
Affymetrix GeneChip: Affymetrix Probe Set Design
Reference sequence (5’ to 3’):
…TGTGATGGTGCATGATGGGTCAGAAGGCCTCCGATGCGCCGATTGAGAAT…
Perfect match (PM): GTACTACCCAGTCTTCCGGAGGCTA (NSB & SB)
Mismatch (MM): GTACTACCCAGTGTTCCGGAGGCTA (NSB)
NSB = non-specific binding; SB = specific binding.

NimbleGen Microarrays
195,000 oligonucleotides (60 NT’s): 5 probes / gene. One-channel data. Oligonucleotides synthesized in situ on a glass slide using a maskless digital micromirror device.

Amersham’s CodeLink Arrays
54,841 oligonucleotides (30 NT’s). Spotted into a 3-D aqueous polyacrylamide gel surface on a glass slide. One-channel data.

ABI’s Human Genome Survey Array
31,077 oligonucleotides (60 NT’s). Oligonucleotides spotted into a 3-D nylon matrix. One-channel data using digoxigenin/AP.

Illumina’s BeadChip
1,700,000 oligonucleotides (50 NT’s) immobilized on beads and represented ~30 times (6 full arrays per glass slide). Oligonucleotides anchored on beads distributed in random arrays of plasma etched pits in the silicon wafer. One-channel data using biotin.

Essentials of Microarray Experimental Design: (follow-up)
• Probe sequence design: Oligo vs. cDNA. Probe length: Specificity & Sensitivity
• Probe deposition on solid support
• Target Labeling: Signal? Amplification?
• Target Hybridization
• Signal detection: 1 vs. 2 channel - most important for experimental and analysis design
Specifics of each technology will determine idiosyncrasies of data preprocessing.

An Example to Remind us of Gene Structure and Gene Cross-Referencing Issues
2 independent probes (!) on your microarray interrogate the same gene (!) and both show an extreme expression change in your cell line following treatment: YES!!!
However, the directionality of this change is opposite: one probe shows induction while the other shows repression: NO !?!
[Figures: cDNA Microarray Quantification, Log Intensity vs. Log Intensity and Log Ratio vs. Log Intensity plots, highlighting probes designed to interrogate expression of the same gene]
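The log ratio and log intensity axes used in the two-channel quantification plots can be computed as in the Python sketch below, which also flags the situation in the example above: two probes for one gene with large changes in opposite directions. The intensities are invented for illustration, treating Cy5 as the sample channel and Cy3 as the standard is an assumption, and the two-fold threshold is an arbitrary choice.

```python
import math


def log_ratio(cy5: float, cy3: float) -> float:
    """log2(Cy5 / Cy3): positive means higher signal in the Cy5 channel."""
    return math.log2(cy5) - math.log2(cy3)


def log_intensity(cy5: float, cy3: float) -> float:
    """Average log2 intensity of the two channels."""
    return 0.5 * (math.log2(cy5) + math.log2(cy3))


def discordant(m1: float, m2: float, threshold: float = 1.0) -> bool:
    """True if both log ratios exceed the threshold in magnitude
    (1.0 = two-fold) but point in opposite directions."""
    return m1 * m2 < 0 and min(abs(m1), abs(m2)) >= threshold


# Hypothetical (Cy5, Cy3) intensities for two probes against the same gene.
probes = {"probe_1": (4096.0, 512.0), "probe_2": (256.0, 2048.0)}

results = {
    name: (log_ratio(cy5, cy3), log_intensity(cy5, cy3))
    for name, (cy5, cy3) in probes.items()
}
for name, (m, a) in results.items():
    print(f"{name}: log ratio = {m:+.1f}, log intensity = {a:.1f}")

print("discordant:", discordant(results["probe_1"][0], results["probe_2"][0]))
```

Working on the log scale makes induction and repression symmetric around zero, which is why opposite-direction probes show up as mirror images on the log ratio axis.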
From Genomic DNA to mRNA Transcripts
SF1 in Entrez Gene (RefSeq): A Complex Transcriptional Profile
[Figure annotations: “Lacks regulatory SPSP phosphorylation motif”; “Probe Decreased”; “Probe Increased”]

SF1 in AceView: A Complex Transcriptional Profile!

Gene: Protein coding unit of genomic DNA with an mRNA intermediate.

Sequence is a Necessity.
[Diagram: DNA Probe against the mRNA (5’ UTR, START, protein coding, STOP, 3’ UTR, AAAAA); Transcription; Genomic DNA, 3.3 Gb, ~30K genes]

From Genomic DNA to mRNA Transcripts
[Diagram labels: EXONS, INTRONS; ~30K, >30K]
Protein-coding genes are not easy to find - gene density is low, and exons are interrupted by introns.
• Alternative splicing
• Alternative start & stop sites in same RNA molecule
• RNA editing & SNPs
• Transcript coverage
• Homology to other transcripts
• Hybridization dynamics
• 3’ bias

UCSC Genome Browser: Genes, Transcripts, Probes (Live Web Demo)
UCSC example with genes, transcripts, and probe mapping – custom tracks.
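One way to see how two probes for the same gene can disagree is to map each probe onto the exons of each transcript isoform, as the genome browser demo does visually. The Python sketch below is a toy version under stated assumptions: all coordinates, isoform names, and probe names are hypothetical (they are not SF1 annotations), intervals are half-open, and a probe counts as interrogating a transcript only if it lies entirely within one exon (probes spanning exon junctions are ignored).

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # half-open genomic interval [start, end)


def probe_in_transcript(probe: Interval, exons: List[Interval]) -> bool:
    """True if the probe lies entirely within a single exon."""
    start, end = probe
    return any(ex_start <= start and end <= ex_end for ex_start, ex_end in exons)


def map_probes(
    probes: Dict[str, Interval], transcripts: Dict[str, List[Interval]]
) -> Dict[str, List[str]]:
    """For each probe, list the transcripts whose exons contain it."""
    return {
        name: [t for t, exons in transcripts.items() if probe_in_transcript(iv, exons)]
        for name, iv in probes.items()
    }


# Hypothetical gene with two isoforms that differ in their last exon.
transcripts = {
    "isoform_long": [(100, 200), (300, 400), (500, 650)],
    "isoform_short": [(100, 200), (300, 400), (700, 800)],
}

# Hypothetical probe positions.
probes = {
    "probe_1": (560, 620),  # inside the last exon of isoform_long only
    "probe_2": (720, 780),  # inside the last exon of isoform_short only
    "probe_3": (320, 380),  # inside an exon shared by both isoforms
}

mapping = map_probes(probes, transcripts)
for name, hits in mapping.items():
    print(name, "->", hits)
```

In this toy setup, a treatment that shifts expression from one isoform to the other would raise one isoform-specific probe and lower the other, while the shared-exon probe reports the total.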