Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
BioSci D145 Lecture #4 • Bruce Blumberg ([email protected]) – 4103 Nat Sci 2 - office hours Tu, Th 3:30-5:00 (or by appointment) – phone 824-8573 • TA – Riann Egusquiza ([email protected]) – 4351 Nat Sci 2– office hours M 1-3 – Phone 824-6873 • check e-mail and noteboard daily for announcements, etc.. – Please use the course noteboard for discussions of the material • Updated lectures will be posted on web pages after lecture – http://blumberg-lab.bio.uci.edu/biod145-w2017 • Last year’s midterm is now posted. • Term paper outlines due Friday (2/3) by midnight. BioSci D145 lecture 1 page 1 ©copyright Bruce Blumberg 2014. All rights reserved Term paper outline • Title of your proposal • A paragraph introducing your topic and explaining why it is important; i.e., what impact will the knowledge gained have. – Why should any funding agency give you money to pursue this research? • NIH now requires a statement of human health relevance for all grant applications • NSF wants to know what is the intellectual merit of your proposed research and what broader impacts of your proposed research • Present your hypothesis – A supposition or conjecture put forth to account for known facts; esp. in the sciences, a provisional supposition from which to draw conclusions that shall be in accordance with known facts, and which serves as a starting-point for further investigation by which it may be proved or disproved and the true theory arrived at. • Enumerate 2-3 specific aims in the form of questions that test your hypothesis – At least one of these aims needs to have a strong “whole genome” component BioSci D145 lecture 4 page 2 ©copyright Bruce Blumberg 2004-2016. All rights reserved Modern DNA sequence analysis • Cycle sequencing – Virtually all commercial DNA sequencing today is done by cycle sequencing with fluorescent ddNTPs • ABI Big Dye chemistry – Template preparation still tedious for small scale • TempliPHi used in genome centers (obviated need for most automation) – Capillary sequencers predominant form of technology in use • But, next generation sequencing is already coming online and will rapidly displace old technology in genome centers. – 454 sequencing (Roche) – Solexa (Illumina) – SoLID (Applied Biosystems) • 3rd generation sequencing (individual DNA molecule) now available – e.g., Pacific Biosciences (sequence reads of 1,000-10K bases) BioSci D145 lecture 4 page 3 ©copyright Bruce Blumberg 2004-2007. All rights reserved Other sequencing technologies • Sequencing by hybridization – Construct a high-density microchip with all possible combinations of a short oligonucleotide • Up to 25-mers • By photolithography – Synthesized on chip directly – Label and hybridize fragment to be sequenced – Wash stringently – Read fluorescent spots – Reconstruct sequence by computer BioSci D145 lecture 5 page 4 ©copyright Bruce Blumberg 2004-2007. All rights reserved Other sequencing technologies (contd) • Sequencing by hybridization rarely used for de novo sequencing – Extremely fast and useful to sequence something you already know the sequence of but want to identify mutation - resequencing – Disease causing changes • e.g in mitochondrial DNA – SNP discovery – Works best for examining sequence of <10 kb BioSci D145 lecture 5 page 5 ©copyright Bruce Blumberg 2004-2007. All rights reserved Other sequencing technologies (contd) • http://www.affymetrix.com/products/arrays/index.affx • SNP discovery – Photo shows mitochondrial chip – Right panel shows pairs of normal (top) vs disease (bottom) (Leber’s Hereditary Optic Neuropathy) • Top 3 disease mutations • Bottom control with no change BioSci D145 lecture 5 page 6 ©copyright Bruce Blumberg 2004-2007. All rights reserved Other sequencing technologies – Next Generation sequencing • 2nd generation = high throughput, short sequences • 3rd generation = single molecule sequencing • Small number of sequence templates (thousands) but very long reads (~105 bp) • What is the immediate implication of this technology for genome assembly? We should now be able to completely sequence large insert clones directly and avoid fragmentation by repetitive elements! • Key review is Metzger, M.L. (2010) Sequencing technologies — the next generation, Nature Reviews Genetics 11, 31-46. BioSci D145 lecture 5 page 7 ©copyright Bruce Blumberg 2004-2007. All rights reserved 3rd generation Other sequencing technologies (contd) • Illumina (Solexa) sequencing – https://www.illumina.com/content/dam/illuminamarketing/documents/products/illumina_sequencing_introduction.pdf – Based on synthesis of complementary strand to a template (like Sanger) • Detection of elongation with labeled terminators – Steps • Library generation - fragment genome to appropriate size (depends on application) and add adapters to each end • Cluster generation – capture fragments on lawn of oligos and amplify • Sequencing – reversible terminator • Data analysis – – align reads to reference genome – Analysis of reads BioSci D145 lecture 5 page 9 ©copyright Bruce Blumberg 2004-2007. All rights reserved Other sequencing technologies (contd) • Illumina sequencing (contd) – Library preparation – fragment target and add adapters. • Can multiplex to gain additional capacity • That is, Hiseq-X can generate 1.8 Tb of data per run, but don’t need this much for most applications so use different adapters and “bar-code” samples. BioSci D145 lecture 5 page 10 ©copyright Bruce Blumberg 2004-2007. All rights reserved • Bar coding sequence analysis BioSci D145 lecture 5 page 11 ©copyright Bruce Blumberg 2004-2017. All rights reserved Other sequencing technologies (contd) • Deep sequencing – What is the point? • Can generate huge number of reads in parallel • Miniseq – 7.5 Gb (25 million reads/run 2 x 150 bp) • MiSeq – 15 gb (15 million reads/run 2 x 300 bp) • NextSeq – 120 Gb (400 million reads/run 2 x 150 bp) • HiSeq – 1.5 Tb (5 billion/run 2 x 150 bp) • HiseqX – 1.8 Tb (6 billion/run 2 x 150 bp) • What is massively parallel sequencing good for? – – – – – – – Rapid sequencing of genomes, or resequencing of known sequences Ancient DNA (even dinosaurs? – Svante Pääbo says ~200K years is limit) ChIP-sequencing (week 6) Sequencing ESTs or other tags Determining microbial diversity in field samples Transcriptome sequencing Identifying variations in • Viral populations • Gene sequences in mixed populations BioSci D145 lecture 5 page 12 ©copyright Bruce Blumberg 2004-2007. All rights reserved Amplicon sequencing • Idea is to sequence many copies of the same thing – Gene sequence – mRNA transcript BioSci D145 lecture 5 page 13 ©copyright Bruce Blumberg 2004-2007. All rights reserved Amplicon sequencing (contd) • What is amplicon sequencing good for? – Discovery of rare somatic mutations in complex samples (e.g., cancerous tumors - mixed with germline DNA) based on ultra-deep sequencing of amplicons – Sequencing collections of exons from populations of individuals to identify diversity – Sequencing collections of human exons from populations of individuals for the identification of rare alleles associated with disease – Analysis of viral quasispecies present within infected populations in the context of epidemiological studies – Evolutionary biology in populations BioSci D145 lecture 5 page 14 ©copyright Bruce Blumberg 2004-2007. All rights reserved The human genome • In Feb 12 2001, Celera and Human Genome project published “draft” human genome sequencs – Celera -> 39114 – Ensembl -> 29691 – Consensus from all sources ~30K • Number of genes – C. elegans – 19,000 – Arabidopsis - 25,000 • Predictions had been from 50-140k human genes – What’s up with that? – Are we only slightly more complicated than a weed? – How can we possibly get a human with less than 2x the number of genes as C. elegans – Implications? • UNRAVELING THE DNA MYTH: The spurious foundation of genetic engineering, Barry Commoner, Harpers Magazine Feb, 2002 BioSci D145 lecture 4 page 15 ©copyright Bruce Blumberg 2004-2016. All rights reserved The human genome • The answer – Gene sets don’t overlap completely (duh) – Floor is 42K – 130029build #236 UniGene Clusters (from EST and mRNA sequencing) – http://www.ncbi.nlm.nih.gov/unigene – Up from 123,459 in 2013 (85,793, 105,680, 128,826, 123,891 previous years) (“final” count • Important questions to be answered about what constitutes a “gene” – Crick genes? DNA-RNA-protein – How about RNAs? – miRNAs? – Antisense transcripts? – lncRNAs? BioSci D145 lecture 4 page 16 ©copyright = 42113 Bruce Blumberg 2004-2016. All rights reserved Genome sequencing(contd) – Whole genome shotgun sequencing (Celera) • premise is that rapid generation of draft sequence is valuable • why bother trying to clone and sequence difficult regions? – Basically just forget regions of repetitive DNA - not cost effective • using this approach, genomes rarely are completely finished – rule of thumb is that it takes at least as long to finish the last 5% as it took to get the first 95% • problems – sequence may never be complete as is C. elegans – much redundant sequence with many sparse regions and lots of gaps. – Fragment assembly for regions of highly repetitive DNA is dubious at best – “Finished” fly and human genomes lack more than a few already characterized genes BioSci D145 lecture 4 page 17 ©copyright Bruce Blumberg 2004-2016. All rights reserved Genome sequencing (contd) • Knowing what we know now – how to approach a large new genome? – Xenopus tropicalis 1.7 Gb (about ½ human) – BAC end sequencing – Whole genome shotgun – HAPPY mapping and radiation hybrid mapping to order scaffolds – Gaps closed with BACS – 8.5 x coverage (but > 9000 scaffolds for 18 chromosomes) – Finishing now in process • But how “finished” will it be? • 2016 update – now version 9.0 – FINALLY integrated BAC end sequences – Integrated genetic map – 50% of contigs > 72 kb – Xenopus laevis – v9.1 – • >90% of genome in chromosomal scaffolds • 2 “subgenomes” fully characterized. • annotation remains a big challenge. BioSci D145 lecture 4 page 18 ©copyright Bruce Blumberg 2004-2016. All rights reserved Functional Genomics - Analysis of gene function on a whole genome basis • Genome projects – DNA sequencing – Human genome, mouse, rat, Drosophila, C. elegans “finished” – model organisms progressing rapidly – Lots of new genes, but many lack known function • Functional genomics – Identification of gene functions • associate functions with new genes coming from genome projects • function of genes identified from characterizing diseases or mutants – Identification of genes by their function • discovery of new genes BioSci D145 lecture 4 page 19 ©copyright Bruce Blumberg 2004-2016. All rights reserved *Methods of profiling gene expression – large scale to whole genome • What are the possibilities – Array – micro or macro – Sequence sampling (EST generation) – SAGE – serial analysis of gene expression – Massively parallel signature sequencing (RNA-seq, Illumina, 454) • DNA microarray analysis was, until now totally dominant method – Two basic flavors • Spotted (spot DNA onto support) – cDNA microarrays – Oligonucleotide arrays – Moderately expensive • Synthesized (use photolithography to synthesize oligos onto silicon or other suitable support – Affymetrix Gene Chips dominate – VERY expensive – Both are in wide use and suitable for whole genome analysis BioSci D145 lecture 4 page 20 ©copyright Bruce Blumberg 2004-2016. All rights reserved Spotted arrays • Source material is prepared – cDNAs are PCR amplified OR – Oligonucleotides synthesized • Spotted onto treated glass slides • RNA prepared from 2 sources – Test and control • Labeled probes prepared from RNAs – Incorporate label directly – Or incorporate modified NTP and label later – Or chemically label mRNA directly • Hybridize, wash, scan slide • Express as ratio of one channel to other after processing BioSci D145 lecture 4 page 21 ©copyright Bruce Blumberg 2004-2016. All rights reserved DNA microarray types • Stanford type microarrayer – http://cmgm.stanford .edu/pbrown/mguide/ index.html • Printing method – Reminiscent of fountain pen BioSci D145 lecture 4 page 22 ©copyright Bruce Blumberg 2004-2016. All rights reserved Strategy to identify RAR target genes Agonist - TTNPB Antgonist - AGN193109 Harvest st 18 Poly A+ RNA Poly A+ RNA Amino-allyl labeled 1st strand cDNA Amino-allyl labeled 1st strand cDNA Alexa Fluor 555 (cy3) Alexa Fluor 647 (cy5) Alexa Fluor 555 (cy3) Alexa Fluor 647 (cy5) Probe microarrays upregulated BioSci D145 lecture 4 page 23 ©copyright downregulated Bruce Blumberg 2004-2016. All rights reserved DNA microarray • Statistical analysis of output – VERY IMPORTANT! • Replicates are very important • Preprocessing of data is needed – To remove spurious signals BioSci D145 lecture 4 page 24 ©copyright Bruce Blumberg 2004-2016. All rights reserved DNA microarray • Advantages – Custom arrays possible and affordable – Ratio of fluorescence is robust and reproducible • Disadvantages – Availability of chips – Expense of production on your own – Technical details in preparation BioSci D145 lecture 4 page 25 ©copyright Bruce Blumberg 2004-2016. All rights reserved Affymetrix GeneChips • High density arrays are synthesized directly on support – 4 masks required per cycle -> 100 masks per chip (25-mers) – Pentium IV requires about 30 masks – G.P. Li in Engineering directs a UCI facility that can make just about anything using photolithography BioSci D145 lecture 4 page 26 ©copyright Bruce Blumberg 2004-2016. All rights reserved Affymetrix GeneChips Streptavidin/phycoerythrin BioSci D145 lecture 4 page 27 ©copyright Bruce Blumberg 2004-2016. All rights reserved Affymetrix GeneChips – Each gene is represented by a series of oligonucleotide pairs • One perfect match • One with a single mismatch – Only hybridization to perfect match but not mismatch is considered to be real – Gene is considered “detected” if > ½ of oligo pairs are positive – Number of pairs depends on organism and how well characterized array behavior is • Human uses 8 pairs • Xenopus uses 16 pairs BioSci D145 lecture 4 page 28 ©copyright Bruce Blumberg 2004-2016. All rights reserved Affymetrix GeneChips • Result is in single color – Always need two chips – control and experimental for each condition – Also need replicates for each condition – For diverse biological samples (e.g., humans) 10 replicates required! – For less diverse samples (cell lines) probably 5 replicates needed • Advantages – Commercially available – Standardized • Disadvantages – About $700 to buy, probe and process each chip (at UCI)! • About $500 elsewhere – May not be available for your organism of interest – No ability to compare probes directly on the same chip • Must rely on technology BioSci D145 lecture 4 page 29 ©copyright Bruce Blumberg 2004-2016. All rights reserved DNA microarrays • What are they good for? – Identifying genes expressed in one condition vs. another • One tissue vs. another (heart vs liver) • Tissue vs. tumor (liver vs. hepatocarcinoma) • In response to a treatment (e.g., RA) • In response to disease (e.g., after viral infection) – Building expression profiles • Tissues • Cancers • Developmental stages • Expressed genes – Identifying organisms in food • Array can identify which animals are present in a mix • http://www.dnavision.com/files/FOODIDBrosh%20En.pdf BioSci D145 lecture 4 page 30 ©copyright Bruce Blumberg 2004-2016. All rights reserved DNA microarrays • What are they good for? (contd) – Response of animal to drugs or chemicals • Toxicogenomics • Pharmacogenomics – Diagnostics • SNP analysis to identify disease loci • Specific testing for known diseases BioSci D145 lecture 4 page 31 ©copyright Bruce Blumberg 2004-2016. All rights reserved DNA microarrays • What are the limitations of microarray technology? What sorts of factors might confound the experiment? – Signal intensity (or signal/noise) • Improved dyes, label uniformly – Biological variation (samples are inherently different) • Sufficient # of replicates is key • keep individuals separate – Not all mRNAs will be present at sufficient levels to detect • Amplification, but beware of bias – Good statistical analysis is required • Bayesian statistics are best (Pierre Baldi is local expert) – calculating the probability of a new event on the basis of earlier probability estimates which have been derived from empiric data – i.e., don’t assume random distribution in datasets, calculate probability based on real data – Bayesian approach great for small number of replicates, converges on t-test at high number of replicates • http://cybert.microarray.ics.uci.edu/ BioSci D145 lecture 4 page 32 ©copyright Bruce Blumberg 2004-2016. All rights reserved Other methods of transcriptome analysis - parallel • Microarray was once the dominant method – Direct RNA sequencing methods are rapidly displacing microarrays – SAGE (serial analysis of gene expression) • Nanostring is modern implementation • Short sequences – RNAseq • Directly sequence large numbers of RNAs • Longer sequences • SAGE – Relies on generating many very short sequences and matching these to the genome – 10 bp = short SAGE – 17 bp = “long” SAGE BioSci D145 lecture 4 page 33 ©copyright Bruce Blumberg 2004-2016. All rights reserved Other methods of transcriptome analysis - parallel • SAGE (continued) – What is the obvious shortcoming of this method? – Sequences may not be unique and could have difficulty mapping to the genome BioSci D145 lecture 4 page 34 ©copyright Bruce Blumberg 2004-2016. All rights reserved Other methods of transcriptome analysis - parallel • RNA seq – Ali Mortazavi is local expert – Use of massively parallel sequencing allows precise quantitation of transcript – Also allows discovery of rare splice forms – Discovery of unexpected transcripts – Main problem is in mapping sequence calls to genome • Sequencing has 1-2% errors which can make mapping to genome fail • or induce “in silico cross-hybridization” – Mapping to incorrect genomic location BioSci D145 lecture 4 page 35 ©copyright Bruce Blumberg 2004-2016. All rights reserved Microarray vs. RNAseq • RNAseq – No assumption re transcripts but need genome sequence • Microarray – Assumes you know all the transcripts – Any sequence you did not know was expressed will not be there. • except whole genome tiling arrays – Kapranov paper – Can discover novel sequences or new splice forms not yet characterized (if you have genome) – Detection limit issues • Signal-noise ratio – Detection limits are not a problem – can detect small # – Well validated , expression analysis can be quantitative – Getting better, expression analysis can be quantitative BioSci D145 lecture 4 page 36 ©copyright Bruce Blumberg 2004-2016. All rights reserved