BioSci D145 Lecture #4
• Bruce Blumberg ([email protected])
  – 4103 Nat Sci 2 - office hours Tu, Th 3:30-5:00 (or by appointment)
  – phone 824-8573
• TA – Bassem Shoucri ([email protected])
  – 4351 Nat Sci 2, 824-6873, 3116
  – Office hours M 2-4
• Lectures will be posted on web pages after lecture
  – http://blumberg.bio.uci.edu/biod145-w2015
  – http://blumberg-lab.bio.uci.edu/biod145-w2015
  – Last year’s midterm is now posted. It is useful to work through the problems to see what sort of questions I ask and how best to study.
  – Term paper outlines due Thursday (1/29) by midnight. Please upload to the drop box (1st choice) or e-mail to me (2nd choice).
BioSci D145 lecture 1 page 1 ©copyright Bruce Blumberg 2010. All rights reserved

Term paper outline
• Title of your proposal
• A paragraph introducing your topic and explaining why it is important; i.e., what impact will the knowledge gained have?
  – Why should any funding agency give you money to pursue this research?
    • NIH now requires a statement of human health relevance for all grant applications
    • NSF wants to know the intellectual merit of your proposed research and its broader impacts
• Present your hypothesis
  – A supposition or conjecture put forth to account for known facts; esp. in the sciences, a provisional supposition from which to draw conclusions that shall be in accordance with known facts, and which serves as a starting-point for further investigation by which it may be proved or disproved and the true theory arrived at.
• Enumerate 2-3 specific aims in the form of questions that test your hypothesis
  – At least one of these aims needs to have a strong “whole genome” component

Genome sequencing
• The problem
  – Genome sizes for most eukaryotes are large (10^8-10^9 bp)
  – High quality sequences only about 600-800 bp per run
• The solution
  – Break genome into lots of bits and sequence them all
  – Reassemble with computer
• The benefit
  – Rapid increase in information about genome size, gene comparisons, etc.
• The cost
  – 3 x 10^9 bp (human haploid genome) ÷ 600 bp/reaction = 5 x 10^6 reactions for 1x coverage!
  – Need both strands (x2), need overlaps and need to be sure of sequences
  – ~10^7-10^8 reactions/runs required for a human-sized genome
  – About $1-2 per reaction these days, ~$8 commercially.

Genome sequencing (contd)
• Shotgun sequencing (contd)
  – How to minimize sequence redundancy?
    • Best way to minimize redundancy is to map before you start
      – C. elegans was done this way - when the sequence was finished, it was FINISHED
        » mapping took almost 10 years
      – mapping much too tedious and nonprofitable for Celera
        » who cares about redundancy, let’s sequence and make $$
        » There is scientific value to draft genomes, too.
    • Why does redundancy matter?
      – Finished sequence today costs about $0.50/base
      – Note that at 10x, 99.995% coverage leaves at least 150 kb of the human genome unsequenced

Genome sequencing (contd)
  – Mapping by hybridization
  – Mapping by fingerprinting

Traditional (map first) vs STC (map as you go along) mapping
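
The reaction count on the cost slide and the 10x coverage figure on the shotgun slide can be checked with a few lines of arithmetic. The sketch below is only illustrative: the function names are made up for this example, and the uncovered fraction uses the Poisson (Lander-Waterman) approximation e^(-coverage), which is an assumption and is not named on the slides.

```python
import math

def reactions_for_coverage(genome_bp, read_bp, coverage=1.0):
    """Number of sequencing reactions needed for a given average coverage."""
    return math.ceil(genome_bp * coverage / read_bp)

def uncovered_fraction(coverage):
    """Poisson estimate (assumed model) of the fraction of bases never sampled."""
    return math.exp(-coverage)

GENOME_BP = 3_000_000_000  # human haploid genome, from the slide
READ_BP = 600              # bp per reaction, from the slide

# Reactions for 1x and 10x average coverage
print(reactions_for_coverage(GENOME_BP, READ_BP))
print(reactions_for_coverage(GENOME_BP, READ_BP, coverage=10))

# Bases expected to remain unsequenced at 10x under the Poisson assumption
print(round(uncovered_fraction(10) * GENOME_BP))
```

Under this assumption the uncovered fraction at 10x is roughly 0.0045%, on the order of 10^5 bp for a human-sized genome, which is the same order of magnitude as the 150 kb quoted on the slide.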

The human genome
• On Feb 12, 2001, Celera and the Human Genome Project published “draft” human genome sequences
• Number of genes
  – Celera -> 39,114
  – Ensembl -> 29,691
  – Consensus from all sources ~30K
  – C. elegans – 19,000
  – Arabidopsis – 25,000
• Predictions had been from 50-140K human genes
  – What’s up with that?
  – Are we only slightly more complicated than a weed?
  – How can we possibly get a human with less than 2x the number of genes as C. elegans?
  – Implications?
• UNRAVELING THE DNA MYTH: The spurious foundation of genetic engineering, Barry Commoner, Harper’s Magazine, Feb 2002

The human genome
• The answer
  – Gene sets don’t overlap completely (duh)
  – Floor is 42K (= 42,113)
  – 130,056 UniGene clusters in build #236 (from EST and mRNA sequencing)
  – http://www.ncbi.nlm.nih.gov/unigene
  – Up from 123,459 in 2013 (85,793, 105,680, 128,826, 123,891 previous years)
• Important questions to be answered about what constitutes a “gene”
  – Crick genes? DNA-RNA-protein
  – How about RNAs?
  – miRNAs?
  – Antisense transcripts?

Genome sequencing (contd)
  – Whole genome shotgun sequencing (Celera)
    • premise is that rapid generation of draft sequence is valuable
    • why bother trying to clone and sequence difficult regions?
      – Basically just forget regions of repetitive DNA - not cost effective
    • using this approach, genomes are rarely completely finished
      – rule of thumb is that it takes at least as long to finish the last 5% as it took to get the first 95%
    • problems
      – sequence may never be as complete as that of C. elegans
      – much redundant sequence with many sparse regions and lots of gaps
      – Fragment assembly for regions of highly repetitive DNA is dubious at best
      – “Finished” fly and human genomes lack more than a few already characterized genes

The human genome
• How finished is the human genome sequence?
  – Draft sequence to high coverage
  – Chromosome by chromosome finishing now
    • Chr 22 – 1999
    • Chr 21 – 2000
    • Chr 20 – 2001
    • Chr 15 – 2003
    • Chr 6, 7, Y – 2003
    • Chr 13, 19 – 2004
    • May 2006 – all finished

Genome sequencing (contd)
• Knowing what we know now, how should we approach a large new genome?
  – Xenopus tropicalis 1.7 Gb (about ½ human)
  – BAC end sequencing
  – Whole genome shotgun
  – HAPPY mapping and radiation hybrid mapping to order scaffolds
  – Gaps closed with BACs
  – 8.5x coverage (but > 9000 scaffolds for 18 chromosomes)
  – Finishing now in process
    • But how “finished” will it be? We need to wait and see
• 2011 update
  – Xenopus laevis – 454 sequencing to 4x and de novo assembly
• 2015 update – now version 8.0
  – 10x coverage
  – FINALLY integrated BAC end sequences
  – Integrated genetic map
  – 50% of contigs > 72 kb

Other sequencing technologies
• Sequencing by hybridization
  – Construct a high-density microchip with all possible combinations of a short oligonucleotide
    • Up to 25-mers
    • By photolithography
      – Synthesized on chip directly
  – Label and hybridize fragment to be sequenced
  – Wash stringently
  – Read fluorescent spots
  – Reconstruct sequence by computer

Other sequencing technologies (contd)
• Sequencing by hybridization is rarely used for de novo sequencing
  – Extremely fast and useful for resequencing: sequencing something whose sequence you already know in order to identify mutations
  – Disease causing changes
    • e.g., in mitochondrial DNA
  – SNP discovery
  – Works best for examining sequences of <10 kb
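
The "reconstruct sequence by computer" step of sequencing by hybridization can be sketched as chaining overlapping k-mers. This is a toy illustration, not the algorithm used by any real chip software: it assumes an error-free hybridization readout and a target with no repeated (k-1)-mers, and all function names are invented for the example. The probe count also shows why the array grows as 4^k.

```python
from itertools import product

def all_probes(k):
    """Every possible k-mer probe on the idealized chip (4**k of them)."""
    return ["".join(p) for p in product("ACGT", repeat=k)]

def spectrum(target, k):
    """Idealized hybridization result: the set of k-mers present in the target."""
    return {target[i:i + k] for i in range(len(target) - k + 1)}

def reconstruct(kmers):
    """Rebuild the target by chaining k-mers that overlap by k-1 bases.

    Assumes k >= 2, an error-free spectrum, and unique (k-1)-mer overlaps.
    Repeats break this assumption and make the chain ambiguous.
    """
    kmers = set(kmers)
    suffixes = {m[1:] for m in kmers}
    # The starting k-mer is the only one whose prefix is not another k-mer's suffix.
    starts = [m for m in kmers if m[:-1] not in suffixes]
    if len(starts) != 1:
        raise ValueError("ambiguous or circular spectrum")
    seq = starts[0]
    k = len(seq)
    remaining = kmers - {seq}
    while remaining:
        overlap = seq[-(k - 1):]
        nxt = [m for m in remaining if m[:-1] == overlap]
        if len(nxt) != 1:
            raise ValueError("ambiguous overlap; cannot reconstruct uniquely")
        seq += nxt[0][-1]
        remaining.remove(nxt[0])
    return seq

target = "ATGCGTAC"          # hypothetical fragment
hits = spectrum(target, 3)   # probes that "light up"
print(len(all_probes(3)), sorted(hits))
print(reconstruct(hits))
```

When a (k-1)-mer occurs more than once, `reconstruct` raises an error instead of guessing, which mirrors why the method suits resequencing of short known targets better than de novo sequencing.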

Other sequencing technologies (contd)
• http://www.affymetrix.com/products/arrays/index.affx
• SNP discovery
  – Photo shows mitochondrial chip
  – Right panel shows pairs of normal (top) vs disease (bottom) (Leber’s Hereditary Optic Neuropathy)
    • Top 3 disease mutations
    • Bottom control with no change

Other sequencing technologies – Next Generation sequencing
• 2nd generation = high throughput, short sequences
• 3rd generation = single molecule sequencing
• Small number of sequence templates (thousands) but very long reads (~10^5 bp)
• What is the immediate implication of this technology for genome assembly? We should now be able to completely sequence large insert clones directly and avoid fragmentation by repetitive elements!
• Key review is Metzker, M.L. (2010) Sequencing technologies - the next generation, Nature Reviews Genetics 11, 31-46.

Other sequencing technologies (contd)
• Pyrosequencing
  – http://www.454.com
  – Based on synthesis of complementary strand to a template (like Sanger)
  – Detection of elongation with chemiluminescence
    • Fragment genome to appropriate size (depends on application)
    • Add adapters to each end
    • Isolate those with different adapters on each end
    • PCR to amplify

Other sequencing technologies (contd)
• Pyrosequencing (contd)
  – PCR – capture template on micro beads such that each bead gets 1 molecule of DNA – how? Use a large ratio of beads to DNA
  – Emulsify in water/oil microreactors
  – Amplify DNA
  – Break and recover DNA containing beads
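
Why a large ratio of beads to DNA gives one template per bead can be illustrated with a short calculation. The sketch below assumes that templates are captured on beads at random, so the number per bead follows a Poisson distribution; that model and the function names are assumptions for illustration and are not taken from the slides.

```python
import math

def bead_loading(dna_per_bead):
    """Poisson fractions of beads carrying 0, exactly 1, or 2+ template molecules."""
    lam = dna_per_bead
    p0 = math.exp(-lam)
    p1 = lam * p0
    return {"empty": p0, "single": p1, "multiple": 1.0 - p0 - p1}

def clonal_fraction(dna_per_bead):
    """Among beads that captured any DNA, the fraction holding exactly one molecule."""
    f = bead_loading(dna_per_bead)
    return f["single"] / (f["single"] + f["multiple"])

# Compare equal numbers of beads and templates with 10-fold and 100-fold bead excess
for ratio in (1.0, 0.1, 0.01):
    print(ratio, round(clonal_fraction(ratio), 3))
```

Under this assumption, most beads stay empty when beads are in large excess, but nearly every bead that does carry DNA carries a single template, which is what a clean read from each well requires.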

Other sequencing technologies (contd)
• Pyrosequencing (contd)
  – Sequencing – load beads into picotiter wells
    • Add enzymes (sulfurylase and luciferase)
    • Run reaction – flow each nucleotide/buffer solution across the wells, one nucleotide at a time
    • Complementary nucleotide addition leads to light output
      – light output is proportional to # consecutive nucleotides

Other sequencing technologies (contd)
• Pyrosequencing (contd)
  – What is the point?
    • Can generate 400,000 reads in parallel (FLX)
    • Or > 1,000,000 (FLX Titanium)
    • Each read is 200-400 bp (FLX), or 400-600 bp (FLX Titanium)
    • So you can get
      – 8 x 10^7 bp per run! (FLX)
      – 4-6 x 10^8 bp/run (FLX Titanium)
• What is massively parallel sequencing good for?
  – Rapid sequencing of genomes, or resequencing of known sequences
  – Ancient DNA (even dinosaurs? – Svante Pääbo says ~200K years is limit)
  – ChIP-sequencing (week 6)
  – Sequencing ESTs or other tags
  – Determining microbial diversity in field samples
  – Transcriptome sequencing
  – Identifying variations in
    • Viral populations
    • Gene sequences in mixed populations

Amplicon sequencing
• Idea is to sequence many copies of the same thing
  – Gene sequence
  – mRNA transcript

Amplicon sequencing (contd)
• What is amplicon sequencing good for?
  – Discovery of rare somatic mutations in complex samples (e.g., cancerous tumors mixed with germline DNA) based on ultra-deep sequencing of amplicons
  – Sequencing collections of exons from populations of individuals to identify diversity
  – Sequencing collections of human exons from populations of individuals for the identification of rare alleles associated with disease
  – Analysis of viral quasispecies present within infected populations in the context of epidemiological studies
  – Evolutionary biology in populations

Comparative genomics
• Study of similarities and differences between genome structure and organization
  – How many genes? Chromosomes?
  – Genome duplications
  – Gene loss
• Driving forces
  – Understanding evolution in molecular terms
  – Sequence annotation and function identification
    • Sequences with important functions tend to be conserved across evolution
• Orthology vs paralogy
  – Homolog – descended from a common ancestor (Hox genes)
  – Orthologs – homologous genes in different organisms that encode proteins with the same function and which have evolved by direct vertical descent (frog and human Hoxa-1)
  – Paralogs – homologous genes that encode proteins with related but non-identical functions (Hoxa-1, Hoxb-1, Hoxd-1)
    • Derived by gene duplication

Comparative genomics (contd)
• Functional equivalency does not require homology, sequence similarity or even similar 3D structure
  – Same chemical reaction can be catalyzed by totally unrelated enzymes
  – Non-orthologous gene displacement – when non-orthologous genes encode the same essential cellular function
    • Better term would be analogous gene
    • Convergent evolution also sometimes used

Comparative genomics (contd)
• Genes with very different functions can be related
  – 3-D structure may indicate that proteins are related (evolved from the same ancestral protein) but sequence identity too low to detect
    • Expected when genes diverge from a distant common ancestor
    • < 20% amino acid sequence identity is too little to establish homology (although proteins may be homologous)
  – For example
    • 3-D structures of
      – D-alanine ligase
      – Glutathione synthetase
      – ATP-binding domains of
        » Carbamoyl phosphate synthetase
        » Succinyl-CoA synthetase
    • Are all so similar in 3D structure that homology is not in doubt but sequence comparisons do not detect homology
• Why should we care whether genes are related or not? Essential for understanding how evolution works at the molecular level

Comparative genomics (contd)
• Protein evolution
  – Observation – many proteins composed of discrete domains
  – Observation – many proteins have multiple domains shared with other proteins
  – Conclusion – domain shuffling must have occurred during evolution
  – Some correlation between exons and protein domains
    • Protein domains tend to be encoded in one or two exons
    • New combinations of protein domains can be created by recombination
      – LINEs
      – Between repetitive elements in introns
    • Exon shuffling – process of transferring exons (and hence functional domains) between proteins

Comparative genomics (contd)
• Protein evolution (contd)
  – Haemostatic (aka blood clotting) proteins as an exon shuffling paradigm
    • Family of proteases that are activated by proteolysis
    • Protein domains show strong correlation with exons
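
The < 20% identity threshold mentioned for distantly related proteins refers to percent identity over an alignment. A minimal sketch of that calculation is shown below. It assumes the two sequences have already been aligned (gaps written as "-") and counts identity over gap-free columns only; other denominators are in common use, so treat the convention, the function name and the toy sequences as assumptions.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identical residues over the gap-free columns of a pairwise alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns where neither sequence has a gap
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != gap and b != gap]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)

# Hypothetical aligned peptide fragments, for illustration only
print(round(percent_identity("MKT-AIL", "MRTGAVL"), 1))
```

A low value from this kind of comparison does not rule out homology, which is the point of the structural examples on the slide.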

Comparative genomics (contd)
• Protein evolution (contd)
  – What is horizontal gene transfer? Transfer of genes or protein domains across unrelated species
    • Frequently identifiable by different patterns of codon usage from other genes, particularly ribosomal proteins
    • Fairly rare with eukaryotes
    • Happens in prokaryotes all the time
      – Examples?
      – e.g., transfer of antibiotic resistance among bacteria
      – Plasmid exchange, phage infections and transfer
      – Often associated with pathogenicity
        » Pathogenic variants of bacteria frequently have lots of inserted DNA
        » e.g., E. coli O157:H7 has 800 kb more than lab strains of E. coli, much of which is virulence factors, prophages and prophage-like elements
  – What does this suggest about the nature of virulence? Virulence is acquired, i.e., transferred from one organism to another

Comparative genomics (contd)
• Is there a minimal genome? How would you define “minimal genome”?
  – Encoding the essential set of proteins required for life?
  – Compare genomes of archaebacteria, eubacteria and yeast
    • Issues with how genes are classified but a reasonably good approximation can be made
    • Can identify 322 clusters of orthologous groups required for all key biosynthetic pathways that might be required in free-living organisms
      – But remember about non-orthologous gene displacements!
• Some lessons from bacterial genomics
  – Nearly half of ORFs are of unknown function
  – About 25% of all ORFs are unique to a particular species!
    • Suggests that many new protein families remain to be discovered
    • Many new functions may be uncovered
  – Periodic re-evaluation of sequenced genomes is useful
    • Compare with newly acquired data
      – Often find additional ORFs and genes
  – Much conservation of gene position
    • Same genes found in many genomes at same positions (good for evolutionary studies)

Comparative genomics (contd)
• What do we get from comparative genomics?
  – Powerful new tools to identify conserved sequences
    • Important regulatory elements
    • Unidentified genes
    • Features (promoters, splice sites, etc.)
  – Important information about genome evolution
    • Where did related genes originate?
    • When did genome duplications arise?
    • What is the history of life on earth?
      – And by implication, life elsewhere
    • What is the genetic diversity in wild populations?
      – Environmental shotgun sequencing
  – Information required to identify gene function
    • Protein sequence and structure comparisons

Construction of cDNA libraries
• What is a cDNA library?
  – Collection of DNA copies representing the expressed mRNA population of a cell, tissue, organ or embryo
• What are they good for?
  – Identifying and isolating expressed mRNAs
  – functional identification of gene products
  – cataloging expression patterns for a particular tissue
    • EST sequencing and microarray analysis
  – Mapping gene boundaries
    • Promoters
    • Alternative splicing

Determinants of library quality
• What constitutes a full-length cDNA?
  – Strictly, it is an exact copy of the mRNA
  – full-length protein coding sequence considered acceptable for most purposes
• mRNA
  – full-length, capped mRNAs are critical to making full-length libraries
  – cytoplasmic mRNAs are best – WHY?
    They are processed, i.e., introns are removed and poly A is added
• 1st strand synthesis
  – complete first strand needs to be synthesized
  – issues about enzymes
• 2nd strand synthesis
  – thought to be less difficult than 1st strand (probably not)
• choice of vector
  – plasmids are best for EST sequencing and functional analysis
  – phages are best for manual screening

cDNA synthesis
• Scheme
  – mRNA is isolated from source of interest
  – 1-10 μg are denatured and annealed to primer containing d(T)nV
    • To minimize length of poly A tail in libraries for sequencing
  – reverse transcriptase copies mRNA into cDNA
  – DNA polymerase I and RNase H convert remaining mRNA into DNA
  – cDNA is rendered blunt ended
  – linkers or adapters are added for cloning
  – cDNA is ligated into a suitable vector
  – vector is introduced into bacteria
• Caveats
  – there is lots of bad information out there
    • much is derived from vendors who want to increase sales of their enzymes or kits
  – not all manufacturers make enzymes of equal quality
  – most kits are optimized for speed at the expense of quality
  – small points can make a big difference in the final outcome
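
The d(T)nV primer in the scheme above uses the IUPAC code V (A, C or G), so its 3' base must pair with the last non-A base before the poly A tail. That is why it anchors at the start of the tail instead of priming anywhere along it. The sketch below only illustrates that base-pairing logic; the function names and the example mRNA are hypothetical.

```python
IUPAC_V = "ACG"  # V = A, C or G (any base except T)

def anchored_primers(n):
    """The three concrete primers represented by d(T)nV for a given n."""
    return ["T" * n + v for v in IUPAC_V]

def matching_primer(mrna, n):
    """Return the d(T)nV primer that anneals at the body/poly(A) junction.

    The primer's T run pairs with the first n A's of the tail and its 3' base
    pairs with the last non-A base of the message. Returns None if the tail
    is shorter than n or the sequence is all A's.
    """
    body = mrna.rstrip("A")
    tail_len = len(mrna) - len(body)
    if not body or tail_len < n:
        return None
    complement = {"C": "G", "G": "C", "U": "A", "T": "A"}
    return "T" * n + complement[body[-1]]

mrna = "GGCUAGC" + "A" * 30   # hypothetical 3' end of an mRNA
print(anchored_primers(18))
print(matching_primer(mrna, 18))
```

Because the anchored primer starts cDNA synthesis at the junction, only about n bases of poly A end up in the clone, consistent with the stated goal of minimizing poly A length in libraries meant for sequencing.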