BioSci D145 Lecture #4
• Bruce Blumberg
– 4103 Nat Sci 2 - office hours Tu, Th 3:30-5:00 (or by appointment)
– phone 824-8573
• TA – Bassem Shoucri
– 4351 Nat Sci 2, 824-6873, 3116 Office hours M 2-4
• lectures will be posted on web pages after lecture
– http://blumberg.bio.uci.edu/biod145-w2015
– http://blumberg-lab.bio.uci.edu/biod145-w2015
– Last year’s midterm is now posted. It is useful to work through the problems
to see what sort of questions I ask and how best to study.
– Term paper outlines due Thursday (1/29) by midnight. Please upload to the
drop box (1st choice) or e-mail to me (2nd choice).
BioSci D145 lecture 1
page 1
©copyright
Bruce Blumberg 2010. All rights reserved
Term paper outline
• Title of your proposal
• A paragraph introducing your topic and explaining why it is important; i.e.,
what impact will the knowledge gained have.
– Why should any funding agency give you money to pursue this research?
• NIH now requires a statement of human health relevance for all grant
applications
• NSF wants to know the intellectual merit of your proposed
research and its broader impacts
• Present your hypothesis
– A supposition or conjecture put forth to account for known facts; esp. in
the sciences, a provisional supposition from which to draw conclusions
that shall be in accordance with known facts, and which serves as a
starting-point for further investigation by which it may be proved or
disproved and the true theory arrived at.
• Enumerate 2-3 specific aims in the form of questions that test your
hypothesis
– At least one of these aims needs to have a strong “whole genome”
component
Genome sequencing
• The problem
– Genome sizes for most eukaryotes are large (10^8-10^9 bp)
– High quality sequences only about 600-800 bp per run
• The solution
– Break genome into lots of bits and sequence them all
– Reassemble with computer
• The benefit
– Rapid increase in information about genome size, gene comparisons, etc.
• The cost
– 3 x 10^9 bp (human haploid genome) ÷ 600 bp/reaction = 5 x 10^6 reactions
for 1x coverage!
– Need both strands (x2), need overlaps and need to be sure of sequences
– ~10^7-10^8 reactions/runs required for a human-sized genome
– About $1-2 per reaction these days, ~$8 commercially.
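The cost arithmetic above can be reproduced with a short sketch. The genome size, read length and per-reaction prices are the slide's numbers; the 10x fold coverage is an assumed illustrative value, chosen because it lands inside the slide's 10^7-10^8 range.

```python
# Back-of-the-envelope estimate of Sanger reactions needed for a genome.
# Genome size, read length and prices come from the slide; the 10x coverage
# used below is an assumption for illustration only.

def reactions_needed(genome_bp, read_bp, fold_coverage=1, both_strands=False):
    """Return the number of sequencing reactions for the requested coverage."""
    reactions = genome_bp / read_bp * fold_coverage
    if both_strands:
        reactions *= 2  # sequence each region on both strands
    return reactions

human_haploid_bp = 3e9
read_bp = 600

one_x = reactions_needed(human_haploid_bp, read_bp)
ten_x_both = reactions_needed(human_haploid_bp, read_bp,
                              fold_coverage=10, both_strands=True)

print(f"1x coverage: {one_x:.1e} reactions")
print(f"10x, both strands: {ten_x_both:.1e} reactions")
for dollars in (1, 2):
    print(f"at ${dollars}/reaction: ${ten_x_both * dollars:.1e} total")
```

Even at the low end of the price range, the reaction count rather than the chemistry dominates the budget, which is why redundancy matters.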
Genome sequencing (contd)
• Shotgun sequencing (contd)
– How to minimize sequence redundancy?
• Best way to minimize redundancy is map before you start
– C. elegans was done this way - when the sequence was finished,
it was FINISHED
» mapping took almost 10 years
– mapping much too tedious and nonprofitable for Celera
» who cares about redundancy, let’s sequence and make $$
» There is scientific value to draft genomes, too.
• why does redundancy matter?
– Finished sequence today costs about $0.50/base
– Note that at 10x, 99.995% coverage leaves at least 150 kb of the human
genome unsequenced
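One way to see where a figure like 99.995% at 10x could come from is a Poisson model of random shotgun reads, in which the chance that a base is hit by zero reads is e^-c at c-fold coverage. The slide does not say which model was used, so treat this as an assumed sketch. It gives about 136 kb missed at 10x; rounding the coverage to 99.995% (0.005% of 3 x 10^9 bp) gives the slide's 150 kb.

```python
import math

def fraction_unsequenced(fold_coverage):
    """Poisson estimate of the fraction of bases covered by zero reads.

    Assumes reads land uniformly at random, which real libraries only
    approximate (cloning bias and repeats make things worse).
    """
    return math.exp(-fold_coverage)

genome_bp = 3e9  # human haploid genome, as on the earlier slide
for c in (1, 5, 10):
    missing = fraction_unsequenced(c)
    covered_pct = 100 * (1 - missing)
    print(f"{c:>2}x: {covered_pct:.4f}% covered, "
          f"~{genome_bp * missing / 1e3:.0f} kb missed")
```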
Genome sequencing (contd)
– Mapping by hybridization
– Mapping by fingerprinting
Traditional (map first) vs STC (map as you go along) mapping
The human genome
• On Feb 12, 2001, Celera and the Human Genome Project published “draft” human
genome sequences
– Celera -> 39114
– Ensembl -> 29691
– Consensus from all sources ~30K
• Number of genes
– C. elegans – 19,000
– Arabidopsis - 25,000
• Predictions had been from 50-140k human genes
– What’s up with that?
– Are we only slightly more complicated than a weed?
– How can we possibly get a human with fewer than 2x the number of genes
of C. elegans?
– Implications?
• UNRAVELING THE DNA MYTH: The spurious foundation of genetic
engineering, Barry Commoner, Harpers Magazine Feb, 2002
The human genome
• The answer – Gene sets don’t overlap completely (duh)
– Floor is 42K
– 130,056 UniGene clusters in build #236 (from EST and mRNA sequencing)
– http://www.ncbi.nlm.nih.gov/unigene
– Up from 123,459 in 2013 (85,793, 105,680, 128,826, 123,891 previous
years)
• Important questions to be answered about what constitutes a “gene”
– Crick genes? DNA-RNA-protein
– How about RNAs?
– miRNAs?
– Antisense transcripts?
[Figure label: = 42113]
Genome sequencing (contd)
– Whole genome shotgun sequencing (Celera)
• premise is that rapid generation of draft sequence is valuable
• why bother trying to clone and sequence difficult regions?
– Basically just forget regions of repetitive DNA - not cost effective
• using this approach, genomes rarely are completely finished
– rule of thumb is that it takes at least as long to finish the last 5%
as it took to get the first 95%
• problems
– sequence may never be as complete as that of C. elegans
– much redundant sequence with many sparse regions and lots of
gaps.
– Fragment assembly for regions of highly repetitive DNA is dubious
at best
– “Finished” fly and human genomes lack more than a few already
characterized genes
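Fragment assembly, mentioned in the problems above, can be sketched as a toy greedy overlap merge. This is an illustration only and not the assembler Celera or anyone else actually used; the reads and the minimum overlap of 3 bases are made-up values. It works when overlaps are unique, and it is exactly this uniqueness that repeats longer than a read destroy.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b, or 0."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # no overlaps left: the remaining pieces are separate contigs
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads

# Hypothetical reads tiling a 15-base "genome"
print(greedy_assemble(["ATGGCGTA", "GCGTACGTT", "ACGTTAGC"]))
```

With a repeated element, several reads overlap equally well and the greedy choice can join regions that are not adjacent in the genome, which is one reason assemblies of highly repetitive DNA are treated with suspicion.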
The human genome
• How finished is the human genome sequence?
– Draft sequence to high coverage
– Chromosome by chromosome finishing now
• Chr 22 – 1999
• Chr 21 – 2000
• Chr 20 – 2001
• Chr 15 – 2003
• Chr 6, 7, Y – 2003
• Chr 13, 19 – 2004
• May 2006 – all finished
Genome sequencing (contd)
• Knowing what we know now – how to approach a large new genome?
– Xenopus tropicalis 1.7 Gb (about ½ human)
– BAC end sequencing
– Whole genome shotgun
– HAPPY mapping and radiation hybrid mapping to order scaffolds
– Gaps closed with BACs
– 8.5x coverage (but > 9000 scaffolds for 18 chromosomes)
– Finishing now in progress
• But how “finished” will it be? We need to wait and see
• 2011 – update.
– Xenopus laevis – 454 sequencing to 4x and de novo assembly
• 2015 update – now version 8.0 – 10x coverage
– FINALLY integrated BAC end sequences
– Integrated genetic map
– 50% of contigs > 72 kb
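A statistic like the last bullet is usually reported as a contig N50: the length L such that contigs of length L or longer contain at least half of the assembled bases. Whether the slide means N50 or a simple count of contigs is not stated, so this sketch shows the N50 reading as an assumption, with made-up contig lengths.

```python
def n50(contig_lengths):
    """Largest L such that contigs of length >= L hold at least half the bases."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty assembly

# Hypothetical contig lengths in kb
lengths_kb = [100, 80, 72, 40, 30, 20, 10]
print(f"N50 = {n50(lengths_kb)} kb of {sum(lengths_kb)} kb assembled")
```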
Other sequencing technologies
• Sequencing by hybridization
– Construct a high-density microchip with all possible combinations of a short
oligonucleotide
• Up to 25-mers
• By photolithography
– Synthesized on chip directly
– Label and hybridize fragment to be sequenced
– Wash stringently
– Read fluorescent spots
– Reconstruct sequence by computer
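The final step, reconstructing the sequence by computer, amounts to chaining the oligonucleotides that lit up on the chip by their overlaps. A minimal sketch, assuming an ideal chip (every k-mer in the target hybridizes, no false positives) and a known starting k-mer; the 12-base target and k = 4 are made-up values for illustration.

```python
def kmer_spectrum(seq, k):
    """All k-mers present in seq: what an ideal chip would report as lit spots."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def reconstruct(spectrum, start):
    """Extend from start while exactly one unused k-mer overlaps by k-1 bases."""
    unused = set(spectrum) - {start}
    seq = start
    k = len(start)
    while True:
        tail = seq[-(k - 1):]
        candidates = [m for m in unused if m.startswith(tail)]
        if len(candidates) != 1:
            break  # either finished (no candidates) or ambiguous (a repeat)
        seq += candidates[0][-1]
        unused.remove(candidates[0])
    return seq

target = "ATGCGTACCTGA"           # hypothetical fragment
spots = kmer_spectrum(target, 4)  # idealized hybridization result
print(reconstruct(spots, "ATGC"))
```

The loop stops as soon as two k-mers could follow the current tail, which is what happens whenever a (k-1)-mer is repeated in the target. Longer targets make such repeats more likely for short probes.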
Other sequencing technologies (contd)
• Sequencing by hybridization rarely used for de novo sequencing
– Extremely fast and useful for resequencing: checking a sequence you
already know for mutations
– Disease causing changes
• e.g., in mitochondrial DNA
– SNP discovery
– Works best for examining sequence of <10 kb
Other sequencing technologies (contd)
• http://www.affymetrix.com/products/arrays/index.affx
• SNP discovery
– Photo shows mitochondrial chip
– Right panel shows pairs of normal (top) vs disease (bottom)
(Leber’s Hereditary Optic Neuropathy)
• Top 3 disease mutations
• Bottom control with no change
Other sequencing technologies – Next Generation sequencing
• 2nd generation = high throughput, short sequences
• 3rd generation = single molecule sequencing
• Small number of sequence templates (thousands) but very long reads
(~10^5 bp)
• What is the immediate implication of this technology for genome
assembly?
We should now be able to completely sequence large insert clones
directly and avoid fragmentation by repetitive elements!
• Key review is Metzker, M.L. (2010) Sequencing technologies — the next
generation, Nature Reviews Genetics 11, 31-46.
Other sequencing technologies (contd)
• Pyrosequencing
– http://www.454.com
– Based on synthesis of complementary strand to a template (like Sanger)
– Detection of elongation with chemiluminescence
• Fragment genome to appropriate size (depends on application)
• add adapters to each end
• Isolate those with different adapters on each end
• PCR to amplify
Other sequencing technologies (contd)
• Pyrosequencing (contd)
– PCR – capture template on micro beads such that each bead gets 1
molecule of DNA – how? Use a large ratio of beads to DNA
– Emulsify in water/oil microreactors
– Amplify DNA
– Break and recover DNA containing beads
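Why a large ratio of beads to DNA gives one molecule per bead can be seen with a Poisson sketch. The slide does not give a model, so this is an assumption: if templates land on beads at random with an average of lam molecules per bead, most beads stay empty, but nearly all beads that do carry DNA carry exactly one molecule. The ratios tried below are made-up values.

```python
import math

def bead_loading(molecules_per_bead):
    """Poisson estimate of bead occupancy at a given DNA:bead ratio.

    Returns (P(empty), P(exactly one), P(two or more),
             fraction of DNA-carrying beads that carry exactly one molecule).
    """
    lam = molecules_per_bead
    p0 = math.exp(-lam)
    p1 = lam * p0
    p_multi = 1 - p0 - p1
    clonal = p1 / (1 - p0) if lam > 0 else 0.0
    return p0, p1, p_multi, clonal

for lam in (1.0, 0.3, 0.1):  # hypothetical DNA:bead ratios
    p0, p1, p_multi, clonal = bead_loading(lam)
    print(f"ratio {lam}: {p0:.1%} empty, {p1:.1%} single, "
          f"{p_multi:.1%} mixed, {clonal:.1%} of loaded beads clonal")
```

The price of clonality is wasted beads, which is why the DNA-containing beads are recovered after the emulsion is broken.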
Other sequencing technologies (contd)
• Pyrosequencing (contd)
– Sequencing – load beads into picotiter wells
• Add enzymes (sulfurylase and luciferase)
• Run reaction – flow nucleotide/buffer solution across wells one at a time
• Complementary nucleotide addition leads to light output
– light output is proportional to # consecutive nucleotides
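The flow scheme above can be simulated in a few lines. This is an idealized sketch: it assumes a noise-free signal exactly equal to the homopolymer length, a repeating T, A, C, G flow order (an assumption here), and a made-up read. In practice the signal for long homopolymers is hard to resolve, which is the characteristic error mode of this chemistry.

```python
def flowgram(read, flow_order="TACG"):
    """Simulate nucleotide flows over a well synthesizing `read`.

    Returns (nucleotide, signal) pairs, where signal is the number of
    consecutive bases incorporated during that flow (0 = no light).
    """
    if set(read) - set(flow_order):
        raise ValueError("read contains bases that are never flowed")
    flows = []
    pos = 0
    while pos < len(read):
        for base in flow_order:
            run = 0
            while pos < len(read) and read[pos] == base:
                run += 1
                pos += 1
            flows.append((base, run))
            if pos >= len(read):
                break
    return flows

def call_bases(flows):
    """Turn a flowgram back into sequence: repeat each base `signal` times."""
    return "".join(base * signal for base, signal in flows)

flows = flowgram("TTACGGGA")  # hypothetical read
print(flows)
print(call_bases(flows))
```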
Other sequencing technologies (contd)
• Pyrosequencing (contd)
– What is the point?
• Can generate 400,000 reads in parallel (FLX)
• Or > 1,000,000 (FLX Titanium)
• Each read is 200-400 bp (FLX), or 400-600 (FLX Titanium)
• So you can get
– 8 x 10^7 bp per run! (FLX)
– 4-6 x 10^8 bp/run (FLX Titanium)
• What is massively parallel sequencing good for?
– Rapid sequencing of genomes, or resequencing of known sequences
– Ancient DNA (even dinosaurs? – Svante Pääbo says ~200K years is limit)
– ChIP-sequencing (week 6)
– Sequencing ESTs or other tags
– Determining microbial diversity in field samples
– Transcriptome sequencing
– Identifying variations in
• Viral populations
• Gene sequences in mixed populations
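The per-run yields quoted above are just reads times read length. A small sketch using the slide's read counts and read-length ranges; note that the 8 x 10^7 bp figure for FLX corresponds to the low end of its read-length range.

```python
def yield_range(reads_per_run, read_len_range):
    """Return (min, max) bases per run for a range of read lengths."""
    return tuple(reads_per_run * n for n in read_len_range)

# Read counts and lengths as quoted on the slide
platforms = {
    "FLX": (400_000, (200, 400)),
    "FLX Titanium": (1_000_000, (400, 600)),
}

for name, (reads, lengths) in platforms.items():
    low, high = yield_range(reads, lengths)
    print(f"{name}: {low:.1e} to {high:.1e} bp per run")
```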
Amplicon sequencing
• Idea is to sequence many copies of the same thing
– Gene sequence
– mRNA transcript
Amplicon sequencing (contd)
• What is amplicon sequencing good for?
– Discovery of rare somatic mutations in complex samples (e.g., cancerous
tumors - mixed with germline DNA) based on ultra-deep sequencing of
amplicons
– Sequencing collections of exons from populations of individuals to
identify diversity
– Sequencing collections of human exons from populations of individuals
for the identification of rare alleles associated with disease
– Analysis of viral quasispecies present within infected populations in the
context of epidemiological studies
– Evolutionary biology in populations
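Why ultra-deep sequencing is needed for rare variants can be sketched with a binomial model: if a variant is present in a fraction f of the template molecules, the chance of seeing it in at least m of N reads rises steeply with depth. The 1% frequency, the depths and the read thresholds below are made-up values, and the model ignores sequencing errors, which in practice set the floor for how rare a variant can be called.

```python
from math import comb

def p_at_least(min_reads, depth, freq):
    """P(variant seen in >= min_reads of `depth` reads) under a binomial model."""
    p_fewer = sum(
        comb(depth, k) * freq**k * (1 - freq) ** (depth - k)
        for k in range(min_reads)
    )
    return 1 - p_fewer

freq = 0.01  # hypothetical: variant in 1% of molecules
for depth in (30, 100, 1000):
    print(f"depth {depth:>4}: P(>=1 read) = {p_at_least(1, depth, freq):.3f}, "
          f"P(>=5 reads) = {p_at_least(5, depth, freq):.3f}")
```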
Comparative genomics
• Study of similarities and differences between genome structure and
organization
– How many genes? Chromosomes?
– Genome duplications
– Gene loss
• Driving forces
– Understanding evolution in molecular terms
– Sequence annotation and function identification
• Sequences with important functions tend to be conserved across
evolution
• Orthology vs paralogy
– Homolog – descended from a common ancestor (Hox genes)
– Orthologs - homologous genes in different organisms that encode
proteins with the same function and which have evolved by
direct vertical descent (frog and human Hoxa-1)
– Paralogs - homologous genes that encode proteins with related but
non-identical functions (Hoxa-1, Hoxb-1, Hoxd-1)
• Derived by gene duplication
Comparative genomics (contd)
• Functional equivalency does not require homology, sequence similarity or
even 3D structure
– Same chemical reaction can be catalyzed by totally unrelated enzymes
– Non-orthologous gene displacement – when non-orthologous genes encode the
same essential cellular function
• Better term would be analogous gene
• Convergent evolution also sometimes used
Comparative genomics (contd)
• Genes with very different functions can be related
– 3-D structure may indicate that proteins are related (evolved from the
same ancestral protein) but sequence identity too low to detect
• Expected when genes diverge from a distant common ancestor
• < 20% amino acid sequence identity too little to establish homology
(although proteins may be homologous)
– For example
• 3-D structures of
– D-alanine ligase
– Glutathione synthetase
– ATP-binding domains of
» Carbamoyl phosphate synthetase
» Succinyl-CoA synthetase
• Are all so similar in 3D structure that homology is not in doubt but
sequence comparisons do not detect homology
• Why should we care whether genes are related or not?
Essential for understanding how evolution works at the molecular level
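The 20% identity threshold mentioned above refers to a number computed from a pairwise alignment. A minimal sketch of one common definition (identical positions divided by aligned, non-gap positions); other denominators are also in use, so percentages from different tools are not always comparable. The two aligned peptide strings are made up.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identical residues over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b)
             if a != gap and b != gap]
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)

# Hypothetical aligned fragments
print(f"{percent_identity('MKTA-IAK', 'MRTAYLAK'):.1f}% identity")
```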
Comparative genomics (contd)
• Protein evolution
– Observation – many proteins composed of discrete domains
– Observation – many proteins have multiple domains shared with other
proteins
– Conclusion – domain shuffling must have occurred during evolution
– Some correlation between exons and protein domains
• Protein domains tend to be encoded in one or two exons
• New combinations of protein domains can be created by recombination
– LINEs
– Between repetitive elements in introns
• Exon shuffling – process of transferring exons (and hence functional
domains) between proteins
Comparative genomics (contd)
• Protein evolution (contd)
– Haemostatic (aka blood clotting) proteins as an exon shuffling paradigm
• Family of proteases that are activated by proteolysis
• Protein domains show strong correlation with exons
Comparative genomics (contd)
• Protein evolution (contd)
– What is horizontal gene transfer? Transfer of genes or protein domains
across unrelated species
• Frequently identifiable by different patterns of codon usage from
other genes, particularly ribosomal proteins
• Fairly rare with eukaryotes
• Happens in prokaryotes all the time – Examples?
– e.g., transfer of antibiotic resistance among bacteria
– Plasmid exchange, phage infections and transfer
– Often associated with pathogenicity
» Pathogenic variants of bacteria frequently have lots of
inserted DNA
» e.g., E. coli O157:H7 has 800 kb more than lab strains of E.
coli, much of which is virulence factors, prophages and
prophage-like elements
– What does this suggest about nature of virulence?
Virulence is acquired, i.e., transferred from one organism
to another
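The codon usage clue mentioned above can be sketched as a comparison of codon frequency tables: a gene whose codon preferences differ sharply from the host's typical genes (for example its ribosomal protein genes) is a candidate for horizontal transfer. The two sequences and the distance measure below are illustrative assumptions; published methods use more refined statistics and far longer sequences.

```python
from collections import Counter

def codon_freqs(cds):
    """Relative frequency of each codon in an in-frame coding sequence."""
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    counts = Counter(cds[i:i + 3] for i in range(0, usable, 3))
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}

def usage_distance(f1, f2):
    """Half the L1 distance between two tables: 0 = identical, 1 = disjoint."""
    codons = set(f1) | set(f2)
    return 0.5 * sum(abs(f1.get(c, 0) - f2.get(c, 0)) for c in codons)

# Hypothetical sequences: same amino acids (Ala-Leu-Ala-Leu), different codons
host_like = "GCGCTGGCGCTG"
foreign_like = "GCTTTAGCTTTA"
print(usage_distance(codon_freqs(host_like), codon_freqs(host_like)))
print(usage_distance(codon_freqs(host_like), codon_freqs(foreign_like)))
```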
Comparative genomics (contd)
• Is there a minimal genome? How would you define “minimal genome”?
– Encoding the essential set of proteins required for life?
– Compare genomes of archaebacteria, eubacteria and yeast
• Issues with how genes are classified but a reasonably good
approximation can be made
• Can identify 322 clusters of orthologous groups required for all key
biosynthetic pathways that might be required in free-living organisms
– But remember about non-orthologous gene displacements!
• Some lessons from bacterial genomics
– Nearly half of ORFs are of unknown function
– About 25% of all ORFs are unique to a particular species!
• Suggests that many new protein families remain to be discovered
• Many new functions may be uncovered
– Periodic re-evaluation of sequenced genomes is useful
• Compare with newly acquired data
– Often find additional ORFs and genes
– Much conservation of gene position
• Same genes found in many genomes at same positions (good for
evolutionary studies)
Comparative genomics (contd)
• What do we get from comparative genomics?
– Powerful new tools to identify conserved sequences
• important regulatory elements
• Unidentified genes
• Features (promoters, splice sites, etc)
– Important information about genome evolution
• Where did related genes originate?
• When did genome duplications arise?
• What is the history of life on earth?
– And by implication, life elsewhere
• What is the genetic diversity in wild populations?
– Environmental shotgun sequencing
– Information required to identify gene function
• Protein sequence and structure comparisons
Construction of cDNA libraries
• What is a cDNA library?
– Collection of DNA copies representing the expressed mRNA population of
a cell, tissue, organ or embryo
• What are they good for?
– Identifying and isolating expressed mRNAs
– functional identification of gene products
– cataloging expression patterns for a particular tissue
• EST sequencing and microarray analysis
– Mapping gene boundaries
• Promoters
• Alternative splicing
Determinants of library quality
• What constitutes a full-length cDNA?
– Strictly, it is an exact copy of the mRNA
– full-length protein coding sequence considered acceptable for most
purposes
• mRNA
– full-length, capped mRNAs are critical to making full-length libraries
– cytoplasmic mRNAs are best – WHY?
They are processed, i.e., introns removed and poly A is added
• 1st strand synthesis
– complete first strand needs to be synthesized
– issues about enzymes
• 2nd strand synthesis
– thought to be less difficult than 1st strand (probably not)
• choice of vector
– plasmids are best for EST sequencing and functional analysis
– phages are best for manual screening
cDNA synthesis
• Scheme
– mRNA is isolated from source of interest
– 1-10 μg are denatured and annealed to primer containing d(T)nV
• To minimize length of poly A tail in libraries for sequencing
– reverse transcriptase copies mRNA into cDNA
– DNA polymerase I and RNase H convert remaining mRNA into DNA
– cDNA is rendered blunt ended
– linkers or adapters are added for cloning
– cDNA is ligated into a suitable vector
– vector is introduced into bacteria
• Caveats
– there is lots of bad information out there
• much is derived from vendors who want to increase sales of their
enzymes or kits
– not all manufacturers make enzymes of equal quality
– most kits are optimized for speed at the expense of quality
– small points can make a big difference in the final outcome
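The d(T)nV primer in the scheme above is a degenerate oligo: V is the IUPAC code for A, C or G (anything but T). Because the 3' base cannot pair with an A in the mRNA, the primer is anchored at the junction between the poly A tail and the upstream sequence, rather than annealing anywhere along the tail. A minimal sketch that expands the degenerate primer; the choice of n = 18 is an assumed illustrative length.

```python
IUPAC_V = "ACG"  # V = not T

def anchored_oligo_dt(n_t=18):
    """Expand d(T)nV into its concrete primer sequences, written 5' to 3'."""
    return ["T" * n_t + anchor for anchor in IUPAC_V]

for primer in anchored_oligo_dt():
    print(primer)
```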