NCSU Summer Institute of
Statistical Genetics, Raleigh 2004:
Genome Science
Session II: Genome Sequencing
Genome and EST sequencing
•Sequencing Technologies
•Informatics Tools
•Sequencing project approaches
•EST sequencing projects
•Genome sequencing projects
•What we have learned
The Summer Institute 2004
Some Terms
• Complementary – nucleotide sequences that will
form specific hybrids
• Hybridize – to form a duplex between complementary strands
• Label – a molecular tag that facilitates detection
• Oligonucleotide – a short single-stranded piece of
nucleic acid
• Anneal – to incubate nucleic acid species together
under conditions that promote specific
hybridization
Why study genomes
•Molecular biology and biochemistry need a
point of entry
•Genetics is reliant on phenotype
•Hypothesis-driven versus data-production science - parallels with early naturalists and
modern-day physics
•Identify similarities and differences amongst
diverse life forms
Data mining vs. Data Dredging
Gene structural features
[Figure: gene structure diagram; labels: sequence read, cDNA, genomic DNA, exon, polyA tail]
•Hybridization of complementary strands
•Specificity of base pairing
•Almost any DNA is clonable
•You can have the same sequence - but
different genes
Sequencing Technologies
•Basic principles
•Dideoxy chain termination
•Electrophoretic separation
•Visualization
•Innovations
•Fluorescent tags
•Thermocycling
•Capillary electrophoresis
•Novel methodologies
•Sequencing by hybridization
•Mass spectrometry
•Nanopore sequencing
•Other things of note
Primer extension
5’-ACGTGATCCCAGTACCGTAGCTAGCCAGTTTAGATCGATTCGA-3’
                            3’-AAATCTAGCTAAGCT-5’
•The extended molecule is the reverse complement of the target
•The extended molecule can be tagged for visualization
•Extension occurs via a 3’ hydroxyl group
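The points above can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, the primer is the slide's primer rewritten 5’ to 3’, and the model assumes a perfect match at the 3’ end of the template with extension running to the end of the template.

```python
# Watson-Crick pairing for DNA bases.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence written 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def extend_primer(template: str, primer: str) -> str:
    """Extend a primer (5'->3') annealed to the 3' end of the template.

    Returns the full-length extension product written 5'->3'.
    Assumes a perfect match and extension to the end of the template.
    """
    annealing_site = reverse_complement(primer)
    if not template.endswith(annealing_site):
        raise ValueError("primer does not anneal to the 3' end of the template")
    # New bases are added to the primer's 3' hydroxyl, so the finished
    # molecule is the reverse complement of the whole template.
    return reverse_complement(template)


template = "ACGTGATCCCAGTACCGTAGCTAGCCAGTTTAGATCGATTCGA"
primer = "TCGAATCGATCTAAA"  # 3'-AAATCTAGCTAAGCT-5' rewritten 5'->3'
product = extend_primer(template, primer)
print(product)
```

The product starts with the primer itself, since extension only ever adds bases to the primer's 3’ end.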
Dideoxy chain termination
Dideoxynucleotides (ddNTPs) terminate extension because they
lack a 3’-OH
By mixing ddATP with dATP a pool of extension
products is created wherein termination at each
available A occurs
The termination products can be separated by size
and visualized by labeling either the ddNTP or the
primer
5’-ACGTGATCCCAGTACCGTAGCTAGCCAGTTTAGATCGATTCGA-3’
                        ATCGGTCAAATCTAGCTAAGCT-5’
5’-ACGTGATCCCAGTACCGTAGCTAGCCAGTTTAGATCGATTCGA-3’
                    ATCGATCGGTCAAATCTAGCTAAGCT-5’
5’-ACGTGATCCCAGTACCGTAGCTAGCCAGTTTAGATCGATTCGA-3’
               ATGGCATCGATCGGTCAAATCTAGCTAAGCT-5’
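A minimal sketch of the ddATP reaction, assuming an idealized reaction in which termination can happen at every A added after the primer. The function name and the 15-base primer length are assumptions taken from the example on the slide, not part of any real tool.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def termination_products(template: str, primer_len: int, dd_base: str) -> list[str]:
    """List the chain-terminated products for one ddNTP, written 5'->3'.

    The primer is assumed to pair with the last `primer_len` bases of the
    template, so the first `primer_len` bases of every product are the primer.
    """
    # Full-length extension product (reverse complement of the template).
    full = "".join(COMPLEMENT[base] for base in reversed(template))
    products = []
    for i in range(primer_len, len(full)):
        if full[i] == dd_base:
            # A dideoxy base incorporated here ends the chain at position i.
            products.append(full[: i + 1])
    return products


template = "ACGTGATCCCAGTACCGTAGCTAGCCAGTTTAGATCGATTCGA"
for product in termination_products(template, primer_len=15, dd_base="A"):
    # Print 3'->5' so the output can be read against the template.
    print(len(product), product[::-1])
```

Each product ends in an A opposite a T in the template, so the set of product lengths marks the positions of every T upstream of the primer.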
Sanger sequencing
[Figure: sequencing gel; lanes labeled by base]
ddNTP/dNTP mixtures are made
up for each of the four nucleotides
- adenine, cytosine, guanine,
thymine
Proportion of dideoxy to deoxy
NTP determines the frequency of
termination
Products from the four reactions
are separated by size and DNA
sequence is inferred
Invert gel to read the sequence 5’ to 3’
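Reading a four-lane gel amounts to sorting the fragments by size and noting which lane each came from. The sketch below assumes clean data with exactly one fragment per length; the lane contents are made-up example values and the function name is hypothetical.

```python
def read_gel(lanes: dict[str, list[int]]) -> str:
    """Infer the newly synthesized strand (5'->3') from four termination lanes.

    `lanes` maps each base to the fragment lengths seen in that lane,
    counted in bases added beyond the primer. The shortest fragment
    gives the first base after the primer.
    """
    calls = sorted(
        (length, base) for base, lengths in lanes.items() for length in lengths
    )
    return "".join(base for _, base in calls)


# Hypothetical fragment lengths for the four ddNTP reactions.
lanes = {"A": [1, 4], "C": [2], "G": [3, 6], "T": [5]}
print(read_gel(lanes))
```

The result is the sequence of the synthesized strand; the template strand is its reverse complement.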
Fluorescent sequencing
Each ddNTP is labeled with a different
fluor - now all four products can be run
in the same gel lane
Fluorescence is detected using a laser
scanner to produce a false color image
Electropherograms (chromatograms)
are produced that display peak
intensity for each fluor
Can also differentially label the primer
to achieve the same end
Cycle sequencing with PCR
Sanger sequencing can require
large amounts of template
Polymerase chain reaction
exponentially amplifies specific
DNAs
Use of ddNTPs allows the
combination of amplification and
dideoxy terminator sequencing
High-throughput sequencing
•Dideoxy terminator sequencing is
robust and flexible
•Microtiter format
•PCR based cycle sequencing
requires less template
•Fluorescent sequencing increased
gel capacity 4X
•Supporting robotics upstream of
sequencing process
•Computational tools
•Capillary sequencers
Capillary gels
Slab gels make life in the sequencing lab difficult for many reasons:
•Pouring the gel is time consuming and prone to error
•The microtiter plate format (sequencing reactions) has spacing that is different
than the gel loading comb - cumbersome
•Assembly and disassembly of the sequencing apparatus is messy and time
consuming
•Manual lane tracking is time consuming and prone to error
•Gels never run perfectly - lanes can sometimes run together, making lane
tracking difficult
Capillary gels help because:
•Each sequencing reaction is run in a separate capillary - there is no lane
tracking to worry over
•Matrix for the capillary gel is robotically assembled, injected and QC’d
•Robotic loading of samples is compatible with walk-away capability
Informatics essentials
•Basecallers convert trace data to sequence
•Assemblers form contiguous sequences from small
chunks
•Viewers/Editors allow the scientist to interactively
work with data
•Databases store sequencing data - from
electropherograms to annotation
•Analysis tools compare the sequence against
databases of sequences and use algorithms to
make educated guesses about the structure and
function of a given sequence
Basecalling
•Is the spacing of the peaks what is expected?
•Is there a peak in the electropherogram?
•What fluor is responsible for this peak?
•Since noise ensures the presence of more than one peak, which peak
is the correct peak?
•What is the probability that the base that is assigned is the correct
base?
•Phred score - Phred 20 (1 error in 100 bases) is a typical quality
standard
•TraceTuner - algorithm is similar to Phred but reportedly more accurate
with ABI3700 traces, plus accelerated execution
•Others are available
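The Phred scale relates a quality score Q to the estimated probability P that a base call is wrong through Q = -10 log10(P), which is why Phred 20 corresponds to 1 error in 100 bases. A small sketch of the conversion follows; the function names and the example qualities are illustrative, not part of Phred itself.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Probability that a base call with Phred quality q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred quality corresponding to an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


def expected_errors(qualities: list[int]) -> float:
    """Expected number of miscalled bases in a read, given per-base qualities."""
    return sum(phred_to_error_prob(q) for q in qualities)


# Hypothetical per-base qualities for a short read.
read_qualities = [30, 30, 20, 20, 10]
print(phred_to_error_prob(20))
print(expected_errors(read_qualities))
```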
Assembly
• Production of a single contiguous
sequence from multiple sequence
reads
• The best assembly programs
(including Phrap) use probability
scores directly from the output of
basecallers such as Phred
• Phrap was designed for genome
sequencing projects - EST
assemblies make different
assumptions
• Final assembly products include
contigs and singletons
• Accuracy of the contig consensus
sequence is based on error models
propagated from basecalling
software
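To make the idea of assembly concrete, here is a toy greedy assembler that repeatedly merges the two sequences with the longest exact suffix-prefix overlap. It is a sketch under strong assumptions: error-free reads, a single strand, and no use of quality scores, so it is not how Phrap works. All names and reads are hypothetical.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of `a` equal to a prefix of `b`.

    Returns 0 if the longest such overlap is shorter than `min_len`.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Merge reads greedily by largest overlap until no overlaps remain."""
    seqs = list(reads)
    while True:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(seqs):
            for j, b in enumerate(seqs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return seqs
        merged = seqs[best_i] + seqs[best_j][best_len:]
        seqs = [s for k, s in enumerate(seqs) if k not in (best_i, best_j)]
        seqs.append(merged)


reads = ["ACGTGATC", "GATCCCAG", "CCAGTACC", "TTTTTTTT"]
print(greedy_assemble(reads))
```

In the output, a sequence built from several reads plays the role of a contig, and a read that never merged is a singleton.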
Viewers/editors
Consed
Storage and analysis of sequence
• The amount of sequence information deposited in databases is increasing at a
very rapid rate
• Tools to manage sequence data are imperfect and in development
• Development of controlled vocabularies and gene ontologies will facilitate
database integration
• Analytical tools and algorithm development are growth industries
http://www.ebi.ac.uk/Databases/index.html
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi
http://www.ncbi.nlm.nih.gov/
Impact of database structure
•Flat file databases are great for speed but are not built for integration
•Lack of controlled vocabularies impedes efficient and reliable searching
and inhibits integration
•GenBank uses a controlled index and vocabulary - sort of
•Example of searching for genomic sequence, EST sequence and
complete cDNAs
•Relational databases are great for integration but can be slow and
changing the schema takes an act of Congress
•Flat file databases with robust re-indexing routines have the advantage
of speed and the ability to integrate different data types
Sequencing by hybridization
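Target: TGCATGCATGCA
Overlapping 6-mers read from the target, with the number listed beside each on the slide:
TGCATG…1, GCATGC…3, CATGCA…9, ATGCAT…2, TGCATG…12, GCATGC…3, CATGCA…9
[Figure: array of 12 numbered oligo spots; oligos shown: ACGTAC, TTCCGG, CGTACG, CGCGGA, TACGTA, AATTCC, CACAGA, AGCAGC, CGTACG, GGGCCC, GTACGT, ACGTAC]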
Determine constituent sequences by hybridizing to oligos of known sequence
Assemble sequence fragments into contiguous sequence
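The two steps can be sketched as follows: list the k-mers a target contains (what an ideal array would report), then chain them through their (k-1)-base overlaps. This is a simplified illustration with hypothetical function names. It assumes an error-free array and a target without repeats, and it deliberately refuses repetitive targets such as TGCATGCATGCA, whose k-mer set does not determine a unique sequence.

```python
def spectrum(target: str, k: int) -> set[str]:
    """All distinct k-mers in the target: what an ideal array would report."""
    return {target[i:i + k] for i in range(len(target) - k + 1)}


def reconstruct(kmers: set[str]) -> str:
    """Chain k-mers through their (k-1)-base overlaps.

    Assumes every k-mer occurs once in the target and every overlap is
    unique; raises ValueError when that does not hold.
    """
    by_prefix = {kmer[:-1]: kmer for kmer in kmers}
    suffixes = {kmer[1:] for kmer in kmers}
    # The first k-mer is the only one whose prefix is not another's suffix.
    starts = [kmer for kmer in kmers if kmer[:-1] not in suffixes]
    if len(by_prefix) != len(kmers) or len(starts) != 1:
        raise ValueError("repeats make this spectrum ambiguous")
    seq = starts[0]
    k = len(seq)
    used = {seq}
    for _ in range(len(kmers) - 1):
        nxt = by_prefix.get(seq[-(k - 1):])
        if nxt is None or nxt in used:
            raise ValueError("k-mers do not form a single chain")
        used.add(nxt)
        seq += nxt[-1]
    return seq


print(reconstruct(spectrum("ACGTGATCCCAG", 4)))
```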
Sequencing by mass spectrometry
[Figure: overlapping 6-mers TGCATG, GCATGC, CATGCA, ATGCAT, TGCATG, GCATGC]
•Obtain mass spectra from a reference panel of oligos
•Fragment the unknown and obtain its mass spectra
•Deconvolute the data
454
US Genomics
Nanopore sequencing
•Two solution filled compartments
separated by a membrane with a
channel
•Ions flow through the channel in
response to an applied voltage
•DNA is negatively charged and will
be drawn through the channel
•Channel size allows DNA
molecules to be drawn into and
through the channel one at a time
•Current is reduced when the
channel is occupied by DNA
•Duration of the current drop is
proportional to length of DNA
•Extent of current drop is indicative
of physicochemical properties of
DNA - thus, one can infer
sequence from the trace
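As a rough illustration of the measurement, the sketch below scans a current trace for blockade events and reports how long each lasted and how deep the drop was. The trace values, the 0.8 threshold, and the function name are all made up for the example; inferring sequence from a real trace is a much harder problem than this.

```python
def blockade_events(
    current: list[float], open_level: float, threshold: float = 0.8
) -> list[tuple[int, int, float]]:
    """Find runs of samples where the current falls below threshold * open_level.

    Returns one (start_index, duration_in_samples, mean_drop) tuple per event.
    """
    events = []
    start = None
    # A trailing open-channel sample acts as a sentinel that closes any
    # event still in progress at the end of the trace.
    for i, value in enumerate(current + [open_level]):
        blocked = value < threshold * open_level
        if blocked and start is None:
            start = i
        elif not blocked and start is not None:
            samples = current[start:i]
            mean_drop = open_level - sum(samples) / len(samples)
            events.append((start, i - start, mean_drop))
            start = None
    return events


# Hypothetical trace: two blockades of different duration and depth.
trace = [100.0, 100.0, 40.0, 40.0, 40.0, 100.0, 100.0, 60.0, 60.0, 100.0]
print(blockade_events(trace, open_level=100.0))
```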
Sequencing project approaches
•EST projects
•Map-based: assembly based on physical
ordering of clones
•Shotgun: assembly based on computational
ordering of sequences
•Combination strategies: minimal scaffolding
from physical maps, fill in the blanks by
shotgun and directed sequencing
EST sequencing projects
•Only the expressed genome is sequenced,
thereby avoiding the “junk”
•Relatively inexpensive and fast - accessible to
small laboratories
•May fail to capture many genes because the
appropriate biological condition leading to
expression is not captured
•May overestimate gene number due to non-overlapping sequences from the same gene
Project is the operative word
Libraries of overlapping clones
•Library clones can be ordered by
the presence of restriction sites,
known sequences, etc.
•Assembly of contiguous
sequences is straightforward
because the clones form an
ordered array
Map-based sequencing
•Produce large insert libraries in BACs, cosmids, etc. to “cover”
the genome multiple times
•Determine a minimal tiling path of clones by restriction
mapping, hybridization of end-based probes or end sequencing
•Ordered sets of clones are subcloned into pools of small
clones
•Smaller clones can be ordered or sequenced by shotgun
methods
•Fewer sequencing runs = lower costs
•Obtaining an ordered array of clones can be time consuming
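Once clones have been placed on a physical map, choosing a minimal tiling path is an interval-covering problem. The sketch below uses the standard greedy approach: from the current position, take the clone that reaches furthest to the right. The clone names and coordinates are invented, and the model treats clones that merely abut as contiguous, whereas a real project would require some overlap.

```python
Clone = tuple[str, int, int]  # (name, map start, map end)


def minimal_tiling_path(clones: list[Clone], region_start: int, region_end: int) -> list[str]:
    """Pick the fewest clones whose map intervals cover the region."""
    ordered = sorted(clones, key=lambda clone: clone[1])
    path = []
    pos = region_start
    i = 0
    while pos < region_end:
        best = None
        # Among clones starting at or before `pos`, keep the one ending last.
        while i < len(ordered) and ordered[i][1] <= pos:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        if best is None or best[2] <= pos:
            raise ValueError(f"gap in clone coverage at position {pos}")
        path.append(best[0])
        pos = best[2]
    return path


# Hypothetical BAC clones positioned on a 400 kb map (coordinates in kb).
clones = [
    ("A", 0, 100), ("B", 50, 180), ("C", 90, 200),
    ("D", 150, 320), ("E", 190, 300), ("F", 310, 400),
]
print(minimal_tiling_path(clones, 0, 400))
```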
Shotgun sequencing
• Produce sequences from random clones irrespective of their physical order along
the chromosomes
• Clones can be small insert or large insert because alignment takes into account only
the sequence - not properties of the physical clones
• Assemble sequences to produce contigs
• Identify gaps in contiguous sequence and undersequenced areas
• Perform directed sequencing to fill in the gaps
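Finding gaps and undersequenced areas can be illustrated by counting how many reads cover each position. This sketch assumes the reads have already been placed on a common coordinate system. The read coordinates, the genome length, and the function name are invented for the example.

```python
def coverage_gaps(
    read_spans: list[tuple[int, int]], genome_length: int, min_depth: int = 1
) -> list[tuple[int, int]]:
    """Return half-open (start, end) intervals where depth is below min_depth."""
    depth = [0] * genome_length
    for start, end in read_spans:
        for pos in range(max(start, 0), min(end, genome_length)):
            depth[pos] += 1
    gaps = []
    gap_start = None
    # The sentinel value closes a gap that runs to the end of the sequence.
    for pos, d in enumerate(depth + [min_depth]):
        if d < min_depth and gap_start is None:
            gap_start = pos
        elif d >= min_depth and gap_start is not None:
            gaps.append((gap_start, pos))
            gap_start = None
    return gaps


# Hypothetical read placements on a 35-base sequence.
spans = [(0, 10), (5, 15), (20, 30)]
print(coverage_gaps(spans, 35))               # uncovered regions
print(coverage_gaps(spans, 35, min_depth=2))  # undersequenced regions
```

The intervals returned with `min_depth=1` are the gaps that directed sequencing would need to fill; raising `min_depth` flags regions supported by too few reads.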
Shotgun sequencing issues
•Assembly is computationally intensive
•Repetitive sequences have to be masked so that they do not
confound the preliminary alignment
•First pass alignment based upon non-masked sequences to
produce contiguous sequence fragments
•Alignments must account for potential polymorphisms
•Repetitive sequences still need to be aligned - their treatment
is however distinct from non-repetitive sequences
•Resolution of conflicts in the assembly is challenging
•When is a genome truly finished?
•The press release is only the beginning of the process
Complementary strategies
• Pure shotgun approaches are
likely to leave significant gaps
• Directed sequencing of specific
regions is necessary to fill in the
gaps
• Pure map-based strategies are
cumbersome and time consuming
and do not take advantage of
efficiencies of scale found in
modern industrial sequencing
• A complementary approach
combines data from both
approaches
• There are adherents to working
from the bottom-up and working
from the top-down
Genome sequencing projects
What is a gene
•ESTs and cDNAs identify those parts of the genome
that are actually transcribed
•Transcripts have structural features including starts,
stops and open reading frames
•Computers can be trained to “sniff” for relevant
features in the sequence
•Genefinding algorithms construct probability models
based on presence of one or more gene-like features
•Coordination with genetic features gives a comfort
level because it is empirical
•Computational methods that rely on similarity to
“known” genes in databases can be perilous - a sort
of regressive uncertainty
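The simplest way for a computer to "sniff" for one of these features is to scan for open reading frames. The sketch below looks for stretches that begin with ATG and run to the first in-frame stop codon on the forward strand only. It is a deliberately naive illustration with a made-up sequence and a hypothetical function name: it ignores introns and the reverse strand, and it is not one of the probabilistic genefinding models mentioned above.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 3) -> list[tuple[int, int]]:
    """Return half-open (start, end) coordinates of forward-strand ORFs.

    An ORF here runs from an ATG to the first in-frame stop codon,
    inclusive, and must span at least `min_codons` codons.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs


sequence = "CCATGAAATTTTAGGG"
for start, end in find_orfs(sequence):
    print(start, end, sequence[start:end])
```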
BLAST
Example Sequence
BLAST Example
How to make a human
The Human Genome Project
http://www.nature.com/genomics/human/index.html
http://www.sciencemag.org/content/vol291/issue5507/
Genome information challenges
• Data integration from sequence, mutant analysis, mapping, expression analysis,
metabolic profiling, and other data types will be the primary challenge to biological
science in the 21st century
• Informatics tools are in their infancy
• The literature is growing at a rate surpassing sequence data
• Importance of statistics cannot be overstated
• Gene annotation is regressive
• Danger of balkanization of data?
• Is natural language processing the holy grail?
Link to Ensembl
Link to FlyBase
Link to Entrez Genomes
Link to ExPASy
Link to SachDB
Link to KEGG
Link to TAIR