Chapter 4. The Genomic Biologist’s Toolkit
Contents
4. Genomic Biologists tool kit
4.1. Restriction Endonucleases – making “sticky ends”
4.2. Cloning Vectors
4.2.1. Simple Cloning Vectors
4.2.2. Expression Vectors
4.2.3. Shuttle Vectors
4.2.4. Phage Vectors
4.2.5. Artificial Chromosome Vectors
4.3. Methods for Sequence Amplification
4.3.1. Polymerase Chain Reaction
4.3.2. Cloning Recombinant DNA
4.3.3. Cloning DNA in Expression Vectors
4.3.4. Making Complementary DNA (cDNA)
4.3.5. Cloning a cDNA Library
4.4. Genomic Libraries
4.4.1. Cloning in YAC Vectors
4.4.2. Cloning in BAC Vectors
4.5. DNA sequencing
4.5.1. Electrophoresis
4.5.2. Sanger Dideoxy Sequencing
4.5.3. Capillary Sequencers
4.5.4. Next Generation Sequencing
4.5.5. 3rd Generation Sequencing
4.6. DNA Sequencing Strategies
4.6.1. Map-based Strategies
4.6.2. Whole Genome Shotgun Sequencing
4.7. Genome Annotation
4.7.1. Using Bioinformatic Tools to Identify Putative Protein Coding Genes
4.7.2. Comparison of predicted sequences with known sequences (at NCBI)
4.7.3. Published Genomes
CONCEPTS OF GENOMIC BIOLOGY
Page 4-1
CHAPTER 4. THE GENOMIC BIOLOGIST’S TOOLKIT
Genomic biology has three important branches: structural
genomics, comparative genomics, and functional genomics. The
ultimate goals of these branches are, respectively, the
sequencing of genes and genomes; the comparison of those
sequenced genes and genomes; and an understanding of how genes
and genomes work to produce the complex phenotypes of all
organisms.
A set of molecular genetic technologies has been, and remains,
critical to our ability to pursue these goals. The Genomic
Biologist’s Toolkit provides a brief introduction to these
critical tools and to how they are used in the investigation of
genomes. While the techniques are intrinsically laboratory
tools, what they can do and how they work can be readily
studied using bioinformatic resources.
4.1. RESTRICTION ENDONUCLEASES
Restriction endonucleases (restriction enzymes) each
recognize a specific DNA sequence (restriction site), and
break a phosphodiester linkage between a 3’ carbon and
phosphate within that sequence. Restriction enzymes
are used to create DNA fragments for cloning and to
analyze positions of restriction sites in cloned or
genomic DNA. A specific restriction enzyme cuts DNA at
the same sites in every molecule if the digestion is allowed
to go to completion. Thus, restriction digestion is a method
by which all copies of a genome, or of any other long
sequence, can be reproducibly cut into identical fragments.
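The reproducibility of a complete digest can be illustrated with a minimal Python sketch. It assumes a linear molecule and looks only at the top strand; the function name `digest` and the toy sequence are made up for illustration. EcoRI (recognition site GAATTC, cut after the G) is used as the example enzyme.

```python
def digest(sequence, site, cut_offset):
    """Cut a linear DNA sequence at every occurrence of a recognition site.

    cut_offset is the number of bases into the site after which the top
    strand is cut (1 for EcoRI, which cuts G^AATTC).
    """
    sequence = sequence.upper()
    fragments = []
    start = 0
    pos = sequence.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(sequence[start:cut])
        start = cut
        pos = sequence.find(site, pos + 1)
    fragments.append(sequence[start:])  # piece after the last cut
    return fragments


# Hypothetical toy molecule containing two EcoRI sites.
toy_dna = "TT" + "GAATTC" + "AAAAGGG" + "GAATTC" + "CC"
fragments = digest(toy_dna, "GAATTC", 1)
lengths = [len(f) for f in fragments]
print(fragments)
print(lengths)
```

Every copy of the same molecule gives the same list of fragments, which is the property that makes restriction digests useful for cloning and mapping.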
The first three letters of the name of a restriction
enzyme are derived from the genus and species of the
organism from which it was isolated. Additional letters
often denote the bacterial strain from which the
restriction enzyme was isolated, and if multiple enzymes
are isolated from the same strain, they are given Roman
numerals. For example, the restriction enzyme EcoRI is
the first enzyme isolated from the RY13 strain of
Escherichia coli.
Bacteria produce restriction endonucleases to defend
against bacteriophages (viruses), and each restriction
enzyme recognizes a specific DNA sequence where it cuts the
DNA strands (see Table 4.1 & Figure 4.1). The recognition
sites for a given restriction enzyme are often rare in the
genome of the bacterium that produces it, but they are
abundant in the genome of the bacteriophage. In addition, the
DNA of the host cell can be modified by methylation, which
prevents the host cell’s restriction enzymes from degrading
host cell DNA, while invading bacteriophage DNA is
unmethylated and readily degraded.
Table 4.1. Characteristics of Some Restriction Enzymes
Figure 4.1. Restriction site sequences and cut locations of:
a) SmaI; b) BamHI, and c) PstI.
Many restriction sites are sequences of 4, 6, or 8
base pairs in length and have identical sequences from
5’ to 3’ on each strand. These sequences are referred to
as palindromic DNA sequences. Other restriction sites
are not completely symmetrical and/or differ in length
from 4, 6, or 8 nucleotide pairs (Table 4.1 & Figure 4.1).
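Whether a site is palindromic in this sense can be checked by comparing it with its own reverse complement. The sketch below is illustrative only: the helper names are invented here, the three enzyme sites are those of Figure 4.1 plus EcoRI, and the non-palindromic sequence is a made-up counterexample.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    """Return the opposite strand, read 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def is_palindromic_site(seq):
    """True when both strands read the same from 5' to 3'."""
    return seq.upper() == reverse_complement(seq)


sites = {"SmaI": "CCCGGG", "BamHI": "GGATCC", "PstI": "CTGCAG", "EcoRI": "GAATTC"}
for enzyme, site in sites.items():
    print(enzyme, site, is_palindromic_site(site))

# A made-up asymmetric sequence for contrast.
print("GAATTA", is_palindromic_site("GAATTA"))
```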
As shown in Figure 4.1, the nature of the fragment ends
produced when a restriction enzyme cuts DNA can vary. Some
enzymes cut both strands at the same position, producing
fragments whose two strands are equal in length. These are
referred to as blunt ends. Other enzymes make staggered cuts,
producing fragments whose two strands are unequal in length.
These are referred to as either 5’ sticky ends or 3’ sticky
ends, depending on which strand end overhangs. Because the
overhangs left by one enzyme are complementary to each other,
sticky ends provide a basis for combining DNA fragments
produced by the same restriction enzyme from different DNA
sources. This process was the original method used to produce
recombinant DNA molecules.
The application of restriction endonucleases to the
cloning of DNA is discussed further in the DNA Cloning video,
which can be viewed by clicking on the link. Note that
part of this video will be discussed in detail in the next
section of the Genomic Biologist’s Toolkit, but the first
part of the video is a good demonstration of how
restriction enzymes work and how they can be used to
create recombinant DNA molecules for cloning DNA.
Figure 4.2. Using the restriction enzyme EcoRI to make recombinant
DNA. The procedure relies on the 5’-overhanging “sticky ends”.
An additional application of restriction enzymes
involves the production of a restriction map. A
restriction map shows the relative positions of
restriction sites for multiple restriction enzymes in a
piece of linear or circular DNA. Prior to the availability
of genomic sequences, restriction mapping was an
important tool used to characterize cloned DNA
fragments. The production of a restriction map for a
circular DNA is shown in the Restriction Mapping video.
Note that we have previously discussed SNPs as a
type of Sequence Tagged Site (STS). Because SNPs are single
nucleotide changes in the genome sequence, consider the effect
of an SNP that happens to occur in a restriction
endonuclease recognition site. The result would be the
loss of the restriction site at that SNP. The site would no
longer be cut by the enzyme, and thus fragments
having different sizes would be produced. This is called a
Restriction Fragment Length Polymorphism (RFLP).
Thus, an RFLP is an SNP that happens to occur in a
restriction site in the DNA. A famous RFLP is associated
with Sickle Cell Disease, and is further described in the
accompanying video.
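The effect of such an SNP on fragment sizes can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: both alleles are invented sequences, the enzyme is EcoRI (GAATTC, cut after the G), and only the top strand of a linear molecule is considered.

```python
def fragment_lengths(sequence, site, cut_offset):
    """Lengths of the fragments from a complete digest of a linear sequence."""
    cuts = []
    pos = sequence.find(site)
    while pos != -1:
        cuts.append(pos + cut_offset)
        pos = sequence.find(site, pos + 1)
    edges = [0] + cuts + [len(sequence)]
    return [edges[i + 1] - edges[i] for i in range(len(edges) - 1)]


ECORI = "GAATTC"

# Hypothetical allele A carries two EcoRI sites.
allele_a = "ACGT" * 3 + ECORI + "TTTT" * 5 + ECORI + "CCCC" * 2
# Hypothetical allele B differs by one base (C to T) inside the first site.
allele_b = allele_a.replace("GAATTC", "GAATTT", 1)

print(fragment_lengths(allele_a, ECORI, 1))  # three fragments
print(fragment_lengths(allele_b, ECORI, 1))  # two fragments, one of them longer
```

The two smaller fragments of allele A merge into a single longer fragment in allele B, which is the size difference that is detected as an RFLP.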
4.2. CLONING VECTORS
The process of “DNA cloning” involves a set of
experimental methods in molecular biology that are
used to assemble recombinant DNA molecules and to
direct their replication within host organisms. The use
of the word cloning refers to the fact that the method
involves the replication of one molecule to produce a
population of cells with identical DNA molecules.
Molecular cloning generally uses DNA sequences from
two different organisms: 1) the organism that is the
source of the DNA to be cloned, and 2) the organism
that will serve as the living host for replication of the
recombinant DNA. Molecular cloning methods are
central to many areas of biology, biotechnology, and
medicine, including DNA sequencing.
The DNA derived from the host organism in a cloning
experiment, often called a vector, typically has three features:
1) Sequences necessary to produce recombinant DNA
and facilitate entry into the host organism. Typically,
these are one or more “unique” restriction sites.
“Unique” in this context means that each of these
restriction sites permits cutting the vector at only
one location. Most vectors contain a cluster of unique
restriction sites for a number of different restriction enzymes.
This cluster is called a polylinker or multiple cloning site
(MCS), and can make the use of the vector much easier.
2) An origin of replication for the host organism to
facilitate replication of the recombinant DNA in the
host cell. Typically this sequence controls the
number of copies of the vector that can be made in
one cell.
3) In order to facilitate identification of cells that contain
the vector containing recombinant DNA, a gene that
can be expressed in the host and that provides a
“selectable” marker for the presence of recombinant
DNA is provided. Often the selectable marker gene
will be a gene that makes cells resistant to a specific
antibiotic or that permits cells to make an amino acid
required for growth.
These are the basic requirements that all modern
cloning vectors contain, but beyond these basic
requirements, there can be a number of additional
features that make specific vectors useful for various
purposes. Thus, several types of cloning vectors have
been constructed, each with different molecular
properties and cloning capacities.
4.2.1. Simple Cloning Vectors
The most common vectors are used to clone
recombinant DNA in bacterial cells, typically E. coli.
Simple cloning vectors are constructed from plasmids, which
are common in many bacterial cells. Plasmids are
circles of double-stranded DNA (dsDNA), much smaller than
the bacterial chromosome, that naturally carry DNA between
different bacteria. They include a replication origin (ori
sequence) needed for replication in bacterial cells. An
example of a typical E. coli cloning vector is
pUC19 (2,686 bp). A more modern derivative of pUC19 is
pBluescript II. The features of these plasmids are shown in
Figure 4.3.
More information about cloning DNA in plasmid
vectors can be found in Molecular Cell Biology, 4th
edition, Section 7.1. This can be downloaded from NCBI
by clicking on the link. The use of simple cloning vectors
to clone recombinant DNA made via the use of DNA
restriction and overhanging sticky ends can be seen in
the attached Steps in DNA Cloning video. The use of
simple cloning to obtain a collection of clones
representing all sequences that can be cut from a longer
piece of DNA is called creating a clone library (see video)
of sequences. Libraries can be useful in several ways.
One of these might be to create an expression library that
makes specific proteins from each clone. This requires
an expression vector.
Figure 4.3. The features of pUC19 and pBluescript II
include:
1) High copy number in E. coli, with nearly 100 copies
per cell, provides a good yield of cloned DNA.
2) Its selectable marker is ampR.
3) It has a cluster of unique restriction sites, called
the polylinker (multiple cloning site).
4) The polylinker is part of the lacZ (β-galactosidase)
gene. The plasmid will complement a lacZ- mutation,
allowing the host cell to become lacZ+. When DNA
is cloned into the polylinker, lacZ is disrupted,
preventing complementation of the lacZ- mutation from
occurring.
5) X-gal, a chromogenic analog of lactose, turns blue
when β-galactosidase is present and remains
white in the absence of β-galactosidase, so blue-white
screening can indicate which colonies
contain recombinant plasmids.
4.2.2. Expression Vectors
Expression vectors contain all of the same elements
that simple cloning vectors contain, i.e. an ori, a
selectable marker, and a multiple cloning site (MCS); but the
MCS is flanked by a promoter sequence and a
terminator sequence that work in the host organism.
This permits the cloned sequence to be transcribed and,
if the vector contains a Shine-Dalgarno sequence (not
shown in Figure 4.4), to be translated into a protein,
provided there is a start codon and a stop codon in the
sequence.
Note that Figure 4.4 illustrates how the cloned
sequence can insert randomly in two orientations.
However, only one of the orientations will produce a
translatable mRNA. The other orientation will produce
an RNA that is the complementary strand
of the mRNA (called an antisense RNA). How to deal with
this issue is considered in Section 4.3.3.
4.2.3. Shuttle Vectors
A cloning vector capable of replicating in two or
more types of organism (e.g., E. coli and yeast) is called
a shuttle vector. Shuttle vectors may replicate
autonomously in both hosts, or integrate into the host
genome.
Figure 4.4. An example of a simple expression vector.
Figure 4.5. Shuttle vectors like pRS426 can be used to move cloned DNA
into 2 different organisms. In this case, the plasmid moves into E. coli
and yeast. Note that the vector contains an origin of replication for
yeast (yeast 2µ ARS) and E. coli (ori), a selectable marker gene for E. coli
(ampR) and yeast (URA3, which allows growth without the uracil that the
yeast strain used otherwise requires), and a multiple cloning site with a
yeast promoter and terminator on either side. Thus, this shuttle vector
can work in both E. coli and yeast.
4.2.4. Phage Vectors
Besides plasmid-based simple cloning vectors, there
are a number of other vectors that are not based on
plasmids. These often have specific uses that take
advantage of their unique properties. Among the types
of non-plasmid vectors, bacteriophage λ vectors (shown
in Figure 4.6) are among the most frequently used.
Phage λ vectors can be used to make expression
libraries and are convenient for selection of clones because
the bacteriophage lyses cells, releasing the contents of the
cell into the medium. Thus RNAs and proteins derived
from the inserted fragment can be investigated using
these vectors.
Figure 4.6. Phage λ vector.
4.2.5. Artificial Chromosome Vectors
The typical simple cloning vector will accommodate
DNA fragments up to about 3,000 bp in length.
However, there is often a need to clone significantly longer
fragments of DNA for study. Typically, genomic DNA
sequencing is easiest with the longest fragments
possible. Two vector systems, i.e. BAC vectors (bacterial
artificial chromosome) and YAC vectors (yeast artificial
chromosome), are useful choices for cloning long DNA
fragments. In BACs, fragments up to about 350 kbp
(350,000 bp) can be cloned, while in YACs fragments up to
1,000,000 bp have been reported. Both of these
vector systems were used in the original human genome
sequencing project. However, it was found that YACs
are relatively unstable, meaning that they frequently
rearrange, losing DNA in the process, and thus they
did not have the stability shown by BACs. Consequently,
BACs have emerged as the large cloning vector of choice.
4.3. METHODS OF SEQUENCE AMPLIFICATION
With our discussion of restriction endonucleases and
cloning vectors completed, we are now ready to put
these concepts together and show how specific DNA
sequences can be amplified for genetic and genomic studies.
4.3.1. Polymerase Chain Reaction (PCR)
Polymerase Chain Reaction or PCR is a method by
which DNA polymerase can be used to make many
copies of a DNA sequence in a test tube. The technique
is a valuable supplement to DNA cloning to generate
specific DNA sequences for use as reagents.
Figure 4.7. Artificial Chromosome vectors. a) Shows a
bacterial artificial chromosome (BAC) that has a
selectable marker (chloramphenicol resistance), and a
MCS. However, the ori sequence is replaced by a
single copy F factor origin of replication. b) Shows a
yeast artificial chromosome, including selectable
markers (TRP1 and URA3), a yeast origin of replication
(ARS), and centromere and telomere chromosome
parts. This vector will replicate in yeast cells.
A description of the PCR process is given in the
Polymerase Chain Reaction video. Click the link to view
this video. Some additional things to note are that the
reaction temperature is changed using a device called a
thermal cycler, which can rapidly change temperatures
during each cycle. The reaction mixture must have all
necessary components for PCR, including a
thermostable DNA polymerase like the Taq DNA
polymerase mentioned in the video. Such DNA
polymerases are obtained from organisms called
extremophiles that grow in very hot water like that
found in geysers (e.g. Old Faithful in Yellowstone
National Park) or thermal vents on the floor of the
ocean. The reaction also contains the deoxynucleotide
triphosphates (dNTPs, i.e. dATP, dGTP, dCTP, &
dTTP), and the primers, which define each end of the
sequence to be amplified.
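How the two primers define the ends of the amplified sequence can be sketched with a small "in silico PCR". This is a simplified illustration, not a primer design tool: it assumes exact primer matches on a single linear template, the sequences are invented, and `copies_after` assumes ideal doubling in every cycle, which real reactions only approximate.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]


def in_silico_pcr(template, forward_primer, reverse_primer):
    """Return the top-strand amplicon bounded by the two primers, or None."""
    start = template.find(forward_primer)
    if start == -1:
        return None
    # The reverse primer anneals where its reverse complement
    # appears on the top strand, downstream of the forward primer.
    target = reverse_complement(reverse_primer)
    end = template.find(target, start + len(forward_primer))
    if end == -1:
        return None
    return template[start:end + len(target)]


def copies_after(cycles, starting_copies=1):
    """Copy number after a number of cycles, assuming perfect doubling."""
    return starting_copies * 2 ** cycles


# Hypothetical template and primers.
template = "GGGG" + "ATGCCGTA" + "TTTTCCCCAAAA" + "GGCATTAC" + "CCCC"
forward_primer = "ATGCCGTA"
reverse_primer = "GTAATGCC"  # reverse complement of GGCATTAC

amplicon = in_silico_pcr(template, forward_primer, reverse_primer)
print(amplicon, len(amplicon))
print(copies_after(30))
```

Sequence outside the primer pair (the flanking G and C runs in the toy template) is not part of the product.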
DNA sequences amplified via PCR typically contain an
extra A on the 3’-end of the molecule, i.e. a single
overhanging 3’-A that makes ligation of the PCR
amplified fragment into a PCR cloning vector much
easier (see Figure 4.8).
Figure 4.8. PCR Cloning vectors. Note that the vector comes linearized
with overhanging 3’-T’s. PCR products typically have single overhanging
A’s at their 3’-ends. This provides a convenient way of making
a circular plasmid with the inserted PCR product. In the example shown,
DNA ligase joins the pGEM-T Easy PCR vector (3015 bp) to a PCR amplified
DNA (1191 bp) to give a recombinant plasmid of 4206 bp.
4.3.2. Cloning in a Simple Cloning Vector
DNA cloning is the basis for a number of genomic biology
experiments. Large amounts of DNA are needed for
analysis, sequencing, and numerous experimental
approaches. As we saw above, multiple copies of a
known DNA sequence can be made and cloned using
PCR and a PCR vector. However, an alternative is
necessary when the sequence to be cloned is unknown
(i.e. PCR primers cannot be designed). To introduce
this principle we will outline the steps to clone a DNA
fragment of unknown sequence in a simple cloning
vector.
To get multiple copies of a gene or other piece of
DNA you must isolate, or ‘cut’, the DNA from its source
using restriction enzymes, and then ‘paste’ it into a
simple cloning vector that can be amplified in a host cell,
typically E. coli.
The five main steps in DNA cloning are:
Step 1. DNA is purified from the donor cells using a
standard DNA purification technique.
Step 2. A chosen fragment of DNA is ‘cut’ from the
purified genomic DNA of the source organism using
a restriction enzyme.
Step 3. The piece of DNA is ‘pasted’ into a vector and
the ends of the DNA are joined with the vector DNA
by DNA ligase (the enzyme that joins Okazaki fragments,
described in the DNA replication section).
Step 4. The vector is introduced into a host cell,
often a bacterium, by a process called bacterial
transformation. The transformed host cells copy the
vector DNA + recombinant DNA along with their own
DNA, creating multiple copies of the inserted DNA. DNA
that has been ‘cut’ and ‘pasted’ from an organism into a
vector is called recombinant DNA. Because of this, DNA
cloning is also called recombinant DNA technology.
Step 5. The vector DNA is isolated (or separated)
from the host cells’ DNA and purified.
Figure 4.9. Insertion of restricted DNA into a simple cloning vector.
4.3.3. Cloning DNA in Expression Vectors
In section 4.2, we discussed expression vectors, and
showed that when a restricted DNA sequence is cloned
in an expression vector, it can be ligated into the vector
in either a “forward” or a “reverse” orientation (Figure
4.4). In the forward orientation the fragment is
positioned so that it makes an mRNA that codes for a
protein, while in the reverse orientation the DNA
fragment does not make an mRNA, but makes an RNA
from the opposite strand called an antisense RNA.
It is possible using a PCR strategy to insert a DNA
fragment into an expression vector such that it can only
insert in the “forward” orientation. This strategy is
shown in Figure 4.10.
Figure 4.10. Using PCR to obtain only the forward orientation of
a sequence in an expression vector. Primers are designed with a
restriction site added such that they anneal at each end of the
fragment of interest. Following PCR, an amplified fragment will be
produced with a KpnI site at the 5’ end of the intended coding
sequence and a SalI site at the 3’ end. The expression vector is
then opened by cutting with both KpnI and SalI. Since the KpnI
site is closer to the promoter in the expression vector’s MCS,
while the SalI site is closer to the terminator, this construct will
go into the vector in the sense orientation so that a message is
produced that makes the protein of interest rather than its
antisense equivalent.
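The primer design described for Figure 4.10 can be sketched as follows. This is a simplified illustration under several assumptions: the coding sequence is invented, the function names are hypothetical, and real primers usually also carry a few extra bases 5’ of the restriction site so the enzyme can cut near the end of the fragment. KpnI recognizes GGTACC and SalI recognizes GTCGAC.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(_COMPLEMENT)[::-1]


KPNI_SITE = "GGTACC"  # placed at the promoter-proximal (5') end
SALI_SITE = "GTCGAC"  # placed at the terminator-proximal (3') end


def design_tailed_primers(coding_seq, anneal_len=18):
    """Add a restriction-site tail to the 5' end of each primer."""
    forward = KPNI_SITE + coding_seq[:anneal_len]
    reverse = SALI_SITE + reverse_complement(coding_seq[-anneal_len:])
    return forward, reverse


def pcr_product(coding_seq):
    """Top strand of the amplified fragment, including both added sites."""
    return KPNI_SITE + coding_seq + reverse_complement(SALI_SITE)


# Hypothetical 30-nt coding sequence (start codon ... stop codon).
coding = "ATGGCTGCTAAAGGTGAAGAACTGTTCTAA"
forward, reverse = design_tailed_primers(coding)
product = pcr_product(coding)
print(forward)
print(reverse)
print(product)
```

Because the two ends of the product carry different sites, a vector cut with both KpnI and SalI can accept the fragment in only one orientation.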
4.3.4. Making Complementary DNA (cDNA)
A double stranded DNA copy of an mRNA is called a
cDNA. Making cDNA is a way to convert a relatively
labile single-stranded RNA into a relatively stable
double-stranded DNA. It is possible to make a DNA copy
of an RNA by employing an enzyme called reverse
transcriptase, which is involved in the replication of certain
viruses. The other aspect of Eukaryotic mRNAs that makes
producing cDNAs relatively facile is the polyA tail, as we
will see below. cDNAs can be made in several ways, but the
method described here is a traditional method.
Step 1. Total RNA is extracted from cells using a
standard technique for the organism in question.
Step 2. An oligo-dT primer is hybridized with the
polyA tail of a Eukaryotic mRNA. Then an enzyme called
reverse transcriptase (which makes a DNA strand from an RNA
strand) is used to make a first-strand DNA copy of the
mRNA strand.
Step 3. The RNA is then partially degraded with
RNase H, leaving RNA fragments annealed at random positions
along the newly made DNA strand. These RNA fragments act as
primers for DNA polymerase I.
Step 4. DNA polymerase I is then used to make a
complementary DNA strand, and to replace the RNA
primers with DNA nucleotides.
Step 5. All pieces are then ligated together using
DNA ligase, completing the synthesis of a double
stranded DNA copy of the mRNA.
Figure 4.11. The process for making cDNA in a simple cloning vector.
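At the sequence level, the outcome of these steps can be sketched in Python. The sketch only tracks base pairing; it does not model the enzymes, the RNase H fragments, or ligation, and the short mRNA is an invented example.

```python
RNA_TO_DNA = {"A": "T", "U": "A", "G": "C", "C": "G"}
DNA_TO_DNA = {"A": "T", "T": "A", "G": "C", "C": "G"}


def first_strand(mrna):
    """First-strand cDNA (5' to 3'), as made by reverse transcriptase."""
    return "".join(RNA_TO_DNA[base] for base in reversed(mrna))


def second_strand(first):
    """Second cDNA strand (5' to 3'), complementary to the first strand."""
    return "".join(DNA_TO_DNA[base] for base in reversed(first))


# Hypothetical mRNA: a tiny coding region followed by a polyA tail.
mrna = "AUGGCUUGA" + "A" * 10
first = first_strand(mrna)
second = second_strand(first)
print(first)   # begins with the run of T's laid down from the oligo-dT primer
print(second)  # same sequence as the mRNA, with T in place of U
```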
At completion of the procedure above you will have
prepared a cDNA copy of each mRNA that was present
in the cells from which you extracted the RNA. If there
were 10,000,000 polyA tails on 10,000,000 mRNAs you
should make 10,000,000 cDNAs. In other words if there
were 10,000 mRNAs in the preparation that coded for a
given protein like myosin, but only 500 mRNAs coding
for hexokinase and 10 mRNAs for tyrosyl-tRNA
synthetase, you might expect that your cDNA library of
sequences obtained from the cells you used would have
10,000, 500, and 10 cDNAs for the 3 proteins
respectively. The frequency of occurrence of each mRNA
is represented by the frequency of cDNAs in the cDNA
library obtained from a given set of cells. Thus,
information about the frequency of occurrence of
mRNAs in cells can be obtained from analysis of such a
cDNA library. A similar cDNA library from different cells
(e.g. different tissues, or cells treated with a drug, or
grown in a different environment, etc.) will show
different levels of each cDNA present based on the
mRNAs found in a tissue. The frequency of mRNAs
found in a tissue is considered information about the
expression of a gene. Gene expression information
relates directly to the function of transcription
machinery in cells, and is critical functional genomic
information, as we will see in a subsequent section of
the book.
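The idea that cDNA frequencies mirror mRNA frequencies can be made concrete with a short calculation. The first set of counts is the example from the text; the second tissue and the function names are hypothetical and serve only to show a comparison between libraries.

```python
def relative_abundance(cdna_counts):
    """Fraction of the library contributed by each gene's cDNAs."""
    total = sum(cdna_counts.values())
    return {gene: count / total for gene, count in cdna_counts.items()}


def fold_change(counts_a, counts_b, gene):
    """Relative abundance of a gene in library B divided by that in library A."""
    return relative_abundance(counts_b)[gene] / relative_abundance(counts_a)[gene]


# Counts from the example in the text.
tissue_a = {"myosin": 10_000, "hexokinase": 500, "tyrosyl-tRNA synthetase": 10}
# Hypothetical second tissue in which myosin mRNA is much rarer.
tissue_b = {"myosin": 1_000, "hexokinase": 500, "tyrosyl-tRNA synthetase": 10}

for gene, fraction in relative_abundance(tissue_a).items():
    print(f"{gene}: {fraction:.4f}")
print(fold_change(tissue_a, tissue_b, "myosin"))
```

Using fractions rather than raw counts makes libraries of different total size comparable.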
In order to store and subsequently utilize a cDNA
library it is useful to produce a clone of each sequence in
the library. Typically this involves putting the cDNAs
into vectors, and putting the vectors into host cells,
typically E. coli, such that each cell gets a single cDNA
which is amplified in that cell and all its clones.
4.3.5. Cloning a cDNA Library
A cDNA clone library is a useful tool to identify
specific mRNAs found in a tissue and to obtain the
sequences of identified genes. To create one, all of the
cDNAs are cloned into a vector, and one vector containing
an individual cDNA is put into each cell. These cells can
then be screened to determine which clones express genes
of interest.
Various types of vectors can be used to create a
cDNA clone library. These include phage expression
vectors, plasmid expression vectors, or shuttle vectors
depending on the intended use of the clone library. We
will look at a protocol for incorporation of cDNA into a
plasmid expression vector, using a simple strategy. Note
that kits are now available that provide everything you
require and outline specific strategies for most types of
vectors should you ever need to accomplish this task.
Step 1. Prepare a cDNA library as outlined in section
4.3.4.
Step 2. Manipulate the cDNAs so that each one
has a unique (not contained in any cDNA) restriction site
at both ends. To do this, the cDNAs are frequently
methylated with a specific methyl transferase that
incorporates a methyl group into particular restriction
sites to protect them from the restriction enzyme that
will be used later.
Step 3. A synthetic double stranded oligonucleotide
linker is then ligated to the ends of each cDNA. The linker
should correspond to a restriction site in the MCS of the
vector to be used. Blunt end ligation is generally a low
efficiency process; but, by using a high concentration of
these synthetic oligonucleotides, it is possible to drive
the reaction to near completion.
Step 4. Digest the cDNAs (with internal sites
protected and linkers attached) with the restriction
enzyme to generate the appropriate overhanging sticky
ends.
Step 5. Mix the digested cDNAs with the predigested
vector, and add DNA ligase to make cDNA
recombinant vectors.
Step 6. Transform the recombinant vectors into host
cells, and grow up clones.
Figure 4.12. Procedure for inserting a cDNA into a cloning
vector, involving ligation of linkers on the ends of the cDNA.
Once the cDNA clone library has been constructed, a
number of strategies can be used to select a specific
clone that contains a gene of interest. Figure 4.13
demonstrates how this could be done if antibodies
against the protein of interest are available. Figure 4.14
shows a strategy for identifying a specific clone by
complementation of a yeast mutant. Note that for this
technique the cDNA library was constructed in a yeast
shuttle vector.
Because cDNAs are the exons of the gene (parts that
code for proteins) a cDNA clone library can be expressed
in either Prokaryotic or Eukaryotic cells. However, there
are sometimes (but relatively infrequently) complex
issues that keep Eukaryotic cDNAs from expressing
functional proteins in Prokaryotic cells. When this
occurs the shuttle vector approach is necessary to get a
functional protein produced in the library.
cDNA libraries have many uses. Comparison of
cDNA sequences with the sequences of the corresponding genes
is one way of demonstrating the positions of introns and
exons in the genomic sequence (see Figure 4.15). By
sequencing clones from a cDNA library, so called
expressed sequence tags (ESTs) are determined. The
sequences of ESTs were critical to understanding
functional components of genomes as they were being
sequenced.
Figure 4.13. Finding a specific cDNA clone using an expression
library. Following transformation of cells with the cDNA
expression library, transformants with inserts (white colonies)
are selected, replated, and screened with antibodies against the
protein of interest. Colonies producing antigenic proteins are
then tested for the presence of the protein of interest and the
cDNA insert in that clone is characterized.
Figure 4.14. Strategy for identifying cDNA clones for a gene of
interest (ARG1) using cDNAs (high MW DNA) from an ARG1 yeast
strain. Note the cDNAs need to be inserted into a yeast shuttle
vector such that the ARG1 gene will be properly expressed and
complement the arg1 mutant in the yeast strain used.
Figure 4.15. Primary RNA Transcript. Labels in the figure: DNA
(gene); primary RNA transcript; mRNA (cDNA).
4.4. GENOMIC LIBRARIES
A genomic clone library or Genomic Library is a set of
cloned sequences made by cloning the entire genome of
an organism or organelle. One of several ways this can
be done is by cutting the genomic DNA with one or more
restriction enzymes, and ligating the pieces into a simple
cloning vector as shown in Figure 4.9. A limitation of
simple cloning vectors is the size of DNA that can be
introduced into the cell by transformation. This presents
problems when you are trying to create a Genomic
Library of a large genome such as that of most
Eukaryotes.
Remember that a genomic library contains all of the
DNA found in the cells of the organism. If you digest
CONCEPTS OF GENOMIC BIOLOGY
Page 4-17
organismal DNA to completion with a restriction
enzyme, ligate those fragments into a plasmid vector
and transform bacterial cells, only a portion of those
fragments will be represented in the final
transformation products. If a gene of interest is larger
than the clonable fragment length, then you will not be
able to isolate that gene intact from a plasmid library.
But what can be done to increase the probability of
obtaining a clone that contains the entire gene? First you
need to use a vector that can accept large fragments of
DNA. Examples of these are bacteriophage and cosmid
vectors, and the relatively popular yeast artificial
chromosome (YAC) vectors (see Figure 4.7b) and the
bacterial artificial chromosome (BAC) vectors (see Figure
4.7a). While longer fragments of genomic DNA can be
cloned in YAC vectors, these are less stable than the BAC
vectors, making BACs the vectors most frequently used
for genomic cloning.
4.4.1. Cloning in YAC Vectors
A goal of genomic sequencing is to obtain physical
data about the genomic organization of DNA in a
genome. Traditionally, this data has been obtained by a
technique called chromosome walking. Walking can be
performed by subcloning the ends of DNA inserted in a
phage vector or cosmid vector and screening a library
for new clones that contain the end-sequences
previously obtained. If this new clone overlaps a portion
of the original clone, then the length of the DNA of
interest is extended by the length of DNA in the second
clone that is not found in the original clone. By
performing these steps successive times, a long distance
map can be obtained. To clarify this concept, please view
the Chromosome Walking short video.
This technique, though, has difficulties. First, each
step is technically slow. Second, if you use phage or
cosmid clones, you might only extend the region of
interest by 5-10 kb in each step of the walk. Finally, if
any of the clones that are obtained contain repeated
sequences, the subclone could lead you to another
region of the genome that is not contiguous with the
region of interest. This is because eukaryotic genomes
have so-called repeated sequence DNA interspersed
throughout their genomes.
Yeast artificial chromosomes can alleviate some of
these problems because of the large (100-1000kb)
amount of DNA that can be cloned. However, YACs
cannot speed up each step of the walk because the
subcloning and screening steps cannot be accelerated.
But YACs can easily extend the region of interest by 50-100 kb and up to as much as 500 kb per walking cycle.
Thus a long distance map of the region can be obtained
in several steps. Secondly, although repetitive regions
CONCEPTS OF GENOMIC BIOLOGY
Page 4-18
may be 10-20 kb in length, they are rarely longer than
50 kb. Thus a YAC with 100 kb will contain some region
that is single copy, which can be used for further steps in
the walk.
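The effect of insert size on the length of a walk can be made concrete with a little arithmetic. The sketch below is only illustrative: the function name and the 1,000 kb target distance are assumptions made for the example, while the per-step extensions are the figures quoted above.

```python
import math

def walking_steps(distance_kb: float, extension_kb: float) -> int:
    """Number of walking cycles needed to cover distance_kb
    when each cycle extends the map by extension_kb."""
    if distance_kb <= 0 or extension_kb <= 0:
        raise ValueError("distances must be positive")
    return math.ceil(distance_kb / extension_kb)

# Hypothetical 1,000 kb region to be walked (an assumed value).
target_kb = 1000
for vector, step_kb in [("phage or cosmid, 5 kb per step", 5),
                        ("phage or cosmid, 10 kb per step", 10),
                        ("YAC, 50 kb per step", 50),
                        ("YAC, 100 kb per step", 100)]:
    print(f"{vector}: {walking_steps(target_kb, step_kb)} steps")
```

Because every step involves the same slow subcloning and screening work, reducing the number of steps is where YACs save time.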
While YACs allow the cloning of the largest fragments
possible, their relative instability has allowed the more
stable BACs, which bear shorter recombinant fragments,
to become the vector of choice for chromosome walking
and subsequent sequencing.
4.4.2. Cloning in BAC Vectors
During the Human Genome Project, researchers had
to find a way to reduce the entire human genome into
chunks, as it was too large to be sequenced in one go.
To do this they created a store of DNA fragments called
a BAC library, specifically a human genome BAC library.
BAC stands for Bacterial Artificial Chromosome.
These are small pieces of bacterial DNA that can be
identified and copied within a bacterial cell and act as a
vector, to artificially carry recombinant DNA into the cell
of a bacterium, such as Escherichia coli.
In general BAC clones carry inserts of DNA up to
300,000 bp in length. The bacteria are then grown to
produce colonies that contain the same fragment of
DNA in each cell of the colony. This is a BAC clone
library. Individual BAC clone colonies can be stored until
needed.
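How many BAC clones does a useful library need? A standard estimate, not given in this chapter, is the Clarke-Carbon formula N = ln(1 - P) / ln(1 - f), where f is the fraction of the genome carried by one insert and P is the desired probability that any given sequence is present. The sketch below applies it under assumed values: a 3.2 billion bp genome, an average insert of 200,000 bp, and P = 0.99.

```python
import math

def clones_needed(genome_bp: float, insert_bp: float, probability: float) -> int:
    """Clarke-Carbon estimate of the number of random clones needed so that
    a given sequence is in the library with the stated probability."""
    if not 0 < probability < 1:
        raise ValueError("probability must be between 0 and 1")
    if not 0 < insert_bp < genome_bp:
        raise ValueError("insert must be smaller than the genome")
    fraction = insert_bp / genome_bp          # part of the genome per clone
    return math.ceil(math.log(1 - probability) / math.log(1 - fraction))

# Assumed example values (not taken from a specific library).
n = clones_needed(3_200_000_000, 200_000, 0.99)
print(f"about {n:,} BAC clones")
```

The formula assumes that inserts are random and equally likely to be cloned, which real libraries only approximate.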
Making a BAC library
To make a genomic Bacterial Artificial Chromosome
(BAC) library:
Step 1. Isolate the cells containing the DNA you want
to store. For animals BAC libraries come from white
blood cells.
Step 2. These isolated cells are then mixed with
warm agarose, a jelly-like substance. The whole mixture
is then poured into a mold and allowed to cool to
produce a set of small blocks, each containing thousands
of the isolated cells.
Step 3. The cells are then treated with enzymes to
dissolve their cell membranes and release the DNA into
the agarose gel. A restriction endonuclease is used to
chop the DNA into pieces around 200,000 base pairs in
length (partial digestion versus complete digestion
producing smaller fragments).
Step 4. These blocks of gel containing chopped up
DNA are then inserted into holes in a slab of agarose gel.
The DNA fragments are then separated according to size
by electrophoresis.
Step 5. Fragments of a particular size class (200,000
to 300,000 bp) are selected, removed from the agarose gel,
and inserted into a BAC vector using DNA ligase to join
the two pieces of DNA together. This produces a set of BAC
clones.
Step 6. The BAC clones are added to bacterial cells,
usually E. coli, and the bacteria are then spread on
nutrient rich plates that allow only the bacteria that
carry BAC clones to grow. The bacteria grow rapidly,
resulting in lots of bacterial cells, each containing a copy
of a separate BAC clone.
Step 7. After they have grown, the bacteria are then
‘picked’ into plates of 96 or 384 wells so that each well
contains a single BAC clone.
The bacteria can also be copied or frozen and kept
until researchers are ready to use the DNA for
sequencing. A BAC library has been created.
Figure 4.16. BAC Vector. Contains blue/white screening capability.
Genomic DNA fragments up to 300,000 bp can be ligated into the MCS
of the vector which also contains a selectable marker and an F’ single
copy origin of replication.
4.5. DNA SEQUENCING
The original techniques for sequencing DNA
molecules were developed by Fred Sanger in the 1970’s.
Sanger’s method, which we will look at in section 4.5.2,
relies on determining the last nucleotide added as DNA
polymerase is copying a DNA molecule, and then
separating the resulting fragments, which differ in
length by a single nucleotide, using a technique
known as electrophoresis.
From Sanger’s original work, the process was
automated, and such robotic sequencers were used to
generate the first human genome sequence obtained by
the original Human Genome Project. Subsequently,
sequencing technology has been dramatically changed
to both lower the cost of sequencing and increase the
speed of sequencing using so called “Next Generation
Sequencing”.
We will look at these techniques in today’s lab.
4.5.1. Electrophoresis
Nucleic acid electrophoresis is an analytical
technique used to separate DNA or RNA fragments by
size and reactivity. Nucleic acid molecules to be
analyzed are separated in a viscous medium, typically a
gel of some type. An electric field is applied across the
gel, causing the nucleic acids to migrate toward the
anode due to the net negative charge of the sugar-phosphate backbone of the nucleic acid chain. The
separation of nucleic acid fragments is accomplished by
exploiting the different mobility of different sized
molecules as they are passing through the gel. Longer
molecules migrate more slowly because they experience
more resistance within the gel. Smaller fragments
migrate further in the same time and end up nearer to
the anode than longer ones (see figure 4.17).
Figure 4.17. Electrophoretogram showing the migration of smaller
molecules to the anode (+) at the bottom of the gel. The molecules to
be separated are loaded at the top of the gel near the cathode (-).
Larger molecules remain near the cathode. On the right side of the gel,
a set of molecules of known molecular size (length) are run. By
comparing the mobility of an unknown molecule with the molecules of
known length, the size of the unknown fragments can be estimated.
For highest resolution of similar-sized fragments, as
required for DNA sequencing, either the voltage or the run
time can be varied. Extended runs across a low voltage
gel yield the most accurate resolution, and sequencing
gels can be 1 m in length.
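The size estimate described in the caption of Figure 4.17 can be sketched in code. A common working approximation, assumed here rather than taken from this chapter, is that the logarithm of fragment length falls roughly linearly with migration distance between neighbouring marker bands. The ladder distances and sizes below are invented for illustration.

```python
import math

def estimate_size(distance_mm: float, ladder: list) -> float:
    """Estimate fragment length (bp) from migration distance by interpolating
    log10(size) between the two nearest marker bands.
    ladder: list of (distance_mm, size_bp) pairs for the marker lane."""
    points = sorted(ladder)                   # order markers by distance run
    for (d1, s1), (d2, s2) in zip(points, points[1:]):
        if d1 <= distance_mm <= d2:
            t = (distance_mm - d1) / (d2 - d1)
            log_size = math.log10(s1) + t * (math.log10(s2) - math.log10(s1))
            return 10 ** log_size
    raise ValueError("band lies outside the range covered by the ladder")

# Hypothetical marker lane: larger fragments travel a shorter distance.
ladder = [(10, 10_000), (20, 1_000), (30, 100)]
print(round(estimate_size(15, ladder)), "bp")
```

The estimate is only as good as the ladder: bands outside the marker range are rejected rather than extrapolated.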
4.5.2. Sanger Dideoxy Sequencing
The method of DNA sequencing invented by Fred
Sanger is a truly revolutionary technique. He was
rewarded for his ingenuity with the Nobel Prize in 1980.
The specific steps of Sanger’s method are given
below. Note that you can also view a video that
describes this process:
Step 1. The DNA double helix is ‘denatured’ (broken
down) with heat or chemicals to separate the two
strands. These will then act as templates for DNA
synthesis using DNA polymerase and a primer similar to
what is used in PCR.
Figure 4.18. a) A regular deoxynucleotide triphosphate (dNTP) with a
3’-OH group. b) A dideoxynucleotide triphosphate (ddNTP). Since ddNTPs
have no 3’-OH group, it is not possible for DNA polymerase to add more
nucleotides to the growing nucleotide chain and DNA synthesis is
terminated at that base.
Step 2. To the mixture of template and primer, DNA
polymerase and dNTPs (the nucleotide bases dA, dC, dG and dT)
are added. One or more of these bases is radioactively
labelled so that any DNA that is synthesised can be
detected.
Step 3. Once the sequencing reaction has begun,
versions of the dNTPs containing a hydrogen atom on
both the 2’ and 3’ carbons of deoxyribose (see Figure
4.18), known as dideoxynucleotides (ddNTPs) or chain
terminators, are also added in small amounts. Four
identical reactions are run at the same time, but ddA is added
to one, ddG to the second, ddC to the third, and ddT to
the last reaction. Terminators stop DNA synthesis since
they lack a 3’-OH group for the next nucleotide to fasten
to. So, the 'A' terminator will stop DNA synthesis when
an 'A' base is added (the 'C' terminator will stop DNA
synthesis when a 'C' base is added, and so on).
Step 4. This results in a mixture of pieces of
radioactive DNA of various lengths but all ending in the
same base, i.e. the ddBase added to each reaction.
Step 5. The four different reactions are then loaded
on to separate lanes of an acrylamide gel and the DNA
pieces separated according to size by a process called
electrophoresis (see section 4.5.1).
Step 6. Upon completion
of the electrophoresis, the
radioactively labeled DNA is
then visualized by exposing
the gel to X-ray film. The
radioactively labelled DNA will
make the film turn black at a
position corresponding to its
position in the gel. This
exposed film is called an
autoradiogram.
Figure 4.19. Four sequencing reactions terminated with ddA, ddC,
ddG, and ddT are loaded onto a gel, and after fragments are
separated, an autoradiogram demonstrates the positions of the
fragments with known end nucleotides.
Each band on the film corresponds to where a
specific ddBase was added in each of the reactions (ddA,
ddC, ddG or ddT). You can therefore read off the
sequence of the DNA from the bottom of the film since
you know the nucleotide that must be at the end of each
fragment.
Figure 4.20. A Sanger dideoxy sequencing gel showing results
for 10 sequences (x4 reactions).
Note that this technique was very popular in
its day, but it has several major drawbacks including: 1)
the necessity of using radioactivity; 2) eye strain from
reading the gel manually, leading to frequent errors; 3)
fragments near the top of the gel cannot accurately be
read, and in general discontinuities in the gel can create
errors; 4) the method was not easily automated because
it was tedious and time consuming. In general, with
great effort it was possible to obtain about 500-700 nt
of sequence from most gels, and this often took months to
obtain.
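The logic of steps 3 to 6 and of reading the film from the bottom can be illustrated with a small simulation. This is an idealized sketch, not a model of the real chemistry: it assumes every possible terminated fragment is produced and resolved, and the function names are invented for the example.

```python
def sanger_reactions(new_strand: str) -> dict:
    """For each of the four ddNTP reactions, list the lengths of the
    terminated fragments. A fragment of length n ends in the base found
    at position n of the newly synthesized strand."""
    lanes = {base: [] for base in "ACGT"}
    for position, base in enumerate(new_strand, start=1):
        lanes[base].append(position)
    return lanes

def read_gel(lanes: dict) -> str:
    """Read the gel from the bottom (shortest fragment) to the top,
    noting which lane each successive band appears in."""
    bands = [(length, base)
             for base, lengths in lanes.items()
             for length in lengths]
    return "".join(base for _, base in sorted(bands))

lanes = sanger_reactions("GATTACA")
print(lanes)            # band positions in the ddA, ddC, ddG and ddT lanes
print(read_gel(lanes))  # sequence recovered by reading bottom to top
```

Each fragment length occurs in exactly one lane, which is why the four lanes together spell out the sequence one nucleotide at a time.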
Imagine that this was the “state of the art” at the time
the Human Genome Sequencing Project began.
Obtaining 3.2 billion bp of human sequence taking 3
man-months per 700 bp would require about 1 million
man-years of labor. Thus, improved technology was
required to make the project successful. Though not
really appreciated by the general population, this
project was the biological equivalent of putting a man
on the moon.
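The labor estimate quoted above can be checked with a few lines of arithmetic. The sketch uses only the figures given in the text (3.2 billion bp, 700 bp per gel, 3 months per gel); the function name is illustrative.

```python
GENOME_BP = 3_200_000_000   # size of the human genome used in the text
BP_PER_GEL = 700            # sequence obtained from one gel
MONTHS_PER_GEL = 3          # labor per gel, as stated in the text

def manual_sequencing_years(genome_bp: float, bp_per_gel: float,
                            months_per_gel: float) -> float:
    """Years of labor needed to read a genome once, one gel at a time."""
    gels = genome_bp / bp_per_gel
    return gels * months_per_gel / 12

years = manual_sequencing_years(GENOME_BP, BP_PER_GEL, MONTHS_PER_GEL)
print(f"{years:,.0f} years of labor")
```

The result is a little over one million years of labor, and it ignores the extra coverage needed to catch errors and close gaps.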
4.5.3. Capillary Sequencing
Two significant innovations made it possible to
automate DNA sequencing, reduce costs, and increase
efficiency making whole genome sequencing of virtually
any genome a reality.
The first of these innovations was the addition of
fluorescent chromophores to the dideoxy NTPs. These
chromophores are attached such that different
chromophores are attached to each base, and each
chromophore fluoresces at a different color. This means
that only one reaction is needed instead of four, and as
the differently colored ddNTPs terminate the reactions
the molecules will have different fluorescent colors
depending on the terminating nucleotide (see Figure
4.21, left panel).
The second innovation was the replacement of gel
electrophoresis, with electrophoresis through long thin
acrylic-fiber capillaries (tubes with very narrow pores
through which liquids can pass). These capillaries are far
more uniform and consistent as an electrophoresis
medium, and because they are less temperature
sensitive higher voltages can be employed making
separation faster and more reproducible. Additionally a
laser can be used to generate the fluorescence and this
can be done while the nucleotides remain in the
capillary.
In capillary sequencing machines, DNA fragments are
separated by size through a long, thin, acrylic-fibre
capillary. A sample containing fragments of DNA labeled
with the different chromophores described above is
injected into the capillary. Once the sample has been
injected, an electric field can be applied, to drive the
DNA fragments through the capillary toward the anode
as in gel electrophoresis.
A fluorescence-detecting laser, built into the
automated sequencing machine, then shoots through
the capillary fiber at the end, causing the colored tags
on the DNA fragments, to fluoresce. Each fluorescent
terminator base produces a different color: A = Green, C
= Blue, G = Yellow and T = Red. The color of the
fluorescent bases is detected by a camera as they
migrate through the capillary, and the bases are
recorded by the sequencing machine as the
electrophoresis proceeds. The colors of the bases are
then displayed on a computer as a graph of different
colored peaks (see Figure 4.21, right panel).
Figure 4.21. On the right is a capillary sequencer trace showing the nucleotides seen by the laser scan. On the left is a classical gel made using
fluorescent nucleotides rather than radioactivity to demonstrate the principle of the capillary sequencer.
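Turning colored peaks into a sequence amounts to asking, for each peak, which color is brightest. The sketch below assumes a simplified trace in which each peak has already been detected and is given as four intensities; real base-calling software also has to find the peaks and judge their quality. The color-to-base mapping is the one given above, and the intensity values are invented.

```python
# Colour assigned to each terminator base, as described in the text.
COLOR_TO_BASE = {"green": "A", "blue": "C", "yellow": "G", "red": "T"}

def call_bases(peaks: list) -> str:
    """peaks: one dict per detected peak, in order of arrival at the
    detector (shortest fragment first), mapping colour -> intensity.
    The brightest colour in each peak determines the base call."""
    calls = []
    for intensities in peaks:
        brightest = max(intensities, key=intensities.get)
        calls.append(COLOR_TO_BASE[brightest])
    return "".join(calls)

# Hypothetical trace with three peaks.
trace = [
    {"green": 900, "blue": 30, "yellow": 20, "red": 10},
    {"green": 15, "blue": 10, "yellow": 25, "red": 800},
    {"green": 40, "blue": 20, "yellow": 700, "red": 35},
]
print(call_bases(trace))
```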
This technology is readily amenable to
mechanization, and modern capillary sequencers can
dependably run dozens of samples in parallel through
multiple capillaries simultaneously. Also the process is
much faster, and thus multiple runs can be made daily
through each capillary. The robots automating these
sequencers also work around the clock, and data is collected and
stored directly with no tedious human gel reading involved.
Sequencing the 3.2 billion bases of the human genome took
about 10 years at a cost of approximately $3 billion.
Today we have even faster sequencers that do not
use electrophoresis, and generate sequences even
faster and more inexpensively. This is ….
4.5.4. Next Generation Sequencing
Next-generation sequencing (NGS) is a fundamentally different approach to DNA sequencing, cutting the
time and cost needed to sequence a genome. Using
capillary sequencing it costs about $1 million to
sequence 1 million bp, and it took about 10 years to
sequence the first human genome. NGS costs about
$0.60 per million bp, and can do the job in about 1 day.
The principles of NGS are in some ways similar to
capillary sequencing where the bases of a small section
of DNA are identified and recorded. However, rather
than being limited to just a few DNA fragments, next-generation sequencing extends this process so that
millions of samples can be sequenced, all at the same
time. For this reason it is sometimes called massively
parallel sequencing (MPS). As a result, large amounts of
DNA can be sequenced at rapid speed. With some next-generation sequencing machines researchers can
sequence more than five human genomes per machine
in just under a week.
Next-generation sequencing gives scientists the
ability to compare the genomes of many different
individuals. With the latest technologies, we can study
the genomes from all sorts of people to provide us with
the data needed to compare them and uncover the
genetic causes of cancer, diabetes, schizophrenia and
other diseases. We can also explore the genomes of
things that cause human disease such as viruses,
bacteria and other pathogens.
There are at least 4 different NGS sequencing
technologies. Each has its advantages and
disadvantages, but 2 technologies have emerged as the
most useful, namely Illumina sequencing-by-synthesis and
the Roche 454 sequencing technology. All of the NGS
sequencing technologies share several features, as
illustrated in a video (click link); these are:
1. Sample preparation. Fragments of uniform
length are generated and adapter sequences are
ligated onto the ends of the fragments.
2. Attachment of sequences to a matrix using a
technique called “bridge PCR” that amplifies a
sequence in a specific region of the support
matrix in a cluster. This produces millions to
billions of sequence locations where specific
clusters of sequences are attached to a solid
support matrix.
3. Raw sequence data collection is accomplished by
various techniques depending on the particular
technology that is employed. In general the data
collection process records the sequence being
generated from each cluster at each of the
millions of locations on the matrix simultaneously,
and saves these sequences for subsequent
analysis.
Each sequencing technology involves different
chemistry leading to the generation of sequences. The
specific chemistries that can be used include:
pyrosequencing chemistry used by Roche 454
Sequencers, sequencing-by-synthesis chemistry used by
Illumina sequencers, ion semi-conductor sequencing
used by Ion Torrent Sequencers, and sequencing-by-ligation used by ABI SOLiD sequencers (this technology is
no longer available although it is described in the video
above).
Note that each of these sequencing technologies
delivers millions to billions of base pairs of reads in a
relatively short period of time (days), and does so at
varying, but relatively low costs per base sequenced.
Read length varies according to the technology used,
but typically 100 to 400 bases are obtained per read.
The data generated are very large data files that must be
used to generate the longer genomic or cDNA
sequences that are biologically meaningful.
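Read length and genome size together determine how many reads a project needs. A common rule of thumb, assumed here and not stated in this chapter, is that average coverage equals (number of reads x read length) / genome size. The 30-fold target below is an illustrative choice, not a recommendation from the text.

```python
def reads_needed(genome_bp: int, read_length: int, coverage: int) -> int:
    """Number of reads required so that, on average, every base of the
    genome is covered `coverage` times (rounded up to a whole read)."""
    if genome_bp <= 0 or read_length <= 0 or coverage <= 0:
        raise ValueError("all arguments must be positive")
    total_bases = genome_bp * coverage
    return -(-total_bases // read_length)     # integer ceiling division

# Assumed example: 3.2 billion bp genome, 30-fold average coverage.
for read_length in (100, 400):
    n = reads_needed(3_200_000_000, read_length, 30)
    print(f"{read_length} bp reads: {n:,} reads")
```

Average coverage says nothing about how evenly the reads fall, so some regions will still be covered thinly or not at all.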
NGS technology regardless of type has revolutionized
DNA sequencing, but simultaneously places a burden on
available computational technology in order to assemble
billions of short reads into whole genomic sequences.
Nevertheless, the ability to generate such massive
amounts of sequence has made this a very successful
technology.
4.5.5. Third Generation Sequencing
Although this technology is still emerging, it could soon
be a reality further advancing the role of DNA
sequencing in all branches of the life sciences.
With third generation sequencing, sequencing a
genome will become a cheaper, faster and more
sophisticated process. No sooner had next-generation
sequencing reached the market than a third generation
of sequencing was being developed.
One of these new technologies was developed by
Pacific Biosciences and is called Single-Molecule
Sequencing in Real Time (SMRT). This system involves a
single-stranded molecule of DNA that attaches to a DNA
polymerase enzyme. The DNA is sequenced as the DNA
polymerase adds complementary fluorescently-labelled
bases to the DNA strand. As each labelled base is added,
the fluorescent color of the base is recorded before the
fluorescent label is cut off. The next base in the DNA
chain can then be added and recorded.
Sequencing the human genome in this way won’t be
possible for a while, but when it is, scientists predict that
it will be possible to sequence an entire human genome
in about an hour. Imagine the clinical applications of this
technology. A doctor or pharmacist may be able to
identify a critical gene that leads to an accurate drug
prescription by sequencing your genome in the office
while you wait.
SMRT is very efficient, which means that fewer
expensive chemicals have to be used. It is also incredibly
sensitive, enabling scientists to effectively ‘eavesdrop’
on DNA polymerase and observe it making a strand of
DNA.
SMRT can generate very long reads of sequence (10-15 kilobases) from single molecules of DNA, very quickly.
Producing long reads is very important because it is
easier to assemble genomes from longer fragments of
DNA.
With the introduction of such sensitive and cheap
sequencing methods scientists can now begin to re-sequence genomes that have already been sequenced
to achieve a higher level of accuracy. For example, using
SMRT, Escherichia coli has now been sequenced to an
accuracy of 99.9999 per cent!
Figure 4.20. A graph showing how the speed of DNA sequencing
technologies has increased since the early techniques in the 1980s.
Image credit: Genome Research Limited.
4.6. DNA SEQUENCING STRATEGIES
Beyond the method for generating DNA sequences, it
is necessary to have a strategy for how to employ DNA
sequencing technology. Strategies for DNA sequencing
depend on the features and size of the genome that is
being sequenced and the available technology for doing
the sequencing. As part of the Human Genome Project
two general approaches emerged as most useful and
valuable. One of these strategies, the map-based
approach, was employed by the publicly funded
sequencing effort that involved scientists from around
the world. The other strategy, developed by a
privately funded group at Celera Genomics and called whole
genome shotgun sequencing, was perhaps faster and
cheaper than the map-based approach, but does not
work efficiently with large genomes, though it is very
useful for smaller genomes. In fact today these
approaches are “hybridized” or combined to obtain the
advantages of both strategies.
4.6.1. Map-based Sequencing
The map-based or clone-contig mapping sequencing
approach was the method originally developed by the
publicly funded Human Genome Project sequencing
effort. The rationale for this method is that it is the
“best” method for obtaining the sequence of most
eukaryotic genomes, and it has also been used with
those microbial genomes that have previously been
mapped by genetic and/or physical means. Though it is
relatively slow and expensive, this method provides
dependable high-quality sequence information with a
high level of confidence.
In the clone-contig approach, the genome is broken
into fragments of up to 1.5 Mb, usually by partial
digestion with a restriction endonuclease (section 4.1),
and these are cloned in a high-capacity vector such as a BAC
or a YAC vector (section 4.2.5). A clone contig map is
made by identifying clones containing overlapping
fragments bearing mapped sequence markers. These
markers were originally identified using a combination
of conventional genetic mapping, FISH cytogenetic
mapping, and radiation hybrid mapping. More recently,
common practice has been to use chromosome walking to
build the clone-contig library. In this approach,
sequence markers are generated from BAC ends, and a map of BAC-end sequences is subsequently
made. Ideally the cloned fragments are anchored onto a
genetic and/or physical map of the genome, so that the
sequence data from the contig can be checked and
interpreted by looking for features (e.g. STSs, SSLPs,
RFLPs, and genes) known to be present in a particular
region.
Once the clone library and contig map have been
developed, relevant clones are sequenced using the
shotgun method described below. These sequenced contigs are
then aligned using the markers and overlapping
sequences on the clones to position each clone.
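The idea of joining clones into contigs through shared markers can be sketched as follows. This is a simplified illustration: it assumes the order of the markers on the map is already known and that each clone spans everything between its first and last marker. The clone and marker names are hypothetical.

```python
def build_contigs(clones: dict, marker_order: list) -> list:
    """Group clones into contigs.
    clones: clone name -> set of markers found on that clone.
    marker_order: markers in their known order along the chromosome.
    Two clones are joined when their marker spans overlap."""
    position = {marker: i for i, marker in enumerate(marker_order)}
    spans = sorted(
        (min(position[m] for m in markers),
         max(position[m] for m in markers),
         name)
        for name, markers in clones.items()
    )
    contigs, current, current_end = [], [], -1
    for start, end, name in spans:
        if current and start > current_end:   # gap: close the open contig
            contigs.append(current)
            current, current_end = [], -1
        current.append(name)
        current_end = max(current_end, end)
    if current:
        contigs.append(current)
    return contigs

# Hypothetical clones and markers.
clones = {"BAC1": {"M1", "M2"}, "BAC2": {"M2", "M3"}, "BAC3": {"M5", "M6"}}
print(build_contigs(clones, ["M1", "M2", "M3", "M4", "M5", "M6"]))
```

Here BAC1 and BAC2 share marker M2 and form one contig, while BAC3 stands alone until a clone carrying M4 is found to bridge the gap.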
Figure 4.22. Clone contig mapping of a series of YAC clones containing
human DNA.
Figure 4.21. Schematic diagram of sequencing strategy used by the
publicly funded Human Genome Project. The DNA was cut into 150
Mb fragments and arranged into overlapping contiguous fragments.
These contigs were cut into smaller pieces and sequenced completely.
4.6.2. Whole Genome Shotgun Sequencing
In the whole genome shotgun approach, smaller,
randomly generated fragments (1,500-2,000 bp) were
cloned and sequenced. These sequences
were then assembled based on random overlap into a
genome sequence. Typically, some regions are not well
sequenced, and specific sequencing is done to fill in the
gaps that cannot be assembled from the randomly made
pieces.
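Assembly by overlap can be illustrated with a toy greedy assembler that repeatedly merges the two reads sharing the longest suffix-to-prefix overlap. This is a teaching sketch under strong assumptions (error-free reads, a single strand, no repeats); production assemblers use far more elaborate methods, and the function names here are invented.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that matches a prefix of b
    (0 if the best match is shorter than min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads: list, min_len: int = 3) -> list:
    """Merge reads pairwise, best overlap first, until no overlaps remain.
    Returns the resulting contigs."""
    reads = list(dict.fromkeys(reads))        # drop exact duplicate reads
    while len(reads) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break                             # remaining reads do not overlap
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads

print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"]))
```

Reads that share no overlap are left as separate contigs, which corresponds to the gaps that must be filled by additional targeted sequencing.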
Figure 4.23. Schematic diagram of sequencing strategy used by Celera
Genomics. The DNA was cut into small pieces and sequenced
completely. These fragments were organized into contigs based on
overlapping sequences.
The shotgun method is faster and less expensive
than the map-based approach, but the shotgun method
is more prone to errors due to incorrect assembly of the
random fragments, especially in larger genomes. For
example, if a 500 kb portion of a chromosome is
duplicated and each duplication is cut into 2 kb
fragments, then it would be difficult to determine where
a particular 2 kb piece should be located in the finished
sequence. This might seem trivial, but duplications
seldom retain their original sequences. They tend to
develop SNPs over time, and this can generate
difficulties in the proper assembly of these duplicated
sequences.
Which method is better? It depends on the size and
complexity of the genome. With the human genome,
each group involved believed its approach was superior
to the other, but a hybrid approach is now being used
routinely. The advent of next generation sequencing
allows the use of fragment-end short read sequencing
with much more powerful computer-based assemblers
generating finished sequences. However, the method
still requires at least some second round sequencing to
obtain a completely sequenced genome.
4.7. GENOME ANNOTATION
Once a genome sequence is obtained via sequencing
using one or more strategies outlined in the preceding
sections, the hard work of deciding what the sequence
means begins. Typically to make such tasks easier some
type of database is created that ultimately shows the
entire sequence, the location of specific genes in that
sequence, and some functional annotation as to the role
that each gene has in an organism. The databases at
NCBI are a critical repository for these types of
information, but there are many other specific and
perhaps more detailed repositories of this type of
information.
The process routinely begins with the
implementation of what is termed a Gene Finding
bioinformatic pipeline. The separate parts of such a
pipeline are described below.
4.7.1. Using Bioinformatic Tools to Identify Putative
Protein Coding Genes
A first approximation of gene locations in the
genomic sequence is usually made using a gene
prediction program to predict gene beginning and
ending points, transcriptional and translational start and
stop sites, intron and exon locations, and polyA addition
sites. Often such programs produce sequences of the
putative transcript produced, and/or the mature mRNA
and protein amino acid sequence coded for by the gene
as well.
Many gene prediction programs are so-called neural
network programs that are capable of “learning” what
algorithms to use to decide the sequence of a gene.
Such programs are trained on known sequences, and
then once trained used to predict gene regions, and
then after predicting, input is given back concerning
errors that were made. As the programs are used they
refine and improve their predictive power.
4.7.2. Comparison of predicted sequences with known
sequences (at NCBI)
Once putative coding genes are predicted, the next
step is to compare the predicted mRNA (cDNA)
sequences with known coding sequences in publicly
available libraries.
This can be done with a number of possible tools, but
one of the best for doing this is the Basic Local
Alignment Search Tool (BLAST) utility at NCBI. By taking
your predicted peptide and/or nucleotide sequence and
submitting it to a BLAST search of the nr (proteins) or nt
(nucleotide) sequence database you can learn what
sequences available at NCBI are most similar to your
sequence. When you do a BLASTP (protein) comparison,
you are also shown conserved domains found in your
protein.
Recall that conserved domains are amino acid
sequences that are conserved in various types of
proteins. Thus, BLAST searches can inform you of a
number of interesting and useful sequence features that
are found in your submitted sequence. Also note that if
a cDNA sequence library or libraries is/are available
from the organism you are working with, and if a related
sequence from a previously cloned gene is available at
NCBI you can also learn about previously known cDNA
or other sequences found in all of the databases at NCBI
from this BLAST search. This becomes a critical method
for learning what your gene does.
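The "local alignment" in BLAST's name refers to finding the best-matching stretch shared by two sequences rather than aligning them end to end. BLAST itself uses a fast heuristic that is not reproduced here; the sketch below instead uses the classic Smith-Waterman dynamic programming method to show what a local alignment score is. The scoring values (match +2, mismatch -1, gap -2) are arbitrary assumptions chosen for the example.

```python
def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Best Smith-Waterman local alignment score between a and b.
    Scores never drop below zero, so a poor region simply restarts
    the alignment instead of dragging the total down."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            current[j] = max(0,
                             diagonal,              # align a[i-1] with b[j-1]
                             previous[j] + gap,     # gap in b
                             current[j - 1] + gap)  # gap in a
            best = max(best, current[j])
        previous = current
    return best

# Two hypothetical sequences that share the stretch ACGT.
print(local_alignment_score("TTACGTTT", "GGACGTGG"))
```

A database search in effect asks this question of every stored sequence and reports those with the highest scores.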
Also note that if you are working with a rare
organism where little sequence information is available,
you can construct and sequence your own cDNA library,
to provide information about protein coding genes in
your organism.
Another thing you can learn from comparing the
predicted cDNA sequence with the actual sequences
found in databases is how accurate the prediction was
that was made by the prediction program. This can lead
to editing the predicted gene to show the actual
sequence that is found by BLAST searching when this is
appropriate based on the available data.
As we learn more information about each gene,
more literature is published related to your gene, and
appears in the PubMed database at NCBI or in other
NCBI databases. Since you have an interlocking series of
databases at NCBI, the BLAST search itself gives you
access to a large body of information about sequences
related to your predicted sequence and to the actual
gene that you discovered in the genome that was
sequenced.
4.7.3. Published Genomes
Once such preliminary analyses have been
performed the data needs to be shared with the
applicable communities (scientific, medical, clinical,
students, the interested public, etc) to whom the
information is useful. The Genomes database at NCBI is
a resource where this is done.
Note that genomic databases at NCBI and elsewhere
are continually evolving, and new information is added
as it becomes available. This can make it difficult to
understand what you find, but with care you can follow
the process and wind up with the best information
available.