Chapter 4 – Sequencing, Databases
Chapter 4
Sequencing DNA and Databases
Introduction
One of the most remarkable scientific advancements in history is the molecular biology
revolution. In 1953 James Watson and Francis Crick proposed a molecular structure for DNA,
which Oswald Avery had previously shown to be the genetic material. The next question was to
determine how this genetic information coded for the proteins that carry out cellular functions.
Scientists therefore wanted to examine the sequences of the DNA they were working with. The
first DNA sequences were determined by very laborious methods that generated relatively short
sequences. Rapid DNA sequencing methods developed in the mid-1970s allowed
scientists to generate far more sequence data. Less than four decades later, the technology has
advanced so quickly that the genomic nucleotide sequences of numerous organisms, including
mouse and human, have been completed. For example, in July 1995, the first entire genomic
sequence of a free-living organism, the bacterium H. influenzae, was published in Science. The
genome has 1,830,137 base pairs with 1727 predicted genes. The article has 40 authors
(contributing scientists)! Since then, complete genomic sequences have been determined for 1167
microbes, and the list grows by a genome every 1-2 weeks
(http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi).
In April 1996, the sequence of the entire genome of the first eukaryotic organism, the yeast
Saccharomyces cerevisiae, was completed. Saccharomyces cerevisiae has 16 chromosomes
comprising a total of 12,068,000 base pairs, with an estimated 5,885 protein-encoding
genes. The genome sequence of the nematode C. elegans was published in 1998
(Science 1998, vol. 282, p. 2012). The authors sequenced over 97,000,000 bases, identifying an
estimated 19,099 protein-coding genes. The Drosophila genome was finished in
1999. The human genome has been completed. In June 2007, 186 eukaryotic genomes were being
sequenced. In June 2009 there were 581 genome projects, of which 264 were in the process of
being assembled and 27 were functionally complete, including a number of fungi, Drosophila
(fruit fly), C. elegans (roundworm), Arabidopsis (a weed that is used for basic plant research),
rice, mouse and human (see http://www.ncbi.nlm.nih.gov/genomes/leuks.cgi for a list of the genomes).
Genome projects for chimpanzee, cow, dog, pig, chicken, sea squirt, and pufferfish, along with
many others, are in the process of being assembled. The growth of GenBank, the sequence
database at the National Institutes of Health (NIH), is exponential; currently about 1 billion base
pairs of DNA are deposited monthly, and that rate has doubled every year for the past several
years. There are currently 100 billion bases from 165,000 different organisms stored in the US,
European and Japanese sequence databases. Researchers predict that in the future it should take
less than a day to determine the entire sequence of a microbe and maybe as little as several
weeks to determine the sequence of a human.
This year we will be cloning, sequencing, and analyzing cDNA from the duckweed Landoltia
punctata. Since there is only limited genomic sequence information available on this organism,
it is likely that every sequence you generate will be novel. To help contribute to the scientific
community and help provide information on the relatedness between different species we would
like to add the sequence information that you generate to the international genomic databases.
 2014 WSSP
Since other scientists will be using and relying on your data, it is essential that you analyze the
quality of your sequence before this data is submitted to the databases. We will therefore spend
a significant effort going over how to interpret your sequence information. In this chapter we
will first go over some of the background for how DNA sequencing is performed.
I. DNA Sequencing Theory
The method of DNA sequencing most commonly used today is the enzymatic method originally
developed by Sanger in 1977. This method is commonly referred to as dideoxy or chain
termination sequencing. In this method a short oligonucleotide primer is hybridized to the DNA
template that is to be sequenced (Fig 4-1). A DNA polymerase is then used to initiate DNA
synthesis extending from the primer in the 5’ to 3’ direction. The synthesized DNA is
complementary to the template strand of DNA. The reaction contains deoxynucleotides (dNTPs:
dATP, dCTP, dGTP, dTTP) that the polymerase uses to extend the chain. However, the reaction
also contains a small quantity of dideoxynucleotides (ddNTPs) (Fig 4-2). The ddNTPs lack the
OH group at the 3’ position. In DNA synthesis, each new nucleotide is added to the 3’ OH
group of the last nucleotide added. However, once the polymerase incorporates a ddNTP onto
the end of the chain, the chain cannot be extended further (Fig 4-3). Since the incorporation of
the ddNTP is random, some DNA chains will be terminated near the beginning of the synthesis,
while others will be extended further and terminate at other positions. Only a small percentage
of the chains become terminated at any particular base. Because all four deoxynucleotides are
present in the reaction, chain elongation proceeds normally until, by chance, DNA polymerase
inserts a dideoxynucleotide (shown as colored letters in Fig 4-1). If the ratio of
deoxynucleotides to the dideoxy versions is high enough, some DNA strands will succeed in
adding several hundred nucleotides before insertion of the dideoxy version halts the process.
At the end of the incubation period, the fragments are separated on a gel or column by length
from longest to shortest (Fig 4-1). The resolution is so good that a difference of one nucleotide
is enough to separate that strand from the next shorter and next longer strand.
Fig 4-1 Diagram of DNA sequencing using the chain termination method. All the DNA strands
initiate at the same position and are terminated by the addition of labeled ddNTPs. The
different size chains can be separated on a gel or column and the sequence determined.
Fig 4-2 The difference between dTTP and ddTTP is the presence of an H at the 3’ position.
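The chain termination idea can be tried out in a few lines of Python. The sketch below is only a toy model under stated assumptions: the function names, the 5% ddNTP fraction, and the example template are all made up for illustration, the primer is ignored, and the template is written 3’ to 5’ so that the new strand can be built left to right in the 5’ to 3’ direction.

```python
import random

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def extend_until_terminated(template_3to5, ddntp_fraction, rng):
    """Grow one new chain along the template until a ddNTP is incorporated.

    Returns the terminated chain (5' to 3'), or None if the polymerase
    reached the end of the template without ever picking up a ddNTP.
    """
    chain = []
    for template_base in template_3to5:
        chain.append(COMPLEMENT[template_base])
        # With a small probability the base just added was a dideoxy version,
        # which lacks the 3' OH, so the chain stops here.
        if rng.random() < ddntp_fraction:
            return "".join(chain)
    return None


def run_reaction(template_3to5, n_chains=2000, ddntp_fraction=0.05, seed=0):
    """Simulate many chains in one tube and keep only the terminated ones."""
    rng = random.Random(seed)
    chains = (
        extend_until_terminated(template_3to5, ddntp_fraction, rng)
        for _ in range(n_chains)
    )
    return [chain for chain in chains if chain is not None]


def read_sequence(fragments):
    """Sort fragments by length and read the labeled base at each 3' end.

    Assumes every length is represented; a missing length would silently
    drop a base from the read.
    """
    last_base_by_length = {len(f): f[-1] for f in fragments}
    return "".join(last_base_by_length[n] for n in sorted(last_base_by_length))


if __name__ == "__main__":
    fragments = run_reaction("TACGGATC")
    print(len(fragments), "terminated chains")
    print("sequence read from fragment lengths:", read_sequence(fragments))
```

Because every chain starts at the same position, the length of a fragment tells you where it stopped and the label on its last base tells you which nucleotide is at that position. Reading the fragments here from shortest to longest gives the new strand in the 5’ to 3’ direction.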
Fig. 4-3. Top strand: The deoxy-C at the end of the chain allows the addition of the next
base (A). Bottom strand: The dideoxy-C at the end of the chain prevents the addition of
the next base, terminating the chain. From Pearson Education Inc
Sequencing instruments have been developed to automate this process. These machines use
different fluorescent labels on each of the bases in the reaction to detect the DNA fragments as
they electrophorese off the gel or column. Since four different dyes are used, all the
reactions can be done in a single tube, thus increasing throughput. Sensitive lasers scan
the bottom of the gels and record the nucleotides that migrate off the gel. The figure below
shows the fluorescent patterns of a sequencing run (Fig 4-4). Each peak corresponds to a
different base in the DNA sequence. The sequence above the waveforms is the DNA sequence
that has been interpreted by the instrument. The waveform in the example is fairly clean and
there are no ambiguities. However, not all your sequences will be this straightforward!
Fig 4-4 An example of a DNA sequence waveform. Each peak represents a chain termination
at a particular bp in the sequence. The color of the peak represents the specific fluorescent dye
detected. The letters above indicate the sequence determined by the instrument.
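To see how an instrument might turn peaks into letters, here is a deliberately simple sketch in Python. It is not how real base-calling software works; the function name call_bases, the min_ratio threshold, and the intensity values are all assumptions chosen for illustration. Each peak is represented by the signal measured in the four dye channels, and a base is called only when one channel clearly dominates; otherwise the position is reported as N (ambiguous).

```python
def call_bases(peaks, min_ratio=2.0):
    """Call one base per peak from four dye-channel intensities.

    peaks: list of dicts such as {"A": 900, "C": 50, "G": 30, "T": 20}.
    A base is called only if the strongest channel is at least min_ratio
    times the second strongest; otherwise the call is "N".
    """
    calls = []
    for peak in peaks:
        ranked = sorted(peak.items(), key=lambda item: item[1], reverse=True)
        (top_base, top_signal), (_, second_signal) = ranked[0], ranked[1]
        if top_signal > 0 and top_signal >= min_ratio * second_signal:
            calls.append(top_base)
        else:
            calls.append("N")  # overlapping peaks or no signal
    return "".join(calls)


if __name__ == "__main__":
    example_peaks = [
        {"A": 900, "C": 50, "G": 30, "T": 20},   # clean A peak
        {"A": 40, "C": 60, "G": 800, "T": 30},   # clean G peak
        {"A": 400, "C": 380, "G": 20, "T": 10},  # A and C overlap
    ]
    print(call_bases(example_peaks))
```

The third peak in the example is the kind of ambiguity you will have to resolve by eye when you inspect your own waveforms.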
II. The use of computers in DNA sequence analysis.
A. Introduction
As discussed in the first part of this chapter, the development of fast and inexpensive methods
has caused an explosion in the number of DNA sequences that have been determined. The
growth of GenBank, the sequence database at NIH, is exponential; currently about 1 billion base
pairs of DNA are deposited monthly, and that rate has doubled every year for the past several
years. There are currently 100 billion bases from 165,000 different organisms stored in the US,
European and Japanese sequence databases.
Fig 4-5. Growth of GenBank
Fig 4-6 Growth of DNA sequences. From Baxevanis, AD (2011) Current Protocols in
Bioinformatics, Wiley Online Library.
With this fantastic success, however, has come a problem: too much DNA! Thus, as DNA
sequence information is generated, a problem with storage and analysis of the vast
amounts of information becomes apparent. This type of problem is ideally suited to computers.
Computers serve as tools for handling the vast amounts of sequence information generated by
molecular biologists.
Computers do much more for molecular biologists than just store sequence information.
Programs have also been written which analyze the DNA. For instance, it is important to know
where the protein coding sequences are located on a DNA fragment, what convenient restriction
enzyme sites are present in the DNA fragment, what gene regulatory sequences are present, etc.
Computers can easily and rapidly determine this information.
Another very important job for computers is to find similarities between an unknown DNA or
protein sequence and a known DNA or protein sequence in the databases. At the molecular
level, cells from all organisms function in remarkably similar ways. In fact, it has been shown
that regulatory molecules in a yeast cell can function in the same way in human cells, and vice
versa. Thus, information about a gene in one organism can shed light on a similar gene in
another organism. Computers will take a sequence that you enter and compare it with all the
DNA sequences currently known to find such cross-species homologs. These comparisons have
dramatically increased our understanding of molecular processes in all organisms.
The following is an introduction to DNA sequence analysis using computers. It is important to
understand the terminology as well as to know what is available to you. A scientist who does
not have a working knowledge of computers or is unaware of the available resources cannot
function in today’s research environment.
B. Searching the Databases:
The first thing that anyone wants to know about their DNA sequence is whether it has been
found before. The most straightforward way to accomplish this is to compare the DNA
sequence you have determined with all other DNA sequences present in the molecular biology
databases. This type of search (nucleic acid by nucleic acid) has several advantages, including
speed.
A nucleic acid-by-nucleic acid database search, however, also has disadvantages. For instance,
let’s say that the gene that encodes your cDNA has never been sequenced before, but a gene
with the same function in another organism has been sequenced. Would these two genes show
homology at the DNA level? The answer is, “not necessarily.” It is known that proteins having
similar functions in different organisms often share significant sequence homology at the level
of the amino acids, i.e., the two proteins have similar amino acids aligned in a similar fashion.
However, because of the degeneracy of the genetic code, the genes encoding these similar
proteins may show little homology on the DNA level. Therefore you should also translate your
DNA sequence into its cognate protein sequence and use the latter to search the protein
databases. Even if you have already identified your gene by searching the DNA databases, this
type of search is essential because it will help you find proteins from other organisms that have
similar protein sequences. It is possible that there may be little information about the function of
the protein that you have identified. However, closely related homologs may be present in other
organisms, such as Drosophila, C. elegans, yeast or humans, in which a large body of
information has already been obtained on the function and activity of the protein. In some
cases, the three-dimensional structure of one of the homologs may have been determined by
NMR or by X-ray crystallography. However, first let’s talk about DNA-by-DNA searches.
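The effect of the degeneracy of the genetic code is easy to demonstrate. The following Python sketch builds the standard codon table, translates two short DNA sequences, and compares them at the DNA and protein levels. The two 9-base sequences and the function names are invented for illustration; they are not taken from any real gene.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate DNA in reading frame 1; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def percent_identity(seq_a, seq_b):
    """Percentage of positions that match in two equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be the same length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


if __name__ == "__main__":
    gene_1 = "TTATCACGT"  # hypothetical gene fragment from organism 1
    gene_2 = "CTGAGCAGA"  # hypothetical gene fragment from organism 2
    protein_1, protein_2 = translate(gene_1), translate(gene_2)
    print("DNA identity:     %.1f%%" % percent_identity(gene_1, gene_2))
    print("protein identity: %.1f%%" % percent_identity(protein_1, protein_2))
```

Both fragments encode the same three amino acids (Leu-Ser-Arg), yet only 2 of their 9 bases match. A DNA-by-DNA search could easily miss a relationship that a protein search finds immediately.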
C. Sequence Databases.
A database is a group of related records that are stored by computers. For instance, a catalog
store might want to set up a database for all its customers. In this case, each record might
contain a customer’s name, address, and phone number. The complete list of such records
would comprise the database. Numerous computer programs have been developed which
manipulate such databases in extremely powerful ways.
Databases for molecular biologists contain information pertaining to sequence, structure, and
function of biological molecules. There are two major types of databases in molecular
biology— those that contain DNA sequence information and those that contain protein
sequence information. You will be expected to understand both types of databases.
DNA and protein databases come in several different varieties. The reason for this is mostly
historical. Laboratories across the world realized at about the same time that computers would
be necessary to analyze all the information coming on-line in molecular biology. These
individual labs spearheaded efforts in their various countries to begin biological databases. As a
result, DNA and protein databases were developed in the United States, Europe, and Japan. In
each case, institutions were set up to maintain and update the databases. These separate
databases still exist, but they are no longer isolated. Thus, when you do a database search
today, you generally search all existing databases, not just the one present in your own country.
The combined sequence information of all the databases is referred to as the non-redundant (or
nr) database. However, there are also databases with specific types of DNA sequences that can
be searched as well.
Each record in all the various databases contains essentially the same information. Most
important is the actual sequence, i.e., the DNA or protein sequence submitted by the individual
scientist. In addition, each record contains the name of the organism from which it came, the
date it was submitted, the address of the laboratory that did the submission, a reference to a
published paper if available, and a brief description of the sequence.
D. An example of a database entry (DNA sequence):
LOCUS       AB231879                1383 bp    mRNA    linear   INV 07-JUN-2006
DEFINITION  Artemia franciscana mRNA for zinc finger protein Af-Zic, complete
            cds.
ACCESSION   AB231879
VERSION     AB231879.1  GI:94966317
KEYWORDS    .
SOURCE      Artemia franciscana
  ORGANISM  Artemia franciscana
            Eukaryota; Metazoa; Arthropoda; Crustacea; Branchiopoda; Anostraca;
            Artemiidae; Artemia.
REFERENCE   1
  AUTHORS   Aruga,J., Kamiya,A., Takahashi,H., Fujimi,T.J., Shimizu,Y.,
            Ohkawa,K., Yazawa,S., Umesono,Y., Noguchi,H., Shimizu,T.,
            Saitou,N., Mikoshiba,K., Sakaki,Y., Agata,K. and Toyoda,A.
  TITLE     A wide-range phylogenetic analysis of Zic proteins: Implications
            for correlations between protein structure conservation and body
            plan complexity
  JOURNAL   Genomics 87 (6), 784-792 (2006)
   PUBMED   16574373
REFERENCE   2  (bases 1 to 1383)
  AUTHORS   Aruga,J. and Toyoda,A.
  TITLE     Direct Submission
  JOURNAL   Submitted (10-AUG-2005) Jun Aruga, RIKEN Brain Science Institute,
            Laboratory for Comparative Neurogenesis; 2-1 Hirosawa, Wako-shi,
            Saitama 351-0198, Japan (E-mail:[email protected],
            URL:http://www.brain.riken.go.jp/labs/lcn/, Tel:81-48-467-9791,
            Fax:81-48-467-9792)
FEATURES             Location/Qualifiers
     source          1..1383
                     /organism="Artemia franciscana"
                     /mol_type="mRNA"
                     /db_xref="taxon:6661"
     gene            1..1383
                     /gene="Af-Zic"
     CDS             1..1383
                     /gene="Af-Zic"
                     /codon_start=1
                     /product="zinc finger protein Af-Zic"
                     /protein_id="BAE94140.1"
                     /db_xref="GI:94966318"
                     /translation="MTASLSASVMNPSFIKRESPASATALFVPNQFSAVPNFGFHHVP
                     SACATEQSSEMLNPFVDNHLRLNDQSNFQGYHHPHHGQIQQHHLGSYAARDFLFRRDM
                     GLGMGLEAHHTHAAQHHHMFDPSHAAAAAHHAMFTGFDHNTMRLPTEMYTRDASGYAA
                     QQFHQMGSMAPMAHPASAGAFLRYMRTPIKQELHCLWVDPEQPSPKKTCGKTFGSMHE
                     IVTHITVEHVGGPECTNHACFWQGCVRNGRAFKAKYKLVNHIRVHTGEKPFPCPFPGC
                     GKVFARSENLKIHKRTHTGEKPFKCEFEGCDRRFANSSDRKKHSHVHTSDKPYNCKVR
                     GCDKSYTHPSSLRKHMKVHGKSPPPASSGCDSDENESIADTNSDSAASPSPSSHDSSQ
                     VQVNHNRPPNHHNLGLGFTNPGHIGDWYVHQSAPDMPVPPATEHSPIGPPMHHPPNSL
                     NYFKTELVQN"
ORIGIN
        1 atgactgcta gtttaagtgc aagcgtgatg aatccaagtt ttataaagag ggaaagtcct
       61 gcatcggcta cagccctgtt cgtaccaaac caatttagtg cagtgcctaa ttttggattt
      121 caccatgttc ctagtgcttg tgcaactgag caaagtagtg aaatgctgaa cccttttgtg
(Note: the rest of the DNA sequence was deleted to save space)
There are two parts to any sequence file, the annotation and the sequence. The annotation
portion includes items such as the definition (a description of the DNA or protein sequence and
what it codes for), accession number (a unique identification number that allows easy retrieval
from any database) and the CDS (the protein coding sequence). A large clone may encode
several proteins. The sequence portion is numbered and formatted into units of 10 bases or
amino acids. Note that although there is a protein sequence in this file, it is part of the
annotation.
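Because the sequence portion follows such a regular layout, it is simple for a program to read. As a minimal sketch, the Python below takes the three ORIGIN lines shown in the example entry and joins them into one continuous sequence. The function name parse_origin is hypothetical, and the code only handles the numbered sequence lines, not the annotation.

```python
ORIGIN_LINES = [
    "        1 atgactgcta gtttaagtgc aagcgtgatg aatccaagtt ttataaagag ggaaagtcct",
    "       61 gcatcggcta cagccctgtt cgtaccaaac caatttagtg cagtgcctaa ttttggattt",
    "      121 caccatgttc ctagtgcttg tgcaactgag caaagtagtg aaatgctgaa cccttttgtg",
]


def parse_origin(lines):
    """Join numbered sequence lines into a single string of bases.

    Each line starts with the position of its first base, followed by
    blocks of 10 bases. The position is checked against the number of
    bases read so far to catch missing or reordered lines.
    """
    sequence = ""
    for line in lines:
        position, *blocks = line.split()
        if int(position) != len(sequence) + 1:
            raise ValueError("unexpected position %s" % position)
        sequence += "".join(blocks)
    return sequence


if __name__ == "__main__":
    sequence = parse_origin(ORIGIN_LINES)
    print(len(sequence), "bases read")
    print(sequence[:30])
```

The first 30 bases printed here are the ones that encode the first 10 amino acids (MTASLSASVM) listed in the /translation field of the annotation.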
There is another very important thing to remember about sequences and information found in
these databases. This information is submitted by scientists not unlike you. The accuracy of
these sequences is not guaranteed by the database managers (who are often computer scientists)
and is only as reliable as the scientist(s) who submitted the sequence. As a result, some types of
sequences are very good (e.g. C. elegans genomic sequence has an error rate of less than 1 in
10,000). In contrast, sequences in the Expressed Sequence Tag (EST) databases can have error
rates on the
order of 1%. This difference occurs because the EST projects do not need as highly accurate
sequence data, so rough sequence goes directly into the database, whereas genome sequencing
projects require a high degree of accuracy, so the sequence is checked and rechecked.
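To get a feel for what these error rates mean in practice, the short Python calculation below applies both rates to a sequence the length of the example entry (1383 bases). It assumes, as a simplification, that errors occur independently at every base; the function names are made up for illustration.

```python
def expected_errors(length_bp, error_rate):
    """Average number of wrong bases in a sequence of the given length."""
    return length_bp * error_rate


def prob_at_least_one_error(length_bp, error_rate):
    """Chance that the sequence contains one or more wrong bases,
    assuming each base is wrong independently with the same probability."""
    return 1.0 - (1.0 - error_rate) ** length_bp


if __name__ == "__main__":
    length_bp = 1383
    for label, rate in (("EST-quality (1%)", 0.01),
                        ("finished genome (1 in 10,000)", 0.0001)):
        print(label)
        print("  expected errors: %.2f" % expected_errors(length_bp, rate))
        print("  P(at least one error): %.3f"
              % prob_at_least_one_error(length_bp, rate))
```

At a 1% error rate a sequence of this length is expected to contain about 14 wrong bases, while at 1 in 10,000 most sequences of this length would contain none.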
E. Available Databases.
A good place to start if you don't know where to find what you want is the National Center for
Biotechnology Information (NCBI) (http://www.ncbi.nlm.nih.gov/). From this page, you can
access almost any sequence, link to any organism’s genome project home page, do searches for
published papers using keywords, and more. As mentioned earlier, there are numerous
databases around the world. Some databases with which you should be familiar are listed here:
General Databases:
GenBank      DNA sequences (USA database)
EMBL         DNA sequences (European Molecular Biology Laboratory)
GenEMBL      GenBank and EMBL sequences combined
DDBJ         DNA sequences (Japan’s equivalent of GenBank)
EST          Expressed Sequence Tags (USA) (DNA sequences)
STS          Sequence Tagged Sites (USA) (DNA sequences)
PIR          Protein Identification Resource (protein sequences)
SwissProt    Protein sequences (Switzerland and EMBL)
Genpept      Translations of DNA based on authors’ information
PDB          Coordinates for protein 3D structure. (Maintained at Rutgers!)
Organism-Specific Databases:
SGD          Saccharomyces Genome Database
YPD          Yeast Protein Database
WPD          Worm Protein Database
Wormbase     C. elegans Genome Database
Sanger       Worm sequence and genomic database
Flybase      Drosophila sequence and genetic database
Human        Many
A list of the eukaryotic genomes that are in the process of being sequenced or that have been
completed can be found at http://www.ncbi.nlm.nih.gov/genomes/leuks.cgi. This site also has
links to the Genomic Center or Consortium that is coordinating the sequencing project.
F. Search programs used in WSSP.
BLAST (Basic Local Alignment Search Tool) is used to search either protein or
DNA sequence databases with your own query sequence. This will be the main search
program you will use in your research. BLAST allows you to compare your DNA
sequence with all DNA sequences in the various databases. As mentioned above, this is
the first thing that you will want to do with your cDNA sequence files. BLAST will simply
and quickly tell you whether or not your sequence is already in the databases. It can also
tell you if your sequence is similar to other sequences in the databases.
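The basic idea of comparing a query with every sequence in a database can be sketched in Python as follows. This is not the BLAST algorithm, which uses a much faster heuristic and allows for gaps; it is a slow, ungapped toy version, and the function names and the two-entry database are invented for illustration.

```python
def best_ungapped_hit(query, subject):
    """Slide the query along the subject and return (matches, offset)
    for the position with the most matching bases."""
    best_matches, best_offset = 0, 0
    for offset in range(len(subject) - len(query) + 1):
        window = subject[offset:offset + len(query)]
        matches = sum(q == s for q, s in zip(query, window))
        if matches > best_matches:
            best_matches, best_offset = matches, offset
    return best_matches, best_offset


def search(query, database):
    """Compare the query with every database entry.

    Returns (fraction identical, entry name, offset) tuples, best hit first.
    """
    hits = []
    for name, subject in database.items():
        matches, offset = best_ungapped_hit(query, subject)
        hits.append((matches / len(query), name, offset))
    return sorted(hits, reverse=True)


if __name__ == "__main__":
    toy_database = {
        "seqA": "TTTTACGTACGTTTTT",
        "seqB": "GGGGGGGGGGGGGGGG",
    }
    for identity, name, offset in search("ACGTACGT", toy_database):
        print("%s: %.0f%% identical at offset %d" % (name, 100 * identity, offset))
```

A perfect hit (100% identical) means the query is already in the database; a lower but still high score suggests a similar, possibly related, sequence.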
The term BLAST actually refers to a series of search programs. The type of program
you will use depends on whether your query sequence is a DNA or protein sequence and
whether the databases you are searching contain DNA or protein sequences. You will
learn the functions of the various BLAST programs as well as the various databases that are
available.
blastn: compares a nucleotide query sequence against a nucleotide sequence database
blastp: compares an amino acid (protein) query sequence against a protein sequence
database.
blastx: compares a nucleotide query sequence translated in all reading frames against a
protein sequence database. If you submit a DNA sequence for your BLAST search, and
specify a protein database, blastx will translate your DNA sequence in all six
reading frames (three on each strand), then compare the translated sequences to the protein
sequences in the protein sequence database.
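The six reading frames come from starting translation at the first, second, or third base of the sequence, and then doing the same on the opposite (reverse complement) strand. The Python sketch below shows one way to produce all six translations. It only illustrates the translation step, not the database comparison, and the function names and frame labels (+1 to +3, -1 to -3) are assumptions made for this example.

```python
from itertools import product

# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna):
    """Return the opposite strand, written 5' to 3'."""
    return dna.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate from the first base; '*' marks a stop codon."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Translate both strands starting at base 1, 2, and 3."""
    frames = {}
    for strand, sequence in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames["%s%d" % (strand, offset + 1)] = translate(sequence[offset:])
    return frames


if __name__ == "__main__":
    # First 30 bases of the example database entry in this chapter.
    query = "atgactgctagtttaagtgcaagcgtgatg"
    for frame, protein in six_frame_translations(query).items():
        print(frame, protein)
```

For this query, frame +1 gives MTASLSASVM, which matches the start of the /translation field in the example entry. When you do not know which strand or frame of your cDNA is the coding one, searching all six lets the protein database tell you.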