Completed Genomes:
Viruses and Bacteria
Monday, October 20, 2003
Introduction to Bioinformatics
ME:440.714
J. Pevsner
Copyright notice
Many of the images in this powerpoint presentation
are from Bioinformatics and Functional Genomics
by J Pevsner (ISBN 0-471-21004-8).
Copyright © 2003 by Wiley.
These images and materials may not be used
without permission from the publisher.
Visit http://www.bioinfbook.org
Announcements
We are now beginning the last third of the course:
Today: completed genomes (Chapters 12-14)
Wednesday: Fungi. Exam #2 is due at the start of class.
Next Monday: Functional genomics (Jef Boeke)
Next Wednesday: Pathways (Joel Bader)
Monday Nov. 3: Eukaryotic genomes
Wednesday Nov. 5: Human genome
Monday Nov. 10: Human disease
Wednesday Nov. 12: Final exam (in class)
Outline of today’s lecture
Genome projects (Chapter 12)
chronological overview
major issues and themes
Introduction to viruses (Chapter 13)
classification
bioinformatics challenges and resources
Introduction to bacteria and archaea (Chapter 14)
classification
bioinformatics challenges and resources
Introduction to genomes
A genome is the collection of DNA that comprises
an organism. Today we have assembled the sequence
of hundreds of genomes. We will begin by introducing
the “tree of life” in an effort to make a comprehensive
survey of life forms.
Page 397
Introduction: Systematics
Ernst Haeckel (1834-1919), a supporter of Darwin,
published a tree of life (1879) including Moner
(formless clumps, later named bacteria).
Chatton (1937) distinguished prokaryotes (bacteria
that lack nuclei) from eukaryotes (having nuclei).
Whittaker and others described the five-kingdom
system: animals, plants, protists, fungi, and monera.
In the 1970s and 1980s, Carl Woese and colleagues
described the archaea, thus forming a tree of life
with three main branches.
Page 399
Five kingdom system (Haeckel, 1879)
[Figure labels: mammals, vertebrates, animals, invertebrates, plants, fungi, protists, monera, protozoa]
Page 396
Pace (2001) described a tree
of life based on small subunit
rRNA sequences.
This tree shows the main
three branches described
by Woese and colleagues.
Fig. 12.1
Page 400
Molecular sequences as basis of trees
Historically, trees were generated primarily using
characters provided by morphological data. Molecular
sequence data are now commonly used, including
sequences (such as small-subunit RNAs) that are
highly conserved.
Visit the European Small Subunit Ribosomal RNA
database for 20,000 SSU rRNA sequences.
Page 401
Genome sequencing projects
Genomes that span the tree of life are being
sequenced at a rapid rate. There are several web-based
resources that document the progress, including:
GNN
Genome News Network
http://www.genomenewsnetwork.org/main.shtml
GOLD
Genomes Online Database
http://wit.integratedgenomics.com/GOLD/
PEDANT
Protein Extraction, Description & Analysis Tool
http://pedant.gsf.de/
Page 405
Genome sequencing projects
There are three main resources for genomes:
EBI
European Bioinformatics Institute
http://www.ebi.ac.uk/genomes/
NCBI
National Center for Biotechnology Information
http://www.ncbi.nlm.nih.gov
TIGR
The Institute for Genomic Research
http://www.tigr.org
Page 405
archaea
bacteria
eukaryota
http://www.ncbi.nlm.nih.gov/Entrez/
Overview of viral complete genomes
Overview of archaea complete genomes
Overview of eukaryota genomes in NCBI’s Entrez division
Chronology of genome sequencing projects
We will next summarize the major achievements in
genome sequencing projects from a chronological
perspective.
Page 404
Chronology of genome sequencing projects
1977: first viral genome
Sanger et al. sequence bacteriophage φX174.
This virus is 5386 base pairs (encoding 11 genes).
See accession J02482.
1981
Human mitochondrial genome
16,500 base pairs (encodes 13 proteins, 2 rRNA, 22 tRNA)
Today, over 400 mitochondrial genomes sequenced
1986
Chloroplast genome
156,000 base pairs (most are 120 kb to 200 kb)
Page 406
Entrez nucleotide record for bacteriophage φX174
(graphics display)
Fig. 12.6
Page 407
mitochondrion
chloroplast
Lack
mitochondria (?)
Chronology of genome sequencing projects
1995: first genome of a free-living organism,
the bacterium Haemophilus influenzae
Page 409
1995: genome of the bacterium
Haemophilus influenzae is sequenced
Fig. 12.9
Page 411
Overview of bacterial
complete genomes
You can find functional
annotation through the
COGs database
(Clusters of
Orthologous
Genes)
Fig. 12.9
Page 411
Click the circle to
access the genome
sequence
Fig. 12.9
Page 411
Genes are color-coded
according to the
COGs scheme
Click the circle to
access the genome
sequence
Fig. 12.10
Page 412
Chronology of genome sequencing projects
1996: first eukaryotic genome
The complete genome sequence of the budding yeast
Saccharomyces cerevisiae was reported. We will
describe this genome on Wednesday.
Also in 1996, TIGR reported the sequence of the first
archaeal genome, Methanococcus jannaschii.
Page 413
1996: a yeast genome is sequenced
To place the sequencing
of the yeast genome
in context, these are the
eukaryotes…
Eukaryotes
(Baldauf et al. 2000)
Fungi
Chronology of genome sequencing projects
1997:
More bacteria and archaea
Escherichia coli
4.6 megabases, 4200 proteins (38% of unknown function)
1998: first multicellular organism
Nematode Caenorhabditis elegans
97 Mb; 19,000 genes.
1999: first human chromosome
Chromosome 22 (49 Mb, 673 genes)
Page 413
1999: Human chromosome 22 sequenced
49 Mb
673 genes
Chronology of genome sequencing projects
2000:
Fruitfly Drosophila melanogaster (13,000 genes)
Plant Arabidopsis thaliana
Human chromosome 21
2001: draft sequence of the human genome
(public consortium and Celera Genomics)
Page 415
2000
Completed genome projects (current)
Eukaryotes: 10
Anopheles gambiae
Arabidopsis thaliana
Caenorhabditis elegans
Drosophila melanogaster
Encephalitozoon cuniculi
Guillardia theta nucleomorph
Mus musculus
Plasmodium falciparum
Saccharomyces cerevisiae (yeast)
Schizosaccharomyces pombe
In progress (partial):
Danio rerio (zebrafish)
Glycine max (soybean)
Hordeum vulgare (barley)
Leishmania major
Rattus norvegicus
Viruses: 1419
Bacteria: 139
Archaea: 36
Page 417
eukaryotes
Overview of genome analysis
[1] Selection of genomes for sequencing
[2] Sequence one individual genome, or several?
[3] How big are genomes?
[4] Genome sequencing centers
[5] Sequencing genomes: strategies
[6] When has a genome been fully sequenced?
[7] Repository for genome sequence data
[8] Genome annotation
Page 418
Fig. 12.11
Page 418
Overview of genome analysis
[1] Selection of genomes for sequencing is based
on criteria such as:
• genome size (some plant genomes are far larger than the human genome)
• cost
• relevance to human disease (or other disease)
• relevance to basic biological questions
• relevance to agriculture
Ongoing projects:
Chicken
Chimpanzee
Cow
Dog (recent publication)
Fungi (many)
Honey bee
Sea urchin
Rhesus macaque
Page 419
Overview of genome analysis
[2] Sequence one individual genome, or several?
Try one…
--Each genome center may study one
chromosome from an organism
--It is necessary to measure polymorphisms
(e.g. SNPs) in large populations (November 5)
For viruses, thousands of isolates may be sequenced.
For the human genome, cost is the impediment.
Page 419
Overview of genome analysis
[3] How big are genomes?
Viral genomes: 1 kb to 350 kb (Mimivirus: 800 kb)
Bacterial genomes: 0.5 Mb to 13 Mb
Eukaryotic genomes: 8 Mb to 686,000 Mb
(discussed further on Monday, November 3)
Page 420
Genome sizes in nucleotide base pairs
[Figure: ranges of genome size, on a scale from 10^4 to 10^11 bp, for plasmids, viruses, bacteria, fungi, plants, algae, insects, mollusks, bony fish, amphibians, reptiles, birds, and mammals]
The size of the human genome is ~3 × 10^9 bp;
almost all of its complexity is in single-copy DNA.
The human genome is thought
to contain ~30,000-40,000 genes.
http://www3.kumc.edu/jcalvet/PowerPoint/bioc801b.ppt
Overview of genome analysis
[4] 20 Genome sequencing centers contributed
to the public sequencing of the human genome.
Many of these are listed at the Entrez genomes site.
(See Table 17.6, page 625.)
Page 421
Overview of genome analysis
[5] There are two main strategies for sequencing genomes
Whole Genome Shotgun (from the NCBI website)
An approach used to decode an organism's genome
by shredding it into smaller fragments of DNA which
can be sequenced individually. The sequences of these
fragments are then ordered, based on overlaps in the
genetic code, and finally reassembled into the complete
sequence. The 'whole genome shotgun' (WGS) method is
applied to the entire genome all at once, while the
'hierarchical shotgun' method is applied to large,
overlapping DNA fragments of known location in
the genome.
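The idea of ordering fragments by their overlaps can be illustrated with a toy example. The sketch below is a minimal greedy overlap merger, not the algorithm used by any real assembler; the function names (overlap, greedy_assemble) and the min_len parameter are hypothetical, and the reads are assumed to be error-free and all from the same strand.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if the best overlap is shorter than `min_len`.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # no overlaps left: the pieces stay as separate contigs
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


if __name__ == "__main__":
    # Three made-up overlapping reads from a 15-base toy "genome".
    print(greedy_assemble(["ATGGCGTA", "GCGTACGT", "TACGTTAGC"]))
```

Repeats longer than a read make this greedy approach ambiguous, which is one reason the hierarchical method restricts assembly to fragments of known location.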
Page 421
Overview of genome analysis
Hierarchical shotgun method
Assemble contigs from various chromosomes, then
sequence and assemble them. A contig is a set of
overlapping clones or sequences from which a sequence
can be obtained. The sequence may be draft or finished.
A contig is thus a chromosome map showing the
locations of those regions of a chromosome where
contiguous DNA segments overlap. Contig maps are
important because they provide the ability to study a
complete, and often large segment of the genome by
examining a series of overlapping clones which then
provide an unbroken succession of information about
that region.
Page 421
Overview of genome analysis
[6] When has a genome been fully sequenced?
A typical goal is to obtain five- to ten-fold coverage.
Finished sequence: a clone insert is contiguously
sequenced to a high-quality standard (an error rate of
0.01%). There are usually no gaps in the sequence.
Draft sequence: clone sequences may contain several
regions separated by gaps. The true order and
orientation of the pieces may not be known.
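The coverage and error-rate figures above can be made concrete with a short calculation. This is a sketch under stated assumptions: fold coverage is taken as (number of reads × read length) / genome size, the uncovered fraction uses the Lander-Waterman approximation e^(-coverage) for randomly placed reads, and the error rate is converted to a Phred-style quality score. The function names and the example read counts are hypothetical, not taken from the lecture.

```python
import math


def fold_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average number of times each base is sequenced."""
    return num_reads * read_length / genome_size


def expected_uncovered_fraction(coverage: float) -> float:
    """Lander-Waterman estimate of the fraction of bases hit by no read."""
    return math.exp(-coverage)


def phred_quality(error_rate: float) -> float:
    """Phred-style quality score: Q = -10 * log10(error probability)."""
    return -10 * math.log10(error_rate)


if __name__ == "__main__":
    # Hypothetical project: 48 million reads of 500 bases on a 3 x 10^9 bp genome.
    c = fold_coverage(48_000_000, 500, 3_000_000_000)
    print(f"coverage: {c:.1f}-fold")
    for depth in (5, 10):
        print(f"{depth}-fold: {expected_uncovered_fraction(depth):.5%} of bases uncovered")
    print(f"error rate 0.01% corresponds to about Q{phred_quality(0.0001):.0f}")
```

Under this model, gaps remain even at five-fold coverage, which is consistent with draft sequence containing regions separated by gaps.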
Page 422
Overview of genome analysis
[7] Repository for genome sequence data
Raw data from many genome sequencing projects
are stored at the trace archive at NCBI or EBI
(main NCBI page, bottom right)
Page 425
Fig. 12.14
Page 426
Overview of genome analysis
[8] Genome annotation
Information content in genomic DNA includes:
-- repetitive DNA elements
-- nucleotide composition (GC content)
-- protein-coding genes, other genes
These topics will be discussed in detail on
November 3 (eukaryotic genomes)
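As a small illustration of nucleotide composition, GC content can be computed directly from a sequence string. The sketch below is one possible way to do it; the function names (gc_content, gc_windows) are hypothetical, and the choice to ignore ambiguous bases such as N is an assumption rather than a standard.

```python
def gc_content(seq: str) -> float:
    """Percentage of G and C among the unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        raise ValueError("sequence contains no A, C, G, or T")
    return 100.0 * (seq.count("G") + seq.count("C")) / counted


def gc_windows(seq: str, window: int, step: int) -> list[float]:
    """GC percentage in successive windows along the sequence."""
    return [
        gc_content(seq[start:start + window])
        for start in range(0, len(seq) - window + 1, step)
    ]


if __name__ == "__main__":
    toy = "GGCCAATTGCGCATAT"  # made-up sequence for illustration
    print(gc_content(toy))
    print(gc_windows(toy, window=4, step=4))
```

Running the windowed version along a genome shows how composition varies from region to region, not just between species.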
Page 425
GC content varies across genomes
[Figure: histograms of the number of species in each GC class for bacteria, plants, invertebrates, and vertebrates; x-axis: GC content (%), 20 to 80]
Fig. 12.16
Page 428
Introduction to viruses
Viruses are small, infectious, obligate intracellular
parasites. They depend on host cells to replicate.
Because they lack the resources for independent
existence, they exist on the borderline of the definition
of life.
The virion (virus particle) consists of a nucleic acid
genome surrounded by coat proteins (capsid) that may
be enveloped in a host-derived lipid bilayer.
Viral genomes consist of either RNA or DNA. They may
be single-stranded, double-stranded, or partially double-stranded. The
genomes may be circular, linear, or segmented.
Page 437
Introduction to viruses
Viruses have been classified by several criteria:
-- based on morphology (e.g. by electron microscopy)
-- by type of nucleic acid in the genome
-- by size (rubella is about 2 kb; HIV-1 about 9 kb;
poxviruses are several hundred kb). Mimivirus
(for Mimicking microbe) has a double-stranded
circular genome of 800 kb.
-- based on human disease
Page 438
Fig. 13.1
Page 439
The International Committee on Taxonomy of Viruses
(ICTV) offers a website, accessible via NCBI’s Entrez site
http://www.ncbi.nlm.nih.gov/ICTVdb/
Fig. 13.2
Page 440
Introduction to viruses
Vaccine-preventable viral diseases include:
Hepatitis A
Hepatitis B
Influenza
Measles
Mumps
Poliomyelitis
Rubella
Smallpox
Page 441
Bioinformatic approaches to viruses
Some of the outstanding problems in virology include:
-- Why does a virus such as HIV-1 infect one species
(human) selectively?
-- Why do some viruses change their natural host?
In 1997 a chicken influenza virus killed six people.
-- Why are some viral strains particularly deadly?
-- What are the mechanisms of viral evasion of the host
immune system?
-- Where did viruses originate?
Page 439-441
Diversity and evolution of viruses
The unique nature of viruses presents special challenges
to studies of their evolution.
• viruses tend not to survive in historical samples
• viral polymerases of RNA genomes typically lack
proofreading activity
• viruses undergo an extremely high rate of replication
• many viral genomes are segmented; shuffling may occur
• viruses may be subjected to intense selective pressures
(host immune responses, antiviral therapy)
• viruses invade diverse species
• the diversity of viral genomes precludes us from
making comprehensive phylogenetic trees of viruses
Page 441
Bioinformatic approaches to herpesvirus
Herpesviruses are double-stranded DNA viruses that
include herpes simplex, cytomegalovirus, and Epstein-Barr.
Phylogenetic analysis suggests three major groups
that originated about 180-220 MYA.
Page 442
Fig. 13.3
Page 443
Bioinformatic approaches to herpesvirus
Consider human herpesvirus 8 (HHV-8). Its genome is
about 140,000 base pairs and encodes about 80 proteins.
We can explore this virus at the NCBI website.
Try NCBI → Entrez → Genomes → viruses → dsDNA
Page 442
Fig. 13.4
Page 444
Fig. 13.5
Page 445
Fig. 13.10
Page 449
Bioinformatic approaches to herpesvirus
Consider human herpesvirus 8 (HHV-8). Its genome is
about 140,000 base pairs and encodes about 80 proteins.
Microarrays have been used to define changes in viral gene
expression at different stages of infection (Paulose-Murphy
et al., 2001). Conversely, gene expression changes have
been measured in human cells following viral infection.
Page 442
Paulose-Murphy et al. (2001)
described HHV-8 viral genes
that are expressed at different
times post infection
Fig. 13.11
Page 450
Bioinformatic approaches to HIV
Human Immunodeficiency Virus (HIV) is the cause of
AIDS. At the end of the year 2002, 42 million people were
infected. HIV-1 and HIV-2 are primate lentiviruses.
The HIV-1 genome is 9181 bases in length. Note that
there are almost 100,000 Entrez nucleotide records
for this genome (but only one RefSeq entry).
Phylogenetic analyses suggest that HIV-2 appeared as
a cross-species contamination from a simian virus,
SIVsm (sooty mangabey). Similarly, HIV-1 appeared
from simian immunodeficiency virus of the chimpanzee
(SIVcpz).
Page 446
Fig. 13.6
Page 446
Bioinformatic approaches to HIV
Two major resources are NCBI and the Los Alamos
National Laboratory (LANL) databases.
See http://hiv-web.lanl.gov/
LANL offers
-- an HIV BLAST server
-- Synonymous/non-synonymous analysis program
-- a multiple alignment program
-- a PCA-like tool
-- a geography tool
Page 453
Fig. 13.13
Page 452
Fig. 13.6
Page 446
Bacteria and archaea: genome analysis
Bacteria and archaea constitute two of the three main
branches of life. Together they are the prokaryotes.
We can classify prokaryotes based on six criteria:
[1] morphology
[2] genome size
[3] lifestyle
[4] relevance to human disease
[5] molecular phylogeny (rRNA)
[6] molecular phylogeny (other molecules)
Page 466
Fig. 14.1
Page 468
M. genitalium has
one of the smallest
bacterial genome
sizes. View its
genome at
www.tigr.org
Fig. 14.2
Page 470
Bacteria and archaea: lifestyles
We may distinguish six prokaryotic lifestyles:
[1] Extracellular (e.g. E. coli)
[2] Facultatively intracellular (Mycobacterium tuberculosis)
[3] Extremophilic (e.g. M. jannaschii)
[4] Epicellular (e.g. Mycoplasma pneumoniae)
[5] obligate intracellular and symbiotic (B. aphidicola)
[6] obligate intracellular and parasitic (Rickettsia)
Page 472
Fig. 14.4
Page 477
Revised
figure
Fig. 14.5
Page 478
Fig. 14.6
Page 479
DNA sequence of both chromosomes
of the cholera pathogen Vibrio cholerae
Nature 406, 477-483 (2000)
Bacteria and archaea: finding genes
Four main features of genomic DNA are useful:
[1] Open reading frame length
[2] Consensus for ribosome binding (Shine-Dalgarno)
[3] Pattern of codon usage
[4] Homology of putative gene to other genes
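The first feature, open reading frame length, is simple enough to sketch in code. The example below scans all six reading frames for stretches that begin with ATG and run to the first in-frame stop codon. It is a simplified illustration, not how GLIMMER works: the function names and the min_codons threshold are hypothetical, alternative start codons are ignored, and no use is made of ribosome binding sites, codon usage, or homology.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(_COMPLEMENT)[::-1]


def orfs_in_strand(seq: str, min_codons: int) -> list[str]:
    """ORFs (ATG through stop) in the three frames of one strand."""
    found = []
    for frame in range(3):
        start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if start is None and codon == "ATG":
                start = pos  # first in-frame ATG opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                # count codons before the stop; keep only long ORFs
                if (pos - start) // 3 >= min_codons:
                    found.append(seq[start:pos + 3])
                start = None
    return found


def find_orfs(seq: str, min_codons: int = 100) -> list[str]:
    """ORFs of at least `min_codons` codons on both strands."""
    seq = seq.upper()
    return orfs_in_strand(seq, min_codons) + orfs_in_strand(
        reverse_complement(seq), min_codons
    )


if __name__ == "__main__":
    # Made-up sequence with one short ORF; real genes need a higher threshold.
    print(find_orfs("CCATGAAATTTGGGTAACC", min_codons=3))
```

A length threshold matters because short ORFs occur frequently by chance in random sequence, so length alone is weak evidence unless combined with the other three features.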
Page 480
GLIMMER for gene-finding
in bacteria (www.tigr.org)
Fig. 14.7
Page 482
Lateral gene transfer occurs in stages
Fig. 14.8
Page 484
COGs database:
organisms and tools
COGs database:
functional annotation
COGs database:
distribution of COGs
by number of clades...
COGs database:
distribution of COGs
by number of species
How can whole genomes be compared?
-- molecular phylogeny
-- You can BLAST (or PSI-BLAST) all the DNA and/or
protein in one genome against another
-- TaxPlot and COG for bacterial (and for
some eukaryotic) genomes
-- PipMaker, MUMmer and other programs align large
stretches of genomic DNA from multiple species
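To give a feel for alignment-free comparison, the sketch below scores two sequences by the fraction of short words (k-mers) they share. This is a deliberately crude stand-in and is not the method used by BLAST, TaxPlot, PipMaker, or MUMmer; the function names and the default word size k are hypothetical choices for illustration.

```python
def kmers(seq: str, k: int) -> set[str]:
    """All distinct words of length k in the sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_similarity(a: str, b: str, k: int = 4) -> float:
    """Shared k-mers divided by total distinct k-mers (0.0 to 1.0)."""
    ka, kb = kmers(a.upper(), k), kmers(b.upper(), k)
    union = ka | kb
    return len(ka & kb) / len(union) if union else 0.0


if __name__ == "__main__":
    # Made-up sequences: the second differs from the first by one substitution.
    x = "ATGGCGTACGTTAGCATCGA"
    y = "ATGGCGTACGTTAGCTTCGA"
    print(jaccard_similarity(x, x))
    print(jaccard_similarity(x, y))
```

A score like this says nothing about gene order or where the differences lie, which is what the alignment programs listed above are designed to show.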
Fig. 14.16
Page 493
Fig. 14.17
Page 494
Fig. 14.18
Page 495