MGY428- Genomes
John Parkinson
15-704 TMDT East Tower
Hospital for Sick Children
History and availability of genomes
What's in a genome
Genome comparisons
Prokaryotes / Eukaryotes
Genome analyses
Finding information on genome projects
http://www.genomesonline.org/
Timeline of genome sequencing
[Figure: cumulative number of genomes (0-600) for Bacteria, Archaea and Eukarya; time axis marked 1974, 1984, 1995, 1998, 2000, Today]
Milestones marked on the timeline:
Fred Sanger - method of sequencing DNA
HIV-1 genome sequenced
H. influenzae (first bacterium)
S. cerevisiae (first eukaryote)
M. tuberculosis (minimum genome)
C. elegans (1st multicellular organism)
D. melanogaster (1st eukaryote via W.G.S)
H. sapiens draft
H. sapiens “finished”
Many bacteria – fewer eukaryotes
Number of published genomes
Bacteria - 526+
Eukaryotes - 66
Archaea - 47
[Figure: tree of the eukaryotes (including Metazoa and Nematodes) showing the number of published genomes per group; Nematodes - 2]
Where do I find genome data?
Sources are ordered from well annotated 'finished' sequence down to poorly annotated 'draft' sequence.
Organism specific sites
Yeast - SGD - yeastgenome.org
Plasmodium - plasmodb.org
C. elegans - wormbase.org
Drosophila - flybase.org
Generic genome sites
ENSEMBL - ensembl.org
NCBI - ncbi.nlm.nih.gov/Genomes/
Sequencing centers
Sanger - sanger.ac.uk
TIGR - tigr.org
Wormbase - Caenorhabditis elegans
Focuses on nematodes
Among the best annotated
Updated bi-monthly by expert curators
ENSEMBL
Mainly focuses on vertebrates
NCBI - http://www.ncbi.nlm.nih.gov/Genomes/
Not as well annotated
Covers a wider spectrum of organisms
Genomes as hypotheses
[Figure: number of annotated proteins in WormBase (axis 0-25,000)]
The C. elegans genome will continue to be annotated as more data is generated (e.g. Marc Vidal's ORFeome project)
Which organism has the largest genome?
H. influenzae - 1.8 Mbp
Mudpuppy - 50 Gbp
Ancylostoma - 100 Mbp
David and Victoria - 3 Gbp (each)
Fern - 160 Gbp
Elephant - 3 Gbp
Amoeba - 670 Gbp
Distribution of genome size
Increasing size
More 'complex' organisms do not necessarily have larger genomes
C-value paradox - due to 'junk' (repetitive) DNA
C-value enigma - what causes accumulation of junk ?
Smaller genomes may reflect a parasitic lifestyle
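The comparison above mixes Mbp and Gbp, so it can help to put every size on one scale before ranking. The following is a minimal sketch using only the sizes quoted on the slide; the names `GENOME_SIZES`, `to_bp` and `rank_by_size` are illustrative, and "Human" stands in for the "David and Victoria" entry.

```python
# Genome sizes as quoted on the slide, stored as (value, unit).
# "Human" replaces the "David and Victoria" entry (an illustrative label).
GENOME_SIZES = {
    "H. influenzae": (1.8, "Mbp"),
    "Mudpuppy": (50, "Gbp"),
    "Ancylostoma": (100, "Mbp"),
    "Human": (3, "Gbp"),
    "Fern": (160, "Gbp"),
    "Elephant": (3, "Gbp"),
    "Amoeba": (670, "Gbp"),
}

UNIT_TO_BP = {"Kbp": 1e3, "Mbp": 1e6, "Gbp": 1e9}


def to_bp(value, unit):
    """Convert a (value, unit) pair to base pairs."""
    return value * UNIT_TO_BP[unit]


def rank_by_size(sizes):
    """Return organism names sorted from largest to smallest genome."""
    return sorted(sizes, key=lambda name: to_bp(*sizes[name]), reverse=True)


if __name__ == "__main__":
    for name in rank_by_size(GENOME_SIZES):
        value, unit = GENOME_SIZES[name]
        print(f"{name:15s} {value:>6} {unit}")
```

With these figures the single-celled amoeba ranks first and the human genome sits level with the elephant, which is the C-value paradox in miniature.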
Genome comparisons – gene function in bacteria
For certain functions (e.g. translation / transcription) a basic complement of
proteins is required
For other functions, the number of proteins can vary and may be related to
genome size
Mycoplasma genitalium
Treponema pallidum
Borrelia burgdorferi
Helicobacter pylori
Methanococcus jannaschii
Haemophilus influenzae
Archaeoglobus fulgidus
Analysing the gene complement of an organism informs on its biology
e.g. bacterial transporters
H. influenzae and M. pneumoniae both occupy the human respiratory tract, yet use two very different strategies
– H. influenzae employs a battery of redundant processes that allow it to optimize survival.
– M. pneumoniae uses a generalist strategy of maintaining proteins that are more versatile because of their broad substrate range.
What can we find in a genome ?
Introns / Exons
5', 3' UTRs
Regulatory regions
Promoters
Enhancers
Repressors etc.
Single nucleotide polymorphisms (SNPs)
Genes!
Protein coding
RNA genes (tRNAs, rRNAs, microRNAs, snoRNAs, snRNAs...)
What can we find in a genome ?
Junk !?
Pseudogenes
Transposons
> 50% of human genome
Parasitic - spread through genome.
Viral origins (most are inactive)
47 types found in human genome
Tandem Repeats
microsatellites (1- 7bp)
(e.g. cacacacacacaca....)
minisatellites (typically <40bp)
satellites (140- 360 bp)
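Microsatellites such as the "cacacaca" example above can be located with a back-referencing regular expression. This is a minimal sketch, not a replacement for a dedicated repeat finder: the function name `find_microsatellites` and the threshold of four copies are assumptions made for illustration, while the 1-7 bp unit length comes from the slide.

```python
import re


def find_microsatellites(seq, max_unit=7, min_repeats=4):
    """Return (start, unit, copies) for perfect tandem repeats.

    A hit is a unit of 1..max_unit bases repeated at least min_repeats
    times in a row. The lazy quantifier prefers the shortest unit, so
    'cacacaca' is reported as 'ca' x 4 rather than 'caca' x 2.
    """
    seq = seq.lower()
    pattern = re.compile(
        r"([acgt]{1,%d}?)\1{%d,}" % (max_unit, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(seq):
        unit = match.group(1)
        copies = len(match.group(0)) // len(unit)
        hits.append((match.start(), unit, copies))
    return hits


if __name__ == "__main__":
    example = "ttg" + "ca" * 6 + "tgg"
    for start, unit, copies in find_microsatellites(example):
        print(f"position {start}: ({unit}) x {copies}")
```

Only perfect repeats are reported; real microsatellites often contain interruptions, which this sketch ignores.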
Prokaryotes / Eukaryotes
Prokaryotes - 526 genomes
Many bacteria have been sequenced due to importance
in medicine, agriculture and the food industry
80-90% protein coding
500Kbp-10Mbp
Typically 40-80 tRNAs
600-8000 ORFs
Average ORF size 925 bp
Introns virtually absent; few repetitive sequences; short
intergenic sequences (< Kbp); genes organised as operons
Eukaryotes - 66 genomes
Selected for sequencing on the basis of genome size as well as
importance in fundamental research.
< 2-70% protein coding
5Mbp – 600 Gbp
Typically 200-600 tRNAs
4000-40000 ORFs
Average ORF size 1.3-25 kbp
Many genes have introns; many repetitive sequences; large
intergenic regions (up to many Mbp); few operons
Bacterial Genome Features – Chromosome organization
Typically possess a single circular chromosome
a few possess >1 chromosome (e.g. Vibrio)
some possess linear chromosomes (e.g. Streptomyces)
some contain both a linear and a circular chromosome (e.g. Agrobacterium)
some plasmids have also been sequenced (e.g. Megaplasmid of D. radiodurans)
Streptomyces coelicolor
Actinomycete (soil bacteria)
Produces > 2/3 of all natural antibiotics
Sequenced 2002
Linear chromosome
Large for a bacterium (8.7Mb - 7825 genes)
[Figure: S. coelicolor chromosome map with tracks for contingency genes, essential genes, all genes and G + C content]
Bacterial Genomes – Streptomyces coelicolor
Comparison with other related organisms suggests the arms are novel compared with the rest of the sequence
Secondary metabolite “factories” associated with the ends of chromosome arms
Arisen through gene duplications
Create products aimed at knocking out other bacteria (soil environment highly competitive)
Bacterial Genomes – Gene organization
Transcription units are often organised as operons (25% of genes in E. coli)
Bacterial Genome Features – GC Content
Biases in GC content
Different bacterial species demonstrate altered G/C biases
G-C has an extra H-bond compared with AT - biological role ?
e.g. Thermophilic bacteria may require a higher GC content to withstand
higher temperatures
GC content
C. botulinum - 26%
H. influenzae - 38%
E. coli - 50%
T. thermophilus - 69%
Coding regions also tend to have higher G/C biases - can be exploited to find genes
[Figure: G + C content]
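GC content is simply the fraction of G and C bases in a sequence, and scanning it in windows is one way to look for the coding-region bias mentioned above. The sketch below is a plain-Python illustration; the function names and the default window and step sizes are assumptions, not values from the lecture.

```python
def gc_content(seq):
    """Return the fraction of bases in seq that are G or C (0.0 if empty)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_windows(seq, window=100, step=50):
    """Yield (start, gc_fraction) for each full window along seq."""
    for start in range(0, len(seq) - window + 1, step):
        yield start, gc_content(seq[start:start + window])


if __name__ == "__main__":
    toy = "ATATATATAT" * 5 + "GCGCGGCCGC" * 5 + "ATTTAAATAT" * 5
    for start, gc in gc_windows(toy, window=50, step=25):
        print(f"{start:4d} {gc:.2f}")
```

Windows with GC content well above the genome average are candidates for closer inspection, although GC bias alone is a weak gene predictor.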
Bacterial Genomes – Repetitive elements
Repeat sequences are rarer in prokaryotes than eukaryotes – in E. coli for example:

Type                                    Number in genome   Size      % of genome
IS (simple insertion sequence)          50                 <2 kb     1.5
Rhs (recombinational hotspot)           5                  6-10 kb   0.8
REP (repetitive palindromic sequence)   581                38 bp     0.5
Chi (cross-over hotspot)                ~1000              8 bp      0.2
IRU (intergenic repeat unit)            19                 126 bp    0.05
The function of some of these repeats has been identified
Chi sequences are implicated in homologous recombination
REP elements are palindromes and have been implicated in supercoiling
Some of these sequences have been identified in other bacteria
IS elements are common
REP elements have been found in N. meningitidis
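The "% of genome" figures for the E. coli repeats follow from copy number times element size divided by genome size. The sketch below checks the three rows that have an exact element size; the genome size of about 4.64 Mbp for E. coli K-12 is an assumption brought in from outside the slides, and `percent_of_genome` is an illustrative name.

```python
# Assumed genome size for E. coli K-12 (not stated on the slides).
ECOLI_GENOME_BP = 4_640_000

# (copies in genome, element size in bp) taken from the E. coli repeat figures.
REPEATS = {
    "REP": (581, 38),
    "Chi": (1000, 8),
    "IRU": (19, 126),
}


def percent_of_genome(copies, size_bp, genome_bp=ECOLI_GENOME_BP):
    """Return the percentage of the genome occupied by a repeat family."""
    return 100 * copies * size_bp / genome_bp


if __name__ == "__main__":
    for name, (copies, size_bp) in REPEATS.items():
        print(f"{name}: {percent_of_genome(copies, size_bp):.2f}% of genome")
```

Under that assumed genome size the three results round to the quoted 0.5, 0.2 and 0.05 percent.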
Eukaryotic Genomes – Chromosome organization
Genomes organised into chromosomes
Number varies between species with little if any correlation with complexity
S. cerevisiae - 16
C. elegans - 6
Humans - 23
Crayfish - 100
Complementary strands typically have similar numbers of genes, but there are striking exceptions in Leishmania major and related trypanosomes (protozoan parasites)
Thought to be related to specialised transcriptional processes – gene expression regulated primarily by post-transcriptional mechanisms?
Eukaryotic Genomes – Some examples
The first eukaryotic genome (S. cerevisiae)
13.4 Mb, 70% coding
16 chromosomes – 5570 genes (6300 originally)
2kb per gene
275 tRNAs
Relatively compact - small intergenic regions
32% of genes have a homolog to another gene within yeast (paralog)
S. cerevisiae underwent a whole genome duplication event
The first multicellular genome (C. elegans)
100 Mb, 25% coding
6 chromosomes – 20000 genes
5kb per gene
584 tRNAs
25% of genes are organized as operons
Almost 60% of genes appear 'specific' to nematodes
Essential genes are at the center of chromosomes
The human genome
3.2 Gb, 2% coding
23 chromosomes – 25000 genes (?)
100kb per gene
497 tRNAs
Genes appear more 'complex'
More domains, more domain architectures
More introns and hence more alternative splicing
- Responsible for our biological complexity?
Yeast genes are less dense at the ends of chromosomes
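The "kb per gene" figures quoted for yeast, C. elegans and human are just genome size divided by gene count, and the coding percentages give a rough amount of coding sequence per gene. A minimal sketch using only the numbers from the slides (the names `GENOMES`, `kb_per_gene` and `coding_bp_per_gene` are illustrative):

```python
# Genome size (bp), gene count and coding fraction as quoted on the slides.
GENOMES = {
    "S. cerevisiae": {"size_bp": 13.4e6, "genes": 5570, "coding": 0.70},
    "C. elegans": {"size_bp": 100e6, "genes": 20000, "coding": 0.25},
    "H. sapiens": {"size_bp": 3.2e9, "genes": 25000, "coding": 0.02},
}


def kb_per_gene(size_bp, genes):
    """Average amount of genome per gene, in kb."""
    return size_bp / genes / 1000


def coding_bp_per_gene(size_bp, genes, coding):
    """Average amount of coding sequence per gene, in bp."""
    return size_bp * coding / genes


if __name__ == "__main__":
    for name, g in GENOMES.items():
        spacing = kb_per_gene(g["size_bp"], g["genes"])
        coding = coding_bp_per_gene(g["size_bp"], g["genes"], g["coding"])
        print(f"{name:14s} {spacing:7.1f} kb per gene, ~{coding:.0f} bp coding per gene")
```

The human value works out nearer 128 kb than the rounded 100 kb on the slide; either way the spacing per gene grows far faster than the coding sequence per gene.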
Eukaryotic Genomes – Centromeres and telomeres
Centromeres mediate interactions between sister chromatids and the kinetochore during replication
In budding yeast, centromeres are 125 bp in length and contain specific sites for binding kinetochore proteins.
In humans, the centromere is composed of hundreds of thousands of copies of a 171 bp repeat that directs heterochromatin assembly, which replaces sequence-specific binding sites
Telomeres are found at the ends of chromosomes and are composed of simple tandem repeats which protect the integrity of the ends
They are dynamic – for many cell types they shrink during every round of replication. This limits the number of times the cell can divide
Eukaryotic Genomes – Introns
Number of introns varies between species
Yeast - only 4% of genes have introns; C. elegans - 'most' genes have introns; Human - all genes have introns (except histones)
Comparisons of conserved genes across different eukaryotes reveal interesting patterns of intron gain and loss.
~80% of introns in humans are conserved in sea anemone
Fly, worms and sea squirts have lost ~50-90% of their ancestral introns
[Figure: examples of intron conservation and loss: ancient intron conserved; ancient intron lost in flies/worms; ancient intron lost in flies, worms, fungi and sea squirt; animal intron lost in flies and worms]
Eukaryotic Genomes – Repeats
Eukaryotes contain a high proportion of repeated sequences; these include transposons and related elements
Transposons are elements which can move around the genome, potentially leading to:
mutations (insertions in genes)
increasing (or decreasing) amount of DNA
Class I (Retrotransposons)
- use RNA as an intermediary
LINEs – Long interspersed elements
SINEs – Short interspersed elements
HIV
Class II (Transposons)
- use only DNA
P elements (Drosophila)
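Eukaryotic Genomes – Repeats
Incidence of repeats (% of genome) in four genomes; the species names for the columns are not preserved in this transcript

                            Genome 1   Genome 2   Genome 3   Genome 4
LINE/SINE                   0.5%       0.4%       4.7%       34%
Retrovirus-type sequences   4.8%       0          6.4%       8%
Transposon type sequences   5.1%       5.3%       3.6%       3%
Total                       10.5%      6.5%       14.9%      45%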
Retrotransposons and the C-value Paradox
The genome of Arabidopsis thaliana contains 125 Mbp of DNA. This includes a small number of retrotransposons and about 25,000 functional genes.
The maize (corn) genome contains about 20 times more DNA (2.4 Gbp).
50% of the corn genome is made up of retrotransposons.
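The Arabidopsis and maize figures can be checked with a few lines of arithmetic. This is a minimal sketch using only the numbers quoted above; the variable and function names are illustrative.

```python
ARABIDOPSIS_BP = 125e6        # 125 Mbp
MAIZE_BP = 2.4e9              # 2.4 Gbp
MAIZE_RETRO_FRACTION = 0.50   # share of the maize genome made of retrotransposons


def fold_difference(larger_bp, smaller_bp):
    """How many times larger one genome is than another."""
    return larger_bp / smaller_bp


# Maize DNA attributable to retrotransposons, and what is left without them.
maize_retro_bp = MAIZE_BP * MAIZE_RETRO_FRACTION
maize_other_bp = MAIZE_BP - maize_retro_bp

if __name__ == "__main__":
    print(f"maize / Arabidopsis: {fold_difference(MAIZE_BP, ARABIDOPSIS_BP):.1f}x")
    print(f"maize retrotransposon DNA: {maize_retro_bp / 1e9:.1f} Gbp")
    print(f"rest of maize / Arabidopsis: {fold_difference(maize_other_bp, ARABIDOPSIS_BP):.1f}x")
```

The ratio comes out at about 19, consistent with the "20 times" on the slide, and the retrotransposon share alone accounts for 1.2 Gbp of the difference.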
Most of the 250 Gbp of DNA in the genome of the fern Psilotum nudum is presumably "junk" DNA.