Genomes
Daniel Lawson
VectorBase @ EBI
August 2008
Bioinformatics Tools for Comparative Genomics of Vectors
Bioinformatic Tools for Comparative
Genomics of Vectors



Tuesday
▪ 10:30 - 13:00 Genome sequencing
▪ 14:00 - 16:00 Genome annotation
▪ 16:30 - 18:00 Practical
Wednesday
▪ 9:30 - 10:00 Review genome annotation
▪ 10:30 - 13:00 Comparative genomics I
▪ 14:00 - 16:00 Comparative genomics II
▪ 16:30 - 18:00 Practical
Thursday
▪ 8:30 - 9:00 Review comparative genomics
▪ 9:00 - 10:00 VectorBase lecture
Tuesday
▪ Genome sequencing
  ▪ Strategies
  ▪ New technologies
  ▪ ‘Finished’ versus ‘Accessible’ genomes
▪ Genome annotation
  ▪ Aims and realistic goals
  ▪ Genefinding
  ▪ Adding value to the gene predictions (descriptions, xref to other data)
▪ Practical
  ▪ Artemis practical
  ▪ IGGI assignments
Wednesday
Thursday
Tuesday
Wednesday
▪ Comparative genomics
  ▪ Gene synteny (ortholog/paralog determination)
  ▪ Feedback to genome annotation
  ▪ Genetrees
▪ Practical
  ▪ ACT practical
  ▪ IGGI assignments
Thursday
Tuesday
Wednesday
Thursday
▪ VectorBase
Some terminology
▪ Genome
  Hereditary information of an organism encoded in the DNA
▪ Chromosome
  Single large macromolecule of DNA
▪ Contig
  Single contiguous section of DNA (a set of overlapping DNA segments derived from a single genetic source)
▪ Supercontig (or scaffold)
  Ordered (and orientated) assembly of contigs
▪ Clone
  Defined segment of DNA to be used for some purpose
▪ Expressed sequence tag (EST)
  Short sequence of a transcribed, spliced nucleotide sequence. Widely used to identify gene transcripts
Genome size & complexity
Increasing complexity: Viruses → Bacteria → Protozoa → Invertebrates → Mammals → Plants
Issues for consideration when sequencing:
▪ DNA source (haplotype issues)
▪ Genome size
▪ Repeat content
▪ Duplications and inversions
Issues for consideration when annotating:
▪ Genome size
▪ Repeat content
▪ Splicing (cis and trans)
▪ Genefinding resources (e.g. ESTs)
▪ Likely comparator species
Genome sequencing
Sequencing involves:
▪ Fragmenting DNA into small pieces
▪ Sequence determination
▪ Assembly into large contiguous sequences
Problems occur in:
▪ Cloning steps
▪ Bacterial transformation and amplification
▪ Sequencing chemistry (GC compressions, homopolymer runs)
▪ Assembly of repetitive regions
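To illustrate the assembly step, and why repetitive regions cause trouble, here is a minimal Python sketch of greedy overlap merging. It is a toy under stated assumptions (error-free reads, a single strand, exact suffix/prefix overlaps) and is not the method of any particular assembler; the function names and the example reads are made up for illustration.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is also a prefix of b.

    Returns 0 if no overlap of at least min_len bases exists.
    """
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop duplicate reads, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs  # nothing left to merge
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs)
                   if k not in (best_i, best_j)] + [merged]


# Hypothetical reads sampled from one short sequence.
reads = ["TAGCTGTG", "TGTGTACG", "TACGATGATC"]
print(greedy_assemble(reads))  # ['TAGCTGTGTACGATGATC']
```

When a repeat is longer than the reads, several candidate merges score equally well and a greedy choice can join the wrong pieces, which is one reason repetitive regions are listed as an assembly problem.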
Sequencing a Genome
[Figure: diagram with parts numbered 1 to 13; labels not recoverable]
Sequence coverage
Most genome sequences are not complete (not finished). Whole Genome Shotgun (WGS) assemblies are described as having X-fold coverage, i.e. each base is covered by X reads on average.
Low coverage (2x) is sufficient for gene discovery and some regulatory element identification.
High coverage (6x) is good for gene annotation. There will still be some missing genes.
Finished sequence has no gaps and is presumed to contain all genes.
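As a rough guide to why even 6x coverage leaves gaps, the sketch below computes fold coverage as (number of reads × read length) / genome size and then the expected fraction of bases hit by no read at all. The second step assumes reads land uniformly at random (the idealised Lander-Waterman/Poisson model); that model, the function names and the example numbers are assumptions for illustration, not taken from the slides.

```python
import math


def fold_coverage(num_reads, read_length, genome_size):
    """Average number of reads covering each base."""
    return num_reads * read_length / genome_size


def expected_fraction_missed(coverage):
    """Poisson estimate of the fraction of bases covered by zero reads."""
    return math.exp(-coverage)


# Hypothetical project: 2 million reads of 600 bp on a 200 Mb genome.
c = fold_coverage(2_000_000, 600, 200_000_000)
print(f"{c:.1f}x coverage")

for x in (2, 6):
    print(f"{x}x: about {expected_fraction_missed(x):.2%} of bases unsequenced")
```

Under this model about 13.5% of bases are missed at 2x and about 0.25% at 6x. Real projects usually do worse than the model because cloning and sequencing biases make coverage uneven.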
Sequence strategies
Sequencing technologies and strategies for genomic sequencing are constantly changing (improving).
▪ Genomic clones in an ordered ‘clone by clone’ approach
▪ Whole Genome Shotgun (WGS)
  ▪ Traditional Sanger sequencing long reads
  ▪ New short-read technologies
▪ Hybrid WGS strategies
  ▪ Reduced representation WGS using short-read technologies
  ▪ Mixture of Solexa/454 reads and large-insert clone ends
» How big a piece of DNA can we assemble with confidence?
Sequencing the Human Genome
▪ Chromosomes: 24
▪ Overlapping BACs: 354,510
▪ Tiling set: 29,298
▪ 4-5x shotgun sequence & computer assembly
▪ Draft sequence (~15 contigs per clone): ……..TAGCTGTGTACGATGATC……….
▪ 4-5x more shotgun, gap closure and problem solving, i.e. “Finishing”
▪ Finished sequence: 1 contig, less than one error in 10,000
Sequencing data
Output from an automated DNA sequencing machine used by the Human
Genome Project to determine the complete human DNA sequence.
Advanced Technologies
1992-1999
Sequencer: gel ABI 373/377
2 or 3 runs per day, 36 to 96 samples
100kb of information per machine per day
80 people
2000
Sequencer: capillary ABI 3700
8 runs per day, 96 samples
400kb of information per machine per day
40 people
2004
Sequencer: capillary ABI 3730xl
15/40 runs per day, 96 samples
2 Mb of information per machine per day
10 people
Sequencing by synthesis
▪ Solexa/Illumina sequencing platform.
▪ DNA fragments ligated with adaptors and attached to a flow cell.
▪ Solid state amplification of the sequence (approx. 1000 fold) to form dense (less than 1 micron) spots.
▪ Can achieve very high spot densities (up to 10 million clusters per cm²).
▪ Use labeled reversible terminators and laser excitation to determine incorporated bases
▪ No cloning step improves representation of the genome
▪ No issues relating to homopolymer runs

▪ Read lengths are short, approx. 30-40 bp
▪ Throughput is in the order of 100 Mb per run
▪ 8 samples per flow cell
Solexa sequencing
Pyrosequencing (454)
▪ Nebulized or adapter-ligated DNA fragments are attached to beads
▪ PCR amplification step
▪ Each DNA-bound bead is placed into a picotiter plate where the DNA synthesis will take place
▪ Measure incorporation of a nucleotide using the light produced via the luciferase enzyme (nucleotide incorporation releases pyrophosphate, which is converted to ATP by ATP sulfurylase and consumed by luciferase, producing light). However, the signal strength for homopolymer stretches is linear only up to eight consecutive nucleotides, after which the signal falls off rapidly
▪ Can deal with high GC composition
▪ No cloning step improves representation of the genomic sequence

▪ Read lengths are approx. 100 bp
▪ Throughput is currently in the order of 20 million bp per run
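Because the 454 signal is linear only up to eight identical consecutive bases, it can be useful to know where longer runs occur in a reference or expected sequence. A minimal Python sketch that lists homopolymer runs of nine or more bases; the function name, the threshold parameter and the example sequence are hypothetical.

```python
from itertools import groupby


def homopolymer_runs(seq, min_len=9):
    """Return (start, base, length) for each run of at least min_len identical bases."""
    runs = []
    pos = 0
    for base, group in groupby(seq):
        n = len(list(group))
        if n >= min_len:
            runs.append((pos, base, n))
        pos += n
    return runs


# Made-up sequence with a 10-base A run and a 9-base C run.
example = "ACG" + "A" * 10 + "TTTT" + "C" * 9
print(homopolymer_runs(example))  # [(3, 'A', 10), (17, 'C', 9)]
```

Start positions are 0-based. Runs flagged this way are the places where a pyrosequencing read length call is least reliable.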

Comparison of sequencing technologies
Platform   Read length (bp)   Throughput (Mb)   Cost (cent/base)
Sanger     500-800            ~0.1              1
454        ~100 †             20 †              0.1
Solexa     ~30                ~100              0.0001
† New FLX upgrades should increase read lengths to 300 bp and throughput to approximately 100 Mb
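Taking the per-platform figures in the comparison above at face value, the following back-of-envelope Python sketch estimates how many runs and what raw per-base cost a given coverage would need. It assumes the throughput figures are per run, ignores read length, failed runs and labour, and uses a hypothetical 250 Mb genome at 6x coverage; treat it as arithmetic on the quoted numbers, not a costing method.

```python
import math

# (throughput in Mb per run, cost in cents per base), as read from the comparison.
# Treating throughput as "per run" is an assumption.
PLATFORMS = {
    "Sanger": (0.1, 1.0),
    "454": (20.0, 0.1),
    "Solexa": (100.0, 0.0001),
}


def runs_and_cost(genome_mb, coverage, throughput_mb, cents_per_base):
    """Return (number of runs, total cost in cents) for the requested coverage."""
    total_mb = genome_mb * coverage
    # round() guards against floating-point noise before taking the ceiling
    runs = math.ceil(round(total_mb / throughput_mb, 6))
    cost_cents = total_mb * 1_000_000 * cents_per_base
    return runs, cost_cents


# Hypothetical project: 250 Mb genome sequenced to 6x.
for name, (throughput, cents) in PLATFORMS.items():
    runs, cost = runs_and_cost(250, 6, throughput, cents)
    print(f"{name}: {runs} runs, about {cost:,.0f} cents")
```

The point is the orders of magnitude between platforms; it says nothing about whether the resulting short reads can be assembled.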
New technologies need new assembly algorithms
Just as the transition from the ‘clone by clone’ approach to Whole Genome Shotgun spawned new algorithms for sequence assembly, the increasing use of short-read technologies requires new assembly algorithms.
Genomic clones (30-300 kb)
▪ Phrap
Chromosomes/Genomes using Sanger long-read technologies (<1000 Mb)
▪ TIGR assembler
▪ ARACHNE
▪ JAZZ
▪ PCAP
▪ Phusion
Genomes using short-read technologies (< 10 Mb)
▪ Velvet
▪ SHARCGS
▪ ABySS
Some terminology
▪ N50
  Measure of genome assembly quality. The N50 value is the length for which 50% of the assembled nucleotides lie in contigs (or supercontigs) of at least that length. Commonly two N50 values are quoted:
  N50 contig length - a measure of how well individual reads assemble
  N50 supercontig length - a measure of the general quality of the assembly
▪ Contig
  Single contiguous section of DNA (a set of overlapping DNA segments derived from a single genetic source)
▪ Supercontig (or scaffold)
  Ordered (and orientated) assembly of contigs
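The N50 definition above can be computed directly from a list of contig (or supercontig) lengths. A minimal Python sketch, using the common convention that N50 is the length of the contig at which the running total, taken from longest to shortest, first reaches half of the assembled bases; the function name and example lengths are made up.

```python
def n50(lengths):
    """N50 of a collection of contig or supercontig lengths."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("no lengths given")


# Hypothetical assembly of six contigs (lengths in kb), 290 kb in total.
contigs = [80, 70, 50, 40, 30, 20]
print(n50(contigs))  # 70
```

Here the two longest contigs hold 150 of the 290 kb, so the N50 is 70 kb. The same function gives the supercontig N50 when passed supercontig lengths.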
High-throughput technology leads to lower quality assembled genomes
▪ Few genomes are completely sequenced. The completion and quality assurance needed for bacterial genomes is expensive, for larger eukaryotes even more so.
▪ ‘Finishing’ is the process by which a WGS assembly is completed (determining the sequence of any physical or sequence gaps) and further polished to remove ambiguities in the base calls and to reflect repetitive regions accurately.
▪ New sequencing technologies provide better representation of the genome (by removing cloning steps) and deeper coverage but are harder to assemble because of the short read lengths.
▪ People now talk about the ‘accessible’ genome for a species. This simply means the output from a reasonably deep sequence shotgun after assembly and limited (mainly computational) processing and improvements.
» Trade off between throughput and product quality.
Sequence substrates
▪ What is the product of a genome assembly?
▪ What is the starting material for a genome annotation?
  ▪ Completed chromosome/genome
  ▪ Genomic clones
  ▪ Ordered supercontigs
  ▪ Unordered supercontigs
  ▪ Clustered EST sequences†
Sequencing substrates
[Figure: sequence substrates, showing chromosome, genomic clones, supercontigs, contigs, unordered supercontigs and clustered ESTs]
Genome sequencing
Annotation quality depends on:
▪ Fragmentation of assembly
▪ Sequencing errors
▪ Poorly represented sequence regions
▪ Extensive simple repeat sequences
▪ Large number of transposon sequences
▪ Haplotype problems
▪ Contaminants (e.g. bacterial or viral sequences)
Genome annotation - the goal!
▪ Defining important features of the genome sequence
▪ Labelling/describing features of the genome
▪ 'Adding value' to the genome sequence

▪ Annotation is an ongoing process
▪ Annotation is almost always incomplete

▪ Set of ‘Best guess’ gene predictions
▪ Short description of the putative function for each prediction
▪ Species/Group dependent catalog of other data types
Annotation from a genome project perspective
▪ Initial ‘first pass’ annotation run prior to publication
▪ Subsequent curation is a collaboration with the community
▪ Focused on protein-coding genes
▪ ‘Best guess’ predictions
▪ Little emphasis on transposons or pseudogenes
▪ Predicting gene loci is more important than getting 100% correct gene structure predictions
Manual v Automated annotation
[Figure: manual versus automated annotation; the only surviving labels are four instances of ‘Genes’]
Manual v Automated: Pros & Cons
Speed
Coverage
Accuracy
Met’s & STOPs
Reproducibility
[Table: the entries for the manual and automated columns, three of them marked *, were lost in extraction]
Manual (re)annotation - Bridges……
“Paint the Bridge”
▪ Classic “First-pass” annotation strategy
▪ Annotate genomic regions by walking through the chromosome/clone/slice
▪ Comprehensive but slow to deal with problem genes
“Painting by numbers”
▪ Identify problem genes by scripts to generate lists for manual appraisal
▪ Responsive to community submissions but only as good as the list generation script
Automated (re)annotation: Ensembl
Ensembl builds the bridge anew with each gene build
▪ Responsive to new data
▪ Questions of prediction “churn”
Manual v Automated approaches

Involvement of the community to improve gene prediction accuracy and functional calls

Moderated submissions - (WormBase, FlyBase)
▪ Integration time is dependent on database release cycles

Direct submissions - (VectorBase)
▪ Presentation via DAS onto genome browser
▪ Moderated before integration
▪ Integration time is relatively slow

Indirect submissions - (EMBL/GenBank/DDBJ)
▪ Submissions to public nucleotide databases will get reflected in the genome annotation - eventually!
▪ Processed to protein databases and then integrated
Genome annotation - building a pipeline
Genome sequence
Map repeats
Map ESTs
Map Peptides
Genefinding
nc-RNAs
Protein-coding genes
Functional annotation
Release
Genome annotation - predicting genes
[Diagram labels, with the tools used for each prediction source in parentheses]
▪ Blessed predictions
▪ Canonical predictions
▪ Manual annotations (Apollo)
▪ Community submissions (Genewise, Exonerate, Apollo)
▪ Species-specific predictions (Genewise)
▪ Similarity predictions (Genewise)
▪ Protein family HMMs (Genewise)
▪ ncRNA predictions (Rfam)
▪ Transcript based predictions (Exonerate)
▪ ab initio gene predictions (SNAP)
Annotation