BCM-2002
Concepts and methods in genome
assembly and annotation
B. Franz LANG, Département de Biochimie
Office: H307-15
Email: [email protected]
Outline
1. What is genome assembly?
2. What is genome annotation?
3. Annotating protein coding genes and introns
4. Prediction of RNA genes
1. What is genome assembly?
Stitching together sequence reads into contigs (up to complete chromosome size), which is required for the identification of complete genes and their annotation.
Assembly also provides information on the genome architecture (linear or circular chromosomes, their number, etc.).
Contigs may be up to millions of nucleotides in size.
An average read coverage above 10x is required for decent assemblies.
Long reads, or paired-end reads of long DNA fragments, permit linking contigs bordered by repeat regions (true links, or scaffolds in which missing sequence of predicted length is written as NNNNN…).
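The coverage requirement can be made concrete with a small calculation. The sketch below assumes the usual definition of average coverage as total sequenced bases divided by genome size; the function names and the example numbers (100-nt reads, a 70,800-nt genome) are illustrative assumptions, not values prescribed by the course.

```python
import math


def average_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average coverage = total sequenced bases / genome size."""
    if genome_size <= 0:
        raise ValueError("genome_size must be positive")
    return num_reads * read_length / genome_size


def reads_needed(target_coverage: float, read_length: int, genome_size: int) -> int:
    """Smallest number of reads that reaches the target average coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


if __name__ == "__main__":
    # Hypothetical run: 10,000 reads of 100 nt on a 70,800-nt genome.
    print(round(average_coverage(10_000, 100, 70_800), 1))
    print(reads_needed(10, 100, 70_800))
```

Average coverage says nothing about how evenly reads are distributed, so a genome can meet the 10x threshold on average and still contain poorly covered regions.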
Genome assembly at a higher level
• Genomes may be DNA or RNA (double- or single-stranded)
• An organism may have several genomes (e.g., in eukaryotes, organelles)
• Genomes may consist of more than one physical unit (chromosomes)
• Chromosomes may be circular, monomeric linear, or linear with the unit directly repeated several times (e.g., the product of rolling circle replication, which appears as circular-mapping in sequence assembly!)
Circular-mapping concatamers; only the replicative form is truly circular
How genome assembly of real (dirty) data works
Given sequence read information (Sanger, Illumina, PacBio …) an
algorithm is required to combine more or less perfectly overlapping
sequence into a genome sequence
• Overlap-join procedures. Slow, but allow the use of error-prone sequencing technologies like 454, which in turn may introduce errors into the assembly (e.g., frameshifts with 454).
  Examples of software: Phrap, Consed, Newbler, MIRA, Celera.
• Eulerian algorithms based on graphs. Very fast, but work best with reads that have little sequence error or variation. Huge datasets (Illumina) can be processed. An important feature is the use of sequence coverage across the graph to remove assembled regions that are due to experimental error, contaminant reads from other genomes, etc.
  Examples of software: Velvet, SOAPdenovo, ABySS, ALLPATHS, SPAdes.
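The overlap-join idea can be illustrated with a toy greedy assembler. This is a minimal sketch that assumes error-free reads from one strand and exact suffix/prefix overlaps. It is not how Phrap or any other named program works internally, and the function names are hypothetical.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a that equals a prefix of b, or 0 if shorter than min_len."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the longest exact overlap."""
    contigs = list(dict.fromkeys(reads))  # drop duplicate reads, keep order
    while True:
        best = (0, -1, -1)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            return contigs  # nothing left to join
        merged = contigs[i] + contigs[j][n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)] + [merged]


if __name__ == "__main__":
    print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"]))
```

The all-against-all comparison in every round is what makes this family of methods slow on large read sets; real implementations also tolerate mismatches and use both strands.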
Graph algorithms for assembly
(a) Sequence. (b) Traditional assembly, walk through a Hamiltonian cycle. (c) Variant after splitting reads into short k-mers (e.g., k = 3). (d) Modern de Bruijn graph, finding the sequence more quickly via an Eulerian cycle.
P. Compeau, P. Pevzner & G. Tesler (2011) Nature Biotechnology 29: 987-91
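The de Bruijn approach in the figure can be sketched in a few lines. The example below is a simplified illustration, assuming error-free reads from a single strand and a genome without repeats longer than k-1. It collapses duplicate k-mers and uses Hierholzer's algorithm to find the Eulerian path. The function names are hypothetical and not taken from any assembler.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Each distinct k-mer is an edge from its (k-1)-mer prefix to its (k-1)-mer suffix."""
    graph = defaultdict(list)
    kmers = {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}
    for kmer in sorted(kmers):
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm; assumes the graph actually has an Eulerian path."""
    in_deg = defaultdict(int)
    for targets in graph.values():
        for t in targets:
            in_deg[t] += 1
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next((n for n in graph if len(graph[n]) - in_deg[n] == 1), next(iter(graph)))
    edges = {n: list(v) for n, v in graph.items()}
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if edges.get(node):
            stack.append(edges[node].pop())
        else:
            path.append(stack.pop())
    return path[::-1]


def assemble(reads, k):
    """Spell the sequence along the Eulerian path."""
    path = eulerian_path(de_bruijn(reads, k))
    return path[0] + "".join(node[-1] for node in path[1:])


if __name__ == "__main__":
    print(assemble(["ATGGCGT", "GGCGTGC", "CGTGCA"], 4))
```

With k = 4 the path through these reads is unique. With k = 3 the same reads give a graph with two valid Eulerian paths, a small-scale version of the ambiguity that repeats cause in real assemblies.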
How genome assembly of real (dirty) data works
Given sequence read information (Sanger, Illumina, PacBio …) an
algorithmic approach is required to:
• Discard information from contaminating DNA, primers and adapters
  • If at a low level, a sequence coverage cut-off will resolve the issue
• Resolve repeat regions of all kinds that constitute assembly conflicts
  • Mobile genetic elements and other short repeated DNA segments
  • Segmental genome duplication
  • Diploid, aneuploid … genomes with sequence differences between alleles (SNPs, ‘snips’)
  • Whole genome duplication followed by genetic drift of one copy or its partial loss
  This requires sequence from large DNA fragments, chromosome size mapping, or other physical or biological genome information
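A coverage cut-off of the kind mentioned for low-level contamination can be sketched with k-mer counts. This is a deliberately simple illustration. Real assemblers typically apply such cut-offs to graph nodes or contigs rather than to whole reads, and the function names and threshold here are assumptions.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer over all reads."""
    return Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))


def drop_low_coverage(reads, k, min_count):
    """Keep reads whose k-mers all occur at least min_count times in the data set.

    Reads shorter than k contain no k-mers and are kept unchanged.
    """
    counts = kmer_counts(reads, k)
    return [r for r in reads
            if all(counts[r[i:i + k]] >= min_count for i in range(len(r) - k + 1))]


if __name__ == "__main__":
    reads = ["ATGGCG"] * 5 + ["TTTTAA"]  # one rare read, e.g. a contaminant
    print(drop_low_coverage(reads, k=4, min_count=2))
```

The same filter also removes reads carrying rare sequencing errors, which is why it only works when the contaminant is present at clearly lower coverage than the genome of interest.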
How genome assembly of real data works
• Resolve chromosome architecture (multiple genomes and chromosomes,
linear, circular, or circular-mapping concatamers)
This issue usually needs manual input from an expert who has additional
molecular information
2. What is genome annotation?
Finding and precise positional prediction of all genes, other genetic
elements, insertion elements and repeats, on a given genome sequence
• Species may contain more than one genome (e.g., nuclear, mitochondrial,
chloroplast, virus/phage, plasmid …)
• The genetic code and gene expression signals may differ from one genome
to another - needs info on gene expression at the RNA and/or protein level
• Genes may be contiguous, or disrupted by introns, as well as discontinuous
(trans-spliced or in pieces).
• Based on comparative gene/intron predictions (gene models, bioinformatic
inference); information on transcript and protein sequence and other
biological facts (e.g., enzymatic or genetic studies) is usually required
• List of features at the sequence level (e.g., GenBank submission file)
• Genetic maps
What is genome annotation?
COMMENT     Complete mitochondrial genome.
FEATURES             Location/Qualifiers
     source          1..70800
                     /organism="Glomus irregulare"
                     /organelle="mitochondrion"
                     /mol_type="genomic DNA"
                     /strain="DAOM-197198"
                     /type="genomic"
     gene            465..6402
                     /gene="rnl"
     rRNA            join(465..1120,1565..1634,2692..3053,3455..4262,
                     4582..4624,4989..5658,5998..6402)
                     /gene="rnl"
                     /product="large subunit ribosomal RNA"
     exon            465..1120
                     /gene="rnl"
                     /number=1
     intron          1121..1564
                     /gene="rnl"
                     /note="Group IA3"
                     /number=1
     exon            1565..1634
                     /gene="rnl"
                     /number=2
     intron          1635..2691
                     /gene="rnl"
                     /note="Group IA3"
                     /number=2
     gene            1976..2584
                     /gene="orf202"
     CDS             1976..2584
                     /gene="orf202"
                     /codon_start=1
                     /transl_table=4
                     /product="hypothetical protein"
                     /translation="MKSPNPQPALSSIQREILVGGLLGDLSIYRAKVTHNARLYVQQG
                     SVHKEYLNHLYSVFQNLCSSEPKWSLSLDKRSNTTYETLRFNSRSLPCFNYYRDVFYP
                     EGVKIVPANIGELLTARGLAYWSMDDGYKDRGNFRLATQSFSRNDVLLLIKLLKDNFS
                     LDCSLNTVKSTQYRIYVRANSMVQFRALVSPYFHPSMLYKLQ"
     exon            2692..3053
                     /gene="rnl"
                     /number=3
     intron          3054..3454
                     /gene="rnl"
                     /note="Group IB"
                     /number=3
… and so on …
Example: Partial GenBank annotation of a mitochondrial genome
(rRNA gene with introns and a predicted protein coding sequence)
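Feature locations such as the `join(...)` of the rnl rRNA are easy to process programmatically. The sketch below is a minimal parser, assuming only simple `start..end` and `join(start..end,...)` locations on the forward strand. It does not handle `complement(...)`, partial-location markers, or remote references, for which an established GenBank parser is the better choice. The function names are hypothetical.

```python
import re


def parse_location(location: str) -> list[tuple[int, int]]:
    """Return 1-based inclusive (start, end) pairs from 'a..b' or 'join(a..b,...)'."""
    text = "".join(location.split())  # remove whitespace and line breaks
    return [(int(a), int(b)) for a, b in re.findall(r"(\d+)\.\.(\d+)", text)]


def introns_between(exons: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Intervals that lie between consecutive exons."""
    return [(end1 + 1, start2 - 1) for (_, end1), (start2, _) in zip(exons, exons[1:])]


if __name__ == "__main__":
    rnl = """join(465..1120,1565..1634,2692..3053,3455..4262,
                  4582..4624,4989..5658,5998..6402)"""
    exons = parse_location(rnl)
    print(exons[:3])
    print(introns_between(exons)[:3])
```

The first three derived intervals can be checked against the intron features listed in the record (1121..1564, 1635..2691 and 3054..3454).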
What is genome annotation?
Example: Genetic maps of two mitochondrial genomes
Effect of completeness, sequencing error, and assembly
artifacts on genome annotation
Incomplete genome assembly (‘draft genome’)
• Annotation of genes and genetic elements is somewhat incomplete, but still works for the bulk of gene identification, expression studies and comparisons
Systematic sequence error (technology-specific)
• 454: the number of nucleotides in homopolymer sequences is incorrect. This causes difficulties in genome assembly at these sites, and potential frame-shifting in protein coding genes that may therefore remain unidentified
• Sanger: difficulties in resolving snap-back structures; termination and/or slippage at long homopolymers. Same consequences as above, but less severe and less often within genes
• Illumina: uncertain sequence at certain sequence motifs such as GGCNN, which seems to be less of a problem with the latest technology. Error prediction and correction is possible.
• Ion Torrent, Pacific Biosciences: overall high error rate, which may to some degree be corrected by using very deep coverage (this fails if polymorphic sites/SNPs are of interest, because errors and SNPs are hard to distinguish)
3. Annotation of protein coding genes and introns
• First, one needs to know, or infer, the genetic code
• Translate Open Reading Frames (ORFs) that are not interrupted by a stop codon and that start with a known initiation codon (ATG, GTG …)
• ORFs may be given a functional identity by sequence comparison to known genes. Protein sequence data can be used to confirm that the ORF is actually translated and to identify the genetic code.
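The ORF step can be sketched as follows. This is a minimal forward-strand scan, assuming translation table 4 (as in the `/transl_table=4` qualifier of the GenBank example), in which TGA codes for tryptophan instead of stop. The default start codons ATG and GTG are taken from the slide and the initiator is written as M. The function names and the `min_codons` threshold are illustrative assumptions.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
STANDARD_AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"


def codon_table(overrides=None):
    """Standard code as a dict, optionally with reassigned codons."""
    table = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), STANDARD_AA)}
    table.update(overrides or {})
    return table


TABLE_4 = codon_table({"TGA": "W"})  # translation table 4: TGA = Trp


def find_orfs(seq, table, starts=("ATG", "GTG"), min_codons=3):
    """Yield (start, end, protein): 1-based inclusive, stop codon included, forward strand only."""
    seq = seq.upper()
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon in starts:
                start = i  # first start codon after the previous stop
            elif start is not None and table.get(codon, "X") == "*":
                if (i - start) // 3 >= min_codons:
                    body = "".join(table.get(seq[j:j + 3], "X")
                                   for j in range(start + 3, i, 3))
                    yield start + 1, i + 3, "M" + body
                start = None


if __name__ == "__main__":
    dna = "ATGAAATGATTTTAA"
    print(list(find_orfs(dna, TABLE_4)))
    print(list(find_orfs(dna, codon_table(), min_codons=2)))
```

The two calls show why the genetic code has to be known first: the same sequence gives a longer ORF under table 4 than under the standard code, where TGA terminates translation.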
3. Annotation of protein coding genes and introns
• Transcription data for the gene region, as well as the presence of regulatory elements, help to confirm the prediction (in the case of bacteria: ribosomal binding site at 5’; terminator sequence at 3’; upstream promoters …)
• If these genes contain introns, exons may be identified in two ways:
  – By comparing the gene region with transcript sequences (which do not contain introns)
  – By inferring the exon-intron structure from sequence similarities of exons, from intron features such as conserved splice site motifs, and from any other feature that is known to define a gene in a given group of organisms (‘gene models’ and ‘intron models’)
3. Annotation of protein coding genes and introns
If genes contain introns, exon/intron boundaries (nucleus, eukaryotes) may be identified by
conserved splice site motifs (intron models). For other intron types, use respective models.
3. Annotation of protein coding genes and introns
M. Yandell and D. Ence (2012) Nature Reviews Genetics 13: 329
4. Prediction of structured RNA genes:
a comparison of RNAmotif with ERPIN
Features of structured RNAs:
• primary sequence conservation
• secondary structure
• tertiary interactions
• site-wise conservation may be highly variable, and not similar enough to find with BLAST
… follows example from RNase P RNA …
Mitochondrial RNase P RNA is highly conserved in pairing P4, the reactive center of the
molecule, with respect to its bacterial counterparts. Yet even the conserved sequence
motifs (red) vary too much in most of the known genes to be identified with
BLAST.
Examples of rnpB gene sequences in yeast mitochondria.
How to search most effectively for
mitochondrial RNase P RNAs ?
Method 1: search for the conserved primary sequence
motif only, using regular expressions
Mitochondrial primary consensus sequence – most conservation is close to P4
P4 – helical interaction
Corresponding regular expression:
[AT]G[GA]NAA[GA]T[TC][ATC][GT][GA] ... 73 - 386 ...
… A[CT][AU]NAAN[ATC][TC][AC][GAT][GT][CT]TTA[GAT]
As it turns out, primary sequence conservation is weak, and just ~50% of
currently known sequences are found with this information.
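The primary-sequence search can be reproduced with an ordinary regular expression engine. The sketch below makes three assumptions that are not stated on the slide: `N` is read as any base, `U` is treated as `T` because the target is DNA, and `... 73 - 386 ...` is read as a spacer of 73 to 386 nucleotides between the two conserved blocks. Only the forward strand is scanned, and the function names are hypothetical.

```python
import re

LEFT = "[AT]G[GA]NAA[GA]T[TC][ATC][GT][GA]"
RIGHT = "A[CT][AU]NAAN[ATC][TC][AC][GAT][GT][CT]TTA[GAT]"


def to_regex(motif: str) -> str:
    """Assumption: U is written as T in DNA, and N matches any base."""
    return motif.replace("U", "T").replace("N", "[ACGT]")


# Assumption: the two blocks are separated by 73 to 386 nucleotides.
PATTERN = re.compile(to_regex(LEFT) + "[ACGT]{73,386}?" + to_regex(RIGHT))


def find_candidates(seq: str) -> list[tuple[int, int]]:
    """Return 1-based inclusive (start, end) of non-overlapping matches on the given strand."""
    return [(m.start() + 1, m.end()) for m in PATTERN.finditer(seq.upper())]


if __name__ == "__main__":
    # Synthetic test sequence, not a real rnpB gene.
    left, right = "AGGAAAGTTAGG", "ACACAAGATAGGCTTAG"
    print(find_candidates("CCCC" + left + "C" * 100 + right + "CCCC"))
```

A pattern of this kind is all-or-nothing: a single deviation from the consensus at any position makes the match fail.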
How to search most effectively for
mitochondrial RNase P RNAs ?
Method 2: Use both conserved primary sequence plus secondary
structure, united in a structural profile that is translated into an
RNAmotif ‘descriptor’
Structured sequence profile including P4 helical region
(using more sequences than in the primary sequence example)
Translation of this complex structural motif into an RNAmotif descriptor
parms                                                           ### finds mt RNase P RNAs
  wc +=gu;                                                      ### permits global GU
descr
  ss(len=20)                                                    ### 20 flanking nucleotides
  ss(len=5, seq="[GAT][AT]G[GAT]A$")                            ### ss 5' to structure
  h5(len=3, seq="A[GA][GA]",mispair=1,ends='mm')                ### P4-1
  ss(len=1,seq="T$")                                            ### T bulge
  h5(len=5,seq="[TC][ATC][GAT][GAT].$",mispair=2,ends='mm')     ### P4-2
  ss(minlen=50, maxlen=1000,seq="[AC]C[ATC].[GA]A$")            ### P4 loop
  h3 (seq="[ATC][ATC][ATC][GAT][GT]$")                          ### P4-1'
  h3 (seq="[GTC][TC]T$")                                        ### P4-2'
  ss(len=1,seq="A")                                             ### universal A
  ss(len=20)                                                    ### 20 flanking nt
This descriptor finds four false positives in a collection of 9 mtDNAs with RNase P RNA, and
misses one solution: it lacks both sensitivity and specificity.
How to search most effectively for
mitochondrial RNase P RNAs ?
Method 3: Use both conserved primary sequence plus secondary
structure, united in a training set with all known sequences
aligned, plus a corresponding structural line:
to be used for ERPIN searches.
Translate the structural alignment into the ERPIN format
… however, it is ‘a bit’ cryptic …
[Figure: structural alignment in ERPIN format; element numbers in the structure line: 1 2 3 4 5 6 7 5 3 8 9]
The GDE editor helps here: it offers color coding and is coupled to a tool
that translates the alignment into ERPIN format
ERPIN then calculates RNA primary and secondary structure
profiles from the sequence alignment that are matched to the
target sequence. Probabilistic search taking into account
nucleotide frequencies.
Much of the algorithm’s efficiency stems from the use of user-defined, precisely delimited structural elements that can be
searched individually or in combination, and from the option to
use a defined search order (‘search strategy’).
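The idea of a profile derived from an alignment and matched against a target can be illustrated with a plain position weight matrix. The sketch below covers only a single-stranded element scored with log-odds against a background composition. It is a generic illustration, not ERPIN's actual algorithm (which also builds base-pair profiles for helices), and all names, the pseudocount and the uniform background are assumptions.

```python
import math
from collections import Counter


def build_profile(aligned, background=None, pseudocount=1.0):
    """Log-odds score per column and base, from ungapped aligned sequences of equal length."""
    background = background or {b: 0.25 for b in "ACGT"}
    profile = []
    for column in zip(*aligned):
        counts = Counter(column)
        total = len(column) + 4 * pseudocount
        profile.append({b: math.log2((counts[b] + pseudocount) / total / background[b])
                        for b in "ACGT"})
    return profile


def best_hit(profile, target):
    """Return (score, 1-based start) of the best-scoring window; target must contain only ACGT."""
    width = len(profile)
    scores = [(sum(col[base] for col, base in zip(profile, target[i:i + width])), i + 1)
              for i in range(len(target) - width + 1)]
    return max(scores)


if __name__ == "__main__":
    training = ["AGGAA", "TGGAA", "AGAAA"]  # toy alignment, not real data
    score, position = best_hit(build_profile(training), "CCCCAGGAACCCC")
    print(position, round(score, 2))
```

Unlike a regular expression, a profile of this kind degrades gradually: a target that deviates at one position still receives a score, only a lower one.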
ERPIN results
Note the E-values: the number of times a given structural motif is expected to occur by chance in a target
database of given size and nucleotide composition. Values of 1e-2 and smaller can
already be considered ‘safe’ matches, although solutions close to 1e+1 might also be
considered. Results are much superior to RNAmotif: few if any false positives, and some
degree of sequence variance is tolerated, so deviant sequences are found.
A recent, even more powerful probabilistic approach has become available, called Infernal (Sean Eddy).
It combines primary sequence with covariance models (HMM-like inference) and yields slightly better E-values than ERPIN.
A large variety of specific search models is available via the public Rfam database, which is useful for genome annotation.
http://rfam.xfam.org/
Rfam 11.0: 10 years of RNA families. S.W. Burge, J. Daub, R. Eberhardt, J. Tate, L. Barquist, E.P. Nawrocki, S.R. Eddy, P.P. Gardner, A. Bateman. Nucleic Acids Research (2012)
How much structural conservation is required for
meaningful ERPIN searches?
Example: T-stem plus T-loop of tRNAs, to find
matches with E-value better than 5e-2
Results: few if any false positives even in large datasets