3. Sequence preprocessing

Most pipelines work the same way!
Metagenomics Processing
[Figure: pipeline overview. Stages: Preprocessing, Contamination removal, Taxonomic assignments, Binning reads, Contig clustering, Annotation, Statistics, Population genomes. Tools: Prinseq (quality control), Deconseq, FOCUS, Super FOCUS, metabat, crAss, STAMP, Real time metagenomics, mg-rast.]
Metagenomics Processing
Contig clustering
AbundanceBin
CompostBin
concoct
crAss
tetra
Preprocessing
FASTQC
FastX Toolkit
fitGCP
NGS QC Toolkit
Non-pareil
Prinseq
QC-Chain
Streaming Trim
Taxonomic assignment
CARMA
FOCUS
KRAKEN
LMAT
MEGAN
Metaplan
myTaxa
PhylopythiaS
phymmbl
RAIphy
TACOA
Taxy
Gene Prediction
FragGeneScan
GlimmerMG
MetaGeneAnnotator
MetaGeneMark
MetaGun
Orphelia
Prodigal
Functional assignment
CLAMS
Sequedex
DiScRIBinATE
SORTITEMS
genometa
SPANNER
GSMer
SPHINX
PPLACER
TaxSOM
RTMg
Treephyler
Preprocessing Data
Rob Schmieder
[Figure: bad vs. good data analysis. Workflow boxes: New dataset, Quality control & Preprocessing, Assembly, Similarity search.]
3 Tools for metagenomic data
http://prinseq.sourceforge.net
http://tagcleaner.sourceforge.net
http://deconseq.sourceforge.net
Quality control and data preprocessing
http://edwards.sdsu.edu/prinseq
Number and length of sequences
[Figure panels: Bad, Good]
Reads should be approximately the same length (same number of cycles)
→ Short reads are likely lower quality
Linearly degrading quality across the read
Trim low quality ends
High quality throughout the sequence
Good quality through the length of the sequence
Sequence quality falls off quickly
→ Bad sequence data
Ion quality scores
Low quality sequence issues
Most assemblers or aligners do not take quality scores into account
Errors in reads complicate assembly, might cause misassembly, or make assembly impossible
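
Trimming low quality ends, as suggested above for reads whose quality degrades along their length, can be sketched as follows. This is a minimal illustration, not the algorithm of any particular tool: it assumes Phred+33 quality strings, and the function names and the `min_q=20` cutoff are placeholders chosen for the example.

```python
def phred_scores(quality, offset=33):
    """Decode a FASTQ quality string into integer scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality]


def trim_3prime(seq, quality, min_q=20, offset=33):
    """Drop bases from the 3' end while their score is below min_q.

    min_q=20 is an illustrative cutoff, not a recommendation from the slides.
    """
    scores = phred_scores(quality, offset)
    end = len(scores)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], quality[:end]


# 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
trimmed_seq, trimmed_qual = trim_3prime("ACGTACGT", "IIIII###")
print(trimmed_seq, trimmed_qual)
```

Trimming only from the 3' end keeps the example short; a sliding-window rule or 5' trimming would follow the same pattern.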
What if quality scores are not available?
Alternative:
Infer quality from the percent of Ns found in the sequence
Remove regions with a high number of Ns
Huse et al. found that the presence of any ambiguous base calls was a sign of overall poor sequence quality
Huse et al.: Accuracy and quality of massively parallel DNA pyrosequencing. Genome Biology (2007)
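
When no quality scores are available, the percent of Ns can stand in as a quality proxy. A minimal sketch, assuming reads are plain strings; the function names and the `max_percent_n` parameter are made up for this example:

```python
def percent_n(seq):
    """Percentage of ambiguous base calls (N) in a sequence."""
    if not seq:
        return 0.0
    return 100.0 * seq.upper().count("N") / len(seq)


def filter_by_n(reads, max_percent_n=0.0):
    """Keep reads whose N content does not exceed max_percent_n.

    The default of 0.0 discards every read that contains an N.
    """
    return [read for read in reads if percent_n(read) <= max_percent_n]


reads = ["ACGTACGT", "ACGNNCGT", "NNNNNNNN"]
print(filter_by_n(reads))                     # strict: no Ns allowed
print(filter_by_n(reads, max_percent_n=25))   # tolerate up to 25% Ns
```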
Ambiguous bases
If you can afford the loss, filter out all reads containing Ns
Assemblers (e.g. Velvet) and aligners (SSAHA2, BWA, …) use a 2-bit encoding system for nucleotides
– some replace Ns with a random base, some with a fixed base (e.g. SSAHA2 & Velvet = A)
2-bit example: 00 – A, 01 – C, 10 – G, 11 – T
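
The 2-bit scheme above leaves no code for N, which is why an N has to be replaced by some real base. A small sketch of that effect, using the A/C/G/T codes from the example; the function names are hypothetical, and the code does not reproduce the internals of Velvet, SSAHA2 or BWA:

```python
CODES = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASES = "ACGT"


def encode_2bit(seq, n_replacement="A"):
    """Pack a sequence into an integer, 2 bits per base.

    Any character without a 2-bit code (e.g. N) is stored as n_replacement.
    """
    value = 0
    for base in seq.upper():
        value = (value << 2) | CODES.get(base, CODES[n_replacement])
    return value


def decode_2bit(value, length):
    """Unpack `length` bases from an integer produced by encode_2bit."""
    bases = []
    for _ in range(length):
        bases.append(BASES[value & 0b11])
        value >>= 2
    return "".join(reversed(bases))


packed = encode_2bit("ANGT")
print(decode_2bit(packed, 4))  # the N comes back as A: information is lost
```

Because the replacement is invisible after encoding, downstream tools treat the substituted A as a real base call.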
Quality filtering
Any region with a homopolymer will tend to have a lower quality score
Huse et al. found that sequences with an average score below 25 had more errors than those with higher averages
Huse et al.: Accuracy and quality of massively parallel DNA pyrosequencing. Genome Biology (2007)
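
A filter on the average score can be sketched as below. The threshold of 25 is the value reported by Huse et al. on the slide; the Phred+33 decoding and the function names are assumptions made for this example.

```python
def mean_quality(quality, offset=33):
    """Average score of a FASTQ quality string (Phred+33 assumed)."""
    scores = [ord(ch) - offset for ch in quality]
    return sum(scores) / len(scores) if scores else 0.0


def passes_mean_quality(quality, min_mean=25, offset=33):
    """True if the read's average score is at least min_mean."""
    return mean_quality(quality, offset) >= min_mean


# 'I' is Q40 and '#' is Q2 under Phred+33.
for qual in ["IIIIIIII", "IIII####", "I#"]:
    print(qual, mean_quality(qual), passes_mean_quality(qual))
```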
Sequence duplicates
Real or artificial duplicate?
Metagenomics = random sampling of genomic material
Why do reads start at the same position?
Why do these reads have the same errors?
No specific pattern or location on sequencing plate
Gomez-Alvarez et al.: Systematic artifacts in metagenomes from complex microbial communities. ISME (2009)
One micro-reactor – Many beads
Martine Yerle (Laboratory of Cellular Genetics, INRA, France)
Impacts of duplicates
False variant (SNP) calling
Require more computing resources
– Find similar database sequences for same query sequence
– Assembly process takes longer
– Increase in memory requirements
Abundance or expression measures can be wrong
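
Collapsing exact duplicates is the simplest way to limit these effects. The sketch below only handles reads that are identical over their full length and is not the procedure of any specific tool; whether a duplicate is real or artificial still has to be judged separately. The function name is hypothetical.

```python
def remove_exact_duplicates(reads):
    """Return unique reads in first-seen order and the copy count of each."""
    counts = {}
    for read in reads:
        counts[read] = counts.get(read, 0) + 1
    # dicts keep insertion order, so the keys are the reads in first-seen order
    return list(counts), counts


reads = [
    "GTGTTGTGTACATGAACACAG",
    "TGAACACAGTCTATGAGCATA",
    "TGAACACAGTCTATGAGCATA",
    "TGAACACAGTCTATGAGCATA",
]
unique, counts = remove_exact_duplicates(reads)
print(len(reads), "reads ->", len(unique), "unique")
```

Keeping the counts makes it possible to report how much of a dataset was duplicated before deciding to discard anything.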
False variant (SNP) calling
Reference
...ACCACACGTGTTGTGTACATGAACACAGTATATGAGCATACAGAT...
          GTGTTGTGTACATGAACACAGTATATGAGCATACAGAT...
               GTGTACATGAACACAGTATATGAGCATACAGAT...
                      TGAACACAGTCTATGAGCATACAGAT...
                      TGAACACAGTCTATGAGCATACAGAT...
                      TGAACACAGTCTATGAGCATACAGAT...
                      TGAACACAGTCTATGAGCATACAGAT...
                      TGAACACAGTCTATGAGCATACAGAT...
Detect and remove tag sequences
http://edwards.sdsu.edu/tagcleaner
No tag
MID tag
WTA tags
Imperfect primer annealing
Fragment-to-fragment concatenations
Data upload
Tag sequence definition
Tag sequence prediction
Parameter definition
Download results
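
Removing a known tag from the 5' end can be sketched as follows. This is a simplified stand-in, not TagCleaner's algorithm: it only tolerates substitutions, not insertions or deletions, and the tag sequence, function names and `max_mismatches` parameter are hypothetical.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def trim_5prime_tag(seq, tag, max_mismatches=1):
    """Remove `tag` from the 5' end if it matches with few enough substitutions."""
    prefix = seq[:len(tag)]
    if len(prefix) == len(tag) and hamming(prefix, tag) <= max_mismatches:
        return seq[len(tag):]
    return seq


TAG = "ACGAGTGCGT"  # illustrative tag sequence only
print(trim_5prime_tag("ACGAGTGCGTTTGACCA", TAG))   # exact tag
print(trim_5prime_tag("ACGAGTGCGATTGACCA", TAG))   # tag with one mismatch
print(trim_5prime_tag("TTGACCATTGACCATTG", TAG))   # no tag, returned unchanged
```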
Identification and removal of sequence contamination
http://edwards.sdsu.edu/deconseq
Contaminant identification
Previous methods had critical limitations
Dinucleotide relative abundance uses information content in sequences → cannot identify single contaminant sequences
Sequence similarity seems to be the only reliable option to identify single contaminant sequences
– BLAST against human reference genome is slow and lacks corresponding regions (gaps, variants, …)
– Novel sequences in every new human genome sequenced*
* Li et al.: Building the sequence map of the human pan-genome. Nature Biotechnology (2010)
Faster algorithms for Next-gen data
Principal component analysis (PCA) of dinucleotide relative abundance
Microbial metagenomes
Viral metagenomes
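
The 16-value profile that such a PCA would take as input per metagenome can be sketched as below. It assumes the common definition rho_xy = f_xy / (f_x * f_y), computed on one strand only; published variants also fold in the reverse complement, which is omitted here. The function name is hypothetical.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"


def dinucleotide_relative_abundance(seq):
    """Return {dinucleotide: observed / expected frequency} for one sequence."""
    seq = seq.upper()
    mono = Counter(b for b in seq if b in ALPHABET)
    di = Counter(
        seq[i:i + 2]
        for i in range(len(seq) - 1)
        if set(seq[i:i + 2]) <= set(ALPHABET)  # skip pairs containing N etc.
    )
    n_mono = sum(mono.values())
    n_di = sum(di.values())
    rho = {}
    for x, y in product(ALPHABET, repeat=2):
        expected = (mono[x] / n_mono) * (mono[y] / n_mono) if n_mono else 0.0
        observed = di[x + y] / n_di if n_di else 0.0
        rho[x + y] = observed / expected if expected else 0.0
    return rho


profile = dinucleotide_relative_abundance("ACACGTGTTGTGTACATGAACACAGTATATGAGC")
print(len(profile), "values, e.g. rho[CG] =", round(profile["CG"], 3))
```

A single short read yields very few dinucleotide counts, so its profile is noisy, which fits the slide's point that this approach cannot identify single contaminant sequences.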
DeconSeq web interface
Two types of reference databases
Remove
Retain
DeconSeq web interface (cont.)
DeconSeq
Identity = How similar is the query sequence to the reference sequence
Coverage = How much of query sequence is similar to reference sequence
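
Identity and coverage can be computed from a pairwise alignment as sketched below. The formulas are one common convention (matches over alignment length, aligned query bases over query length) and are an assumption here, not necessarily how DeconSeq computes them; the function names and the 90% thresholds are placeholders.

```python
def identity_and_coverage(aligned_query, aligned_ref, query_length):
    """Percent identity and percent query coverage of one alignment.

    aligned_query and aligned_ref are equal-length strings with '-' for gaps.
    """
    if len(aligned_query) != len(aligned_ref):
        raise ValueError("aligned strings must have equal length")
    matches = sum(q == r and q != "-" for q, r in zip(aligned_query, aligned_ref))
    query_bases = sum(q != "-" for q in aligned_query)
    identity = 100.0 * matches / len(aligned_query) if aligned_query else 0.0
    coverage = 100.0 * query_bases / query_length if query_length else 0.0
    return identity, coverage


def looks_like_contaminant(identity, coverage, min_identity=90.0, min_coverage=90.0):
    """Flag a read that matches the 'remove' database on both measures.

    Threshold defaults are illustrative placeholders.
    """
    return identity >= min_identity and coverage >= min_coverage


# A 10-base query of which the first 8 bases align, with one mismatch.
ident, cov = identity_and_coverage("ACGTACGT", "ACGTTCGT", query_length=10)
print(ident, cov, looks_like_contaminant(ident, cov))
```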
DeconSeq
Blue = More similar to “retain”
Red = More similar to “remove”
Human DNA contamination identified in 145 out of 202 metagenomes
Pairing Data
Two types of paired ends
– Mate pairs
– Paired end reads
Repeats
[Figure labels: A, B, C; Paired end reads or mate pairs]
Mate pair sequencing
[Figure steps: Add linkers, Nick migration, Sequencing]
Paired end sequencing
Tagmentation: biological fragmentation using a transposon and discontinuous DNA primers
Covaris ultrasonicator: physical fragmentation of DNA using sonication
Short reads
Long reads
Joining paired ends
Counting abundance: join, but keep one end of singletons
Assembling: do not join (assembler will do it)
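
For the abundance-counting case, joining can be sketched as an exact suffix/prefix overlap between read 1 and the reverse complement of read 2. Real joiners tolerate mismatches and use quality scores; this version does not. Treating "singletons" as pairs that cannot be joined, and keeping read 1 for them, is an interpretation of the slide; the function names and `min_overlap` are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def join_pair(read1, read2, min_overlap=10):
    """Join a read pair on its longest exact overlap, or return None.

    min_overlap must be at least 1.
    """
    rc2 = reverse_complement(read2)
    for k in range(min(len(read1), len(rc2)), min_overlap - 1, -1):
        if read1[-k:] == rc2[:k]:
            return read1 + rc2[k:]
    return None


def reads_for_abundance(pairs, min_overlap=10):
    """Join each pair where possible; otherwise keep only read 1."""
    result = []
    for read1, read2 in pairs:
        joined = join_pair(read1, read2, min_overlap)
        result.append(joined if joined is not None else read1)
    return result


fragment = "AAACCCGGGTTTACGTAGCTAGGATCC"
read1 = fragment[:18]
read2 = reverse_complement(fragment[-18:])
print(join_pair(read1, read2, min_overlap=5) == fragment)
```

Counting one sequence per pair avoids counting the same fragment twice, which is the concern behind keeping only one end of pairs that cannot be joined.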